## AdaBound

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

AdaBound is a variant of Adam which employs dynamic bounds on learning rates.

#### Constructors

 * `AdaBound()`
 * `AdaBound(`_`stepSize, batchSize`_`)`
 * `AdaBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle`_`)`
 * `AdaBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

Note that the `AdaBound` class is based on the `AdaBoundType<`_`UpdateRule`_`>`
class with _`UpdateRule`_` = AdaBoundUpdate`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`finalLr`** | The final (SGD) learning rate. | `0.1` |
| `double` | **`gamma`** | The convergence speed of the bound functions. | `0.001` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

The attributes of the optimizer may also be modified via the member methods
`FinalLr()`, `Gamma()`, `StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`,
`Eps()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and
`ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
SphereFunction f(2);
arma::mat coordinates = f.GetInitialPoint();

AdaBound optimizer(0.001, 2, 0.1, 1e-3, 0.9, 0.999, 1e-8, 500000, 1e-3);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Adaptive Gradient Methods with Dynamic Bound of Learning Rate](https://arxiv.org/abs/1902.09843)
 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
 * [Differentiable separable functions](#differentiable-separable-functions)

## AdaDelta

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

AdaDelta is an extension of [AdaGrad](#adagrad) that adapts learning rates
based on a moving window of gradient updates, instead of accumulating all past
gradients. Instead of accumulating all past squared gradients, the sum of
gradients is recursively defined as a decaying average of all past squared
gradients.
#### Constructors

 * `AdaDelta()`
 * `AdaDelta(`_`stepSize`_`)`
 * `AdaDelta(`_`stepSize, batchSize`_`)`
 * `AdaDelta(`_`stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle`_`)`
 * `AdaDelta(`_`stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `1.0` |
| `size_t` | **`batchSize`** | Number of points to process in one step. | `32` |
| `double` | **`rho`** | Smoothing constant, corresponding to the fraction of the gradient to keep at each time step. | `0.95` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-6` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be changed via the member methods
`StepSize()`, `BatchSize()`, `Rho()`, `Epsilon()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Adadelta - an adaptive learning rate method](https://arxiv.org/abs/1212.5701)
 * [AdaGrad](#adagrad)
 * [Differentiable separable functions](#differentiable-separable-functions)

## AdaGrad

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

AdaGrad is an optimizer with parameter-specific learning rates, which are
adapted relative to how frequently a parameter gets updated during training.
It performs larger updates for more sparse parameters and smaller updates for
less sparse parameters.

#### Constructors

 - `AdaGrad()`
 - `AdaGrad(`_`stepSize`_`)`
 - `AdaGrad(`_`stepSize, batchSize`_`)`
 - `AdaGrad(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle`_`)`
 - `AdaGrad(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Number of points to process in one step. | `32` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be changed via the member methods
`StepSize()`, `BatchSize()`, `Epsilon()`, `MaxIterations()`, `Tolerance()`,
`Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
AdaGrad optimizer(1.0, 1, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
 * [AdaGrad in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad)
 * [AdaDelta](#adadelta)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Adam

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Adam is an algorithm for first-order gradient-based optimization of
stochastic objective functions, based on adaptive estimates of lower-order
moments.

#### Constructors

 * `Adam()`
 * `Adam(`_`stepSize, batchSize`_`)`
 * `Adam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
 * `Adam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

Note that the `Adam` class is based on the `AdamType<`_`UpdateRule`_`>` class
with _`UpdateRule`_` = AdamUpdate`.
#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the second moment estimates. | `0.999` |
| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Adam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
 * [Differentiable separable functions](#differentiable-separable-functions)

## AdaMax

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

AdaMax is simply a variant of Adam based on the infinity norm.

#### Constructors

 * `AdaMax()`
 * `AdaMax(`_`stepSize, batchSize`_`)`
 * `AdaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
 * `AdaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, exactObjective, resetPolicy`_`)`

Note that the `AdaMax` class is based on the `AdamType<`_`UpdateRule`_`>` class
with _`UpdateRule`_` = AdaMaxUpdate`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, `ExactObjective()`, and `ResetPolicy()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

AdaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980) (see section 7)
 * [Differentiable separable functions](#differentiable-separable-functions)

## AMSBound

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

AMSBound is a variant of Adam which employs dynamic bounds on learning rates.
#### Constructors

 * `AMSBound()`
 * `AMSBound(`_`stepSize, batchSize`_`)`
 * `AMSBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle`_`)`
 * `AMSBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

Note that the `AMSBound` class is based on the `AdaBoundType<`_`UpdateRule`_`>`
class with _`UpdateRule`_` = AMSBoundUpdate`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`finalLr`** | The final (SGD) learning rate. | `0.1` |
| `double` | **`gamma`** | The convergence speed of the bound functions. | `0.001` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

The attributes of the optimizer may also be modified via the member methods
`FinalLr()`, `Gamma()`, `StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`,
`Eps()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and
`ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
SphereFunction f(2);
arma::mat coordinates = f.GetInitialPoint();

AMSBound optimizer(0.001, 2, 0.1, 1e-3, 0.9, 0.999, 1e-8, 500000, 1e-3);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Adaptive Gradient Methods with Dynamic Bound of Learning Rate](https://arxiv.org/abs/1902.09843)
 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
 * [Differentiable separable functions](#differentiable-separable-functions)

## AMSGrad

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

AMSGrad is a variant of Adam with guaranteed convergence.

#### Constructors

 * `AMSGrad()`
 * `AMSGrad(`_`stepSize, batchSize`_`)`
 * `AMSGrad(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
 * `AMSGrad(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, exactObjective, resetPolicy`_`)`

Note that the `AMSGrad` class is based on the `AdamType<`_`UpdateRule`_`>` class
with _`UpdateRule`_` = AMSGradUpdate`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, `ExactObjective()`, and `ResetPolicy()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

AMSGrad optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [On the Convergence of Adam and Beyond](https://openreview.net/forum?id=ryQu7f-RZ)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Augmented Lagrangian

*An optimizer for [differentiable constrained functions](#constrained-functions).*

The `AugLagrangian` class implements the Augmented Lagrangian method of
optimization. In this scheme, a penalty term is added to the Lagrangian.
This method is also called the "method of multipliers". Internally, the
optimizer uses [L-BFGS](#l-bfgs).

#### Constructors

 * `AugLagrangian(`_`maxIterations, penaltyThresholdFactor, sigmaUpdateFactor`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
| `double` | **`penaltyThresholdFactor`** | When the penalty threshold is updated, set it to this multiplied by the penalty. | `0.25` |
| `double` | **`sigmaUpdateFactor`** | When sigma is updated, multiply it by this. | `10.0` |
| `L_BFGS&` | **`lbfgs`** | Internal L-BFGS optimizer. | `L_BFGS()` |

The attributes of the optimizer may also be modified via the member methods
`MaxIterations()`, `PenaltyThresholdFactor()`, `SigmaUpdateFactor()`, and
`LBFGS()`.

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
/**
 * Optimize the function.  The value '1' is used for the initial value of each
 * Lagrange multiplier.  To set the Lagrange multipliers yourself, use the
 * other overload of Optimize().
 *
 * @tparam LagrangianFunctionType Function which can be optimized by this
 *     class.
 * @param function The function to optimize.
 * @param coordinates Output matrix to store the optimized coordinates in.
 */
template<typename LagrangianFunctionType>
bool Optimize(LagrangianFunctionType& function,
              arma::mat& coordinates);

/**
 * Optimize the function, giving initial estimates for the Lagrange
 * multipliers.  The vector of Lagrange multipliers will be modified to
 * contain the Lagrange multipliers of the final solution (if one is found).
 *
 * @tparam LagrangianFunctionType Function which can be optimized by this
 *     class.
 * @param function The function to optimize.
 * @param coordinates Output matrix to store the optimized coordinates in.
 * @param initLambda Vector of initial Lagrange multipliers.  Should have
 *     length equal to the number of constraints.
 * @param initSigma Initial penalty parameter.
 */
template<typename LagrangianFunctionType>
bool Optimize(LagrangianFunctionType& function,
              arma::mat& coordinates,
              const arma::vec& initLambda,
              const double initSigma);
```

</details>

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
GockenbachFunction f;
arma::mat coordinates = f.GetInitialPoint();

AugLagrangian optimizer;
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Augmented Lagrangian method on Wikipedia](https://en.wikipedia.org/wiki/Augmented_Lagrangian_method)
 * [L-BFGS](#l-bfgs)
 * [Constrained functions](#constrained-functions)

## Big Batch SGD

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Big-batch stochastic gradient descent adaptively grows the batch size over time
to maintain a nearly constant signal-to-noise ratio in the gradient
approximation, so the Big Batch SGD optimizer is able to adaptively adjust
batch sizes without user oversight.

#### Constructors

 * `BigBatchSGD<`_`UpdatePolicy`_`>()`
 * `BigBatchSGD<`_`UpdatePolicy`_`>(`_`batchSize`_`)`
 * `BigBatchSGD<`_`UpdatePolicy`_`>(`_`batchSize, stepSize`_`)`
 * `BigBatchSGD<`_`UpdatePolicy`_`>(`_`batchSize, stepSize, batchDelta, maxIterations, tolerance, shuffle, exactObjective`_`)`

The _`UpdatePolicy`_ template parameter refers to the way that a new step size
is computed. The `AdaptiveStepsize` and `BacktrackingLineSearch` classes are
available for use; custom behavior can be achieved by implementing a class
with the same method signatures.

For convenience the following typedefs have been defined:

 * `BBS_Armijo = BigBatchSGD<BacktrackingLineSearch>`
 * `BBS_BB = BigBatchSGD<AdaptiveStepsize>`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`batchSize`** | Initial batch size. | `1000` |
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `double` | **`batchDelta`** | Factor for the batch update step. | `0.1` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be changed via the member methods
`BatchSize()`, `StepSize()`, `BatchDelta()`, `MaxIterations()`, `Tolerance()`,
`Shuffle()`, and `ExactObjective()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Big-Batch SGD with the adaptive stepsize policy.
BBS_BB optimizer(10, 0.01, 0.1, 8000, 1e-4);
optimizer.Optimize(f, coordinates);

// Big-Batch SGD with backtracking line search.
BBS_Armijo optimizer2(10, 0.01, 0.1, 8000, 1e-4);
optimizer2.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Big Batch SGD: Automated Inference using Adaptive Batch Sizes](https://arxiv.org/pdf/1610.05792.pdf)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)

## CMAES

*An optimizer for [separable functions](#separable-functions).*

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a stochastic search
algorithm. CMA-ES is a second-order approach that iteratively estimates a
positive definite covariance matrix of the search distribution.
#### Constructors

 * `CMAES<`_`SelectionPolicyType`_`>()`
 * `CMAES<`_`SelectionPolicyType`_`>(`_`lambda, lowerBound, upperBound`_`)`
 * `CMAES<`_`SelectionPolicyType`_`>(`_`lambda, lowerBound, upperBound, batchSize`_`)`
 * `CMAES<`_`SelectionPolicyType`_`>(`_`lambda, lowerBound, upperBound, batchSize, maxIterations, tolerance, selectionPolicy`_`)`

The _`SelectionPolicyType`_ template parameter refers to the strategy used to
compute the (approximate) objective function. The `FullSelection` and
`RandomSelection` classes are available for use; custom behavior can be achieved
by implementing a class with the same method signatures.

For convenience the following types can be used:

 * **`CMAES<>`** (equivalent to `CMAES<FullSelection>`): uses all separable functions to compute the objective
 * **`ApproxCMAES`** (equivalent to `CMAES<RandomSelection>`): uses a small number of separable functions to compute an approximate objective

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`lambda`** | The population size (0 uses a default size). | `0` |
| `double` | **`lowerBound`** | Lower bound of decision variables. | `-10.0` |
| `double` | **`upperBound`** | Upper bound of decision variables. | `10.0` |
| `size_t` | **`batchSize`** | Batch size to use for the objective calculation. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations. | `1000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `SelectionPolicyType` | **`selectionPolicy`** | Instantiated selection policy used to calculate the objective. | `SelectionPolicyType()` |

Attributes of the optimizer may also be changed via the member methods
`Lambda()`, `LowerBound()`, `UpperBound()`, `BatchSize()`, `MaxIterations()`,
`Tolerance()`, and `SelectionPolicy()`.
The `selectionPolicy` attribute allows an instantiated `SelectionPolicyType` to
be given. The `FullSelection` policy does not need to be instantiated, so the
option is not relevant when the `CMAES<>` optimizer type is being used; the
`RandomSelection` policy has the constructor `RandomSelection(`_`fraction`_`)`,
where _`fraction`_ specifies the fraction of separable functions to use to
estimate the objective function.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// CMAES with the FullSelection policy.
CMAES<> optimizer(0, -1, 1, 32, 200, 1e-4);
optimizer.Optimize(f, coordinates);

// CMAES with the RandomSelection policy.
ApproxCMAES approxOptimizer(0, -1, 1, 32, 200, 1e-4);
approxOptimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Completely Derandomized Self-Adaptation in Evolution Strategies](http://www.cmap.polytechnique.fr/~nikolaus.hansen/cmaartic.pdf)
 * [CMA-ES in Wikipedia](https://en.wikipedia.org/wiki/CMA-ES)
 * [Evolution strategy in Wikipedia](https://en.wikipedia.org/wiki/Evolution_strategy)

## CNE

*An optimizer for [arbitrary functions](#arbitrary-functions).*

Conventional Neural Evolution (CNE) is an optimizer that works like biological
evolution: it selects the best candidates based on their fitness scores and
creates a new generation by mutation and crossover of the population. The
initial population is generated based on a random normal distribution centered
at the given starting point.
653 654#### Constructors 655 656 * `CNE()` 657 * `CNE(`_`populationSize, maxGenerations`_`)` 658 * `CNE(`_`populationSize, maxGenerations, mutationProb, mutationSize`_`)` 659 * `CNE(`_`populationSize, maxGenerations, mutationProb, mutationSize, selectPercent, tolerance`_`)` 660 661#### Attributes 662 663| **type** | **name** | **description** | **default** | 664|----------|----------|-----------------|-------------| 665| `size_t` | **`populationSize`** | The number of candidates in the population. This should be at least 4 in size. | `500` | 666| `size_t` | **`maxGenerations`** | The maximum number of generations allowed for CNE. | `5000` | 667| `double` | **`mutationProb`** | Probability that a weight will get mutated. | `0.1` | 668| `double` | **`mutationSize`** | The range of mutation noise to be added. This range is between 0 and mutationSize. | `0.02` | 669| `double` | **`selectPercent`** | The percentage of candidates to select to become the the next generation. | `0.2` | 670| `double` | **`tolerance`** | The final value of the objective function for termination. If set to negative value, tolerance is not considered. | `1e-5` | 671 672Attributes of the optimizer may also be changed via the member methods 673`PopulationSize()`, `MaxGenerations()`, `MutationProb()`, `SelectPercent()` 674and `Tolerance()`. 675 676#### Examples: 677 678<details open> 679<summary>Click to collapse/expand example code. 
680</summary> 681 682```c++ 683RosenbrockFunction f; 684arma::mat coordinates = f.GetInitialPoint(); 685 686CNE optimizer(200, 10000, 0.2, 0.2, 0.3, 1e-5); 687optimizer.Optimize(f, coordinates); 688``` 689 690</details> 691 692#### See also: 693 694 * [Neuroevolution in Wikipedia](https://en.wikipedia.org/wiki/Neuroevolution) 695 * [Arbitrary functions](#arbitrary-functions) 696 697## DE 698 699*An optimizer for [arbitrary functions](#arbitrary-functions).* 700 701Differential Evolution is an evolutionary optimization algorithm which selects best candidates based on their fitness scores and creates new generation by mutation and crossover of population. 702 703#### Constructors 704 705* `DE()` 706* `DE(`_`populationSize, maxGenerations`_`)` 707* `DE(`_`populationSize, maxGenerations, crossoverRate`_`)` 708* `DE(`_`populationSize, maxGenerations, crossoverRate, differentialWeight`_`)` 709* `DE(`_`populationSize, maxGenerations, crossoverRate, differentialWeight, tolerance`_`)` 710 711#### Attributes 712 713| **type** | **name** | **description** | **default** | 714|----------|----------|-----------------|-------------| 715| `size_t` | **`populationSize`** | The number of candidates in the population. This should be at least 3 in size. | `100` | 716| `size_t` | **`maxGenerations`** | The maximum number of generations allowed for DE. | `2000` | 717| `double` | **`crossoverRate`** | Probability that a candidate will undergo crossover. | `0.6` | 718| `double` | **`differentialWeight`** | Amplification factor for differentiation. | `0.8` | 719| `double` | **`tolerance`** | The final value of the objective function for termination. If set to negative value, tolerance is not considered. | `1e-5` | 720 721Attributes of the optimizer may also be changed via the member methods 722`PopulationSize()`, `MaxGenerations()`, `CrossoverRate()`, `DifferentialWeight()` 723and `Tolerance()`. 724 725#### Examples: 726 727<details open> 728<summary>Click to collapse/expand example code. 
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

DE optimizer(200, 1000, 0.6, 0.8, 1e-5);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Differential Evolution - A simple and efficient adaptive scheme for global optimization over continuous spaces](http://www1.icsi.berkeley.edu/~storn/TR-95-012.pdf)
 * [Differential Evolution in Wikipedia](https://en.wikipedia.org/wiki/Differential_Evolution)
 * [Arbitrary functions](#arbitrary-functions)

## Eve

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Eve is a stochastic gradient based optimization method with locally and globally adaptive learning rates.

#### Constructors

 * `Eve()`
 * `Eve(`_`stepSize, batchSize`_`)`
 * `Eve(`_`stepSize, batchSize, beta1, beta2, beta3, epsilon, clip, maxIterations, tolerance, shuffle`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`beta3`** | Exponential decay rate for relative change. | `0.999` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `double` | **`clip`** | Clipping range to avoid extreme values. | `10` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Beta3()`, `Epsilon()`, `Clip()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, and `ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Eve optimizer(0.001, 32, 0.9, 0.999, 0.999, 1e-8, 10, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Improving Stochastic Gradient Descent with Feedback](https://arxiv.org/pdf/1611.01505.pdf)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Frank-Wolfe

*An optimizer for [differentiable functions](#differentiable-functions) that may also be constrained.*

Frank-Wolfe is a technique to minimize a continuously differentiable convex
function f over a compact convex subset D of a vector space. It is also known
as the conditional gradient method.

#### Constructors

 * `FrankWolfe<`_`LinearConstrSolverType, UpdateRuleType`_`>(`_`linearConstrSolver, updateRule`_`)`
 * `FrankWolfe<`_`LinearConstrSolverType, UpdateRuleType`_`>(`_`linearConstrSolver, updateRule, maxIterations, tolerance`_`)`

The _`LinearConstrSolverType`_ template parameter specifies the constraint
domain D for the problem.
The `ConstrLpBallSolver` and
`ConstrStructGroupSolver<GroupLpBall>` classes are available for use; the former
restricts D to the unit ball of the specified l-p norm. Other constraint types
may be implemented as a class with the same method signatures as either of the
existing classes.

The _`UpdateRuleType`_ template parameter specifies the update rule used by the
optimizer. The `UpdateClassic` and `UpdateLineSearch` classes are available for
use and represent a simple update step rule and a line search based update rule,
respectively. The `UpdateSpan` and `UpdateFullCorrection` classes are also
available and may be used with the `FuncSq` function class (which is a squared
matrix loss).

For convenience the following typedefs have been defined:

 * `OMP` (equivalent to `FrankWolfe<ConstrLpBallSolver, UpdateSpan>`): a solver for the orthogonal matching pursuit problem
 * `StandardFrankWolfe` (equivalent to `FrankWolfe<ConstrLpBallSolver, UpdateClassic>`): the standard Frank-Wolfe algorithm with the solution restricted to lie within the unit ball

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `LinearConstrSolverType` | **`linearConstrSolver`** | Solver for the linear constrained problem. | **n/a** |
| `UpdateRuleType` | **`updateRule`** | Rule for updating the solution in each iteration. | **n/a** |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-10` |

Attributes of the optimizer may also be changed via the member methods
`LinearConstrSolver()`, `UpdateRule()`, `MaxIterations()`, and `Tolerance()`.
843 844#### Examples: 845 846TODO 847 848#### See also: 849 850 * [An algorithm for quadratic programming](https://pdfs.semanticscholar.org/3a24/54478a94f1e66a3fc5d209e69217087acbc0.pdf) 851 * [Frank-Wolfe in Wikipedia](https://en.wikipedia.org/wiki/Frank%E2%80%93Wolfe_algorithm) 852 * [Differentiable functions](#differentiable-functions) 853 854## FTML (Follow the Moving Leader) 855 856*An optimizer for [differentiable separable functions](#differentiable-separable-functions).* 857 858Follow the Moving Leader (FTML) is an optimizer where recent samples are 859weighted more heavily in each iteration, so FTML can adapt more quickly to 860changes. 861 862#### Constructors 863 864 * `FTML()` 865 * `FTML(`_`stepSize, batchSize`_`)` 866 * `FTML(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle`_`)` 867 * `FTML(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)` 868 869#### Attributes 870 871| **type** | **name** | **description** | **default** | 872|----------|----------|-----------------|-------------| 873| `double` | **`stepSize`** | Step size for each iteration. | `0.001` | 874| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` | 875| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` | 876| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` | 877| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` | 878| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` | 879| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` | 880| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. 
| `true` | 881| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` | 882| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` | 883 884The attributes of the optimizer may also be modified via the member methods 885`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Epsilon()`, `MaxIterations()`, 886`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`. 887 888#### Examples 889 890<details open> 891<summary>Click to collapse/expand example code. 892</summary> 893 894```c++ 895RosenbrockFunction f; 896arma::mat coordinates = f.GetInitialPoint(); 897 898FTML optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true); 899optimizer.Optimize(f, coordinates); 900``` 901 902</details> 903 904#### See also: 905 * [Follow the Moving Leader in Deep Learning](http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf) 906 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) 907 * [SGD](#standard-sgd) 908 * [Differentiable separable functions](#differentiable-separable-functions) 909 910## Gradient Descent 911 912*An optimizer for [differentiable functions](#differentiable-functions).* 913 914Gradient Descent is a technique to minimize a function. To find a local minimum 915of a function using gradient descent, one takes steps proportional to the 916negative of the gradient of the function at the current point. 917 918#### Constructors 919 920 * `GradientDescent()` 921 * `GradientDescent(`_`stepSize`_`)` 922 * `GradientDescent(`_`stepSize, maxIterations, tolerance`_`)` 923 924#### Attributes 925 926| **type** | **name** | **description** | **default** | 927|----------|----------|-----------------|-------------| 928| `double` | **`stepSize`** | Step size for each iteration. 
| `0.01` | 929| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` | 930| `size_t` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` | 931 932Attributes of the optimizer may also be changed via the member methods 933`StepSize()`, `MaxIterations()`, and `Tolerance()`. 934 935#### Examples: 936 937<details open> 938<summary>Click to collapse/expand example code. 939</summary> 940 941```c++ 942RosenbrockFunction f; 943arma::mat coordinates = f.GetInitialPoint(); 944 945GradientDescent optimizer(0.001, 0, 1e-15); 946optimizer.Optimize(f, coordinates); 947``` 948 949</details> 950 951#### See also: 952 953 * [Gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Gradient_descent) 954 * [Differentiable functions](#differentiable-functions) 955 956## Grid Search 957 958*An optimizer for [categorical functions](#categorical-functions).* 959 960An optimizer that finds the minimum of a given function by iterating through 961points on a multidimensional grid. 962 963#### Constructors 964 965 * `GridSearch()` 966 967#### Attributes 968 969The `GridSearch` class has no configurable attributes. 970 971**Note**: the `GridSearch` class can only optimize categorical functions where 972*every* parameter is categorical. 973 974#### See also: 975 976 * [Categorical functions](#categorical-functions) (includes an example for `GridSearch`) 977 * [Grid search on Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search) 978 979## Hogwild! (Parallel SGD) 980 981*An optimizer for [sparse differentiable separable functions](#differentiable-separable-functions).* 982 983An implementation of parallel stochastic gradient descent using the lock-free 984HOGWILD! approach. This implementation requires OpenMP to be enabled during 985compilation (i.e., `-fopenmp` specified as a compiler flag). 986 987Note that the requirements for Hogwild! 
are slightly different from those for most
[differentiable separable functions](#differentiable-separable-functions), but it
is often possible to use Hogwild! by implementing `Gradient()` with a template
parameter. See the [sparse differentiable separable
functions](#sparse-differentiable-separable-functions) documentation for more
details.

#### Constructors

 * `ParallelSGD<`_`DecayPolicyType`_`>(`_`maxIterations, threadShareSize`_`)`
 * `ParallelSGD<`_`DecayPolicyType`_`>(`_`maxIterations, threadShareSize, tolerance, shuffle, decayPolicy`_`)`

The _`DecayPolicyType`_ template parameter specifies the policy used to update
the step size after each iteration. The `ConstantStep` class is available for
use. Custom behavior can be achieved by implementing a class with the same
method signatures.

The default type for _`DecayPolicyType`_ is `ConstantStep`, so the shorter type
`ParallelSGD<>` can be used instead of the equivalent
`ParallelSGD<ConstantStep>`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | **n/a** |
| `size_t` | **`threadShareSize`** | Number of datapoints to be processed in one iteration by each thread. | **n/a** |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate the algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `DecayPolicyType` | **`decayPolicy`** | An instantiated step size update policy to use. | `DecayPolicyType()` |

Attributes of the optimizer may also be modified via the member methods
`MaxIterations()`, `ThreadShareSize()`, `Tolerance()`, `Shuffle()`, and
`DecayPolicy()`.

Note that the default value for `decayPolicy` is the default constructor for the
`DecayPolicyType`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
GeneralizedRosenbrockFunction f(50); // 50-dimensional Rosenbrock function.
arma::mat coordinates = f.GetInitialPoint();

ParallelSGD<> optimizer(100000, f.NumFunctions(), 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](https://arxiv.org/abs/1106.5730)
 * [Sparse differentiable separable functions](#sparse-differentiable-separable-functions)

## IQN

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

The Incremental Quasi-Newton (IQN) method belongs to the family of stochastic
and incremental methods that have a cost per iteration independent of n. IQN
iterations are a stochastic version of BFGS iterations that use memory to reduce
the variance of stochastic approximations.

#### Constructors

 * `IQN()`
 * `IQN(`_`stepSize`_`)`
 * `IQN(`_`stepSize, batchSize, maxIterations, tolerance`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Size of each batch. | `10` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |

Attributes of the optimizer may also be changed via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, and `Tolerance()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

IQN optimizer(0.01, 1, 5000, 1e-5);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate](https://arxiv.org/abs/1702.00709)
 * [A Stochastic Quasi-Newton Method for Large-Scale Optimization](https://arxiv.org/abs/1401.7020)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Katyusha

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Katyusha is a direct, primal-only stochastic gradient method which uses a
"negative momentum" on top of Nesterov's momentum. Two types are
available---one that uses a proximal update step, and one that uses the standard
update step.

#### Constructors

 * `KatyushaType<`_`proximal`_`>()`
 * `KatyushaType<`_`proximal`_`>(`_`convexity, lipschitz`_`)`
 * `KatyushaType<`_`proximal`_`>(`_`convexity, lipschitz, batchSize`_`)`
 * `KatyushaType<`_`proximal`_`>(`_`convexity, lipschitz, batchSize, maxIterations, innerIterations, tolerance, shuffle, exactObjective`_`)`

The _`proximal`_ template parameter is a boolean value (`true` or `false`) that
specifies whether or not the proximal update should be used.

For convenience the following typedefs have been defined:

 * `Katyusha` (equivalent to `KatyushaType<false>`): Katyusha with the standard update step
 * `KatyushaProximal` (equivalent to `KatyushaType<true>`): Katyusha with the proximal update step

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`convexity`** | The regularization parameter. | `1.0` |
| `double` | **`lipschitz`** | The Lipschitz constant. | `10.0` |
| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
| `size_t` | **`innerIterations`** | The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. | `0` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be changed via the member methods
`Convexity()`, `Lipschitz()`, `BatchSize()`, `MaxIterations()`,
`InnerIterations()`, `Tolerance()`, `Shuffle()`, and `ExactObjective()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Without proximal update.
Katyusha optimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
optimizer.Optimize(f, coordinates);

// With proximal update.
KatyushaProximal proximalOptimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
proximalOptimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Katyusha: The First Direct Acceleration of Stochastic Gradient Methods](https://arxiv.org/abs/1603.05953)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## L-BFGS

*An optimizer for [differentiable functions](#differentiable-functions).*

L-BFGS is an optimization algorithm in the family of quasi-Newton methods that
approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a
limited amount of computer memory.

#### Constructors

 * `L_BFGS()`
 * `L_BFGS(`_`numBasis, maxIterations`_`)`
 * `L_BFGS(`_`numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials`_`)`
 * `L_BFGS(`_`numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials, minStep, maxStep`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`numBasis`** | Number of memory points to be stored. | `10` |
| `size_t` | **`maxIterations`** | Maximum number of iterations for the optimization (0 means no limit and may run indefinitely). | `10000` |
| `double` | **`armijoConstant`** | Controls the accuracy of the line search routine for determining the Armijo condition. | `1e-4` |
| `double` | **`wolfe`** | Parameter for detecting the Wolfe condition. | `0.9` |
| `double` | **`minGradientNorm`** | Minimum gradient norm required to continue the optimization. | `1e-6` |
| `double` | **`factr`** | Minimum relative function value decrease to continue the optimization. | `1e-15` |
| `size_t` | **`maxLineSearchTrials`** | The maximum number of trials for the line search (before giving up). | `50` |
| `double` | **`minStep`** | The minimum step of the line search. | `1e-20` |
| `double` | **`maxStep`** | The maximum step of the line search. | `1e20` |

Attributes of the optimizer may also be changed via the member methods
`NumBasis()`, `MaxIterations()`, `ArmijoConstant()`, `Wolfe()`,
`MinGradientNorm()`, `Factr()`, `MaxLineSearchTrials()`, `MinStep()`, and
`MaxStep()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

L_BFGS optimizer(20);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [The solution of nonlinear finite element equations](https://onlinelibrary.wiley.com/doi/full/10.1002/nme.1620141104)
 * [Updating Quasi-Newton Matrices with Limited Storage](https://www.jstor.org/stable/2006193)
 * [Limited-memory BFGS in Wikipedia](https://en.wikipedia.org/wiki/Limited-memory_BFGS)
 * [Differentiable functions](#differentiable-functions)

## Lookahead

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Lookahead is a stochastic gradient based optimization method which chooses a
search direction by looking ahead at the sequence of "fast weights" generated
by another optimizer.
1228 1229#### Constructors 1230 * `Lookahead<>()` 1231 * `Lookahead<>(`_`stepSize`_`)` 1232 * `Lookahead<>(`_`stepSize, k`_`)` 1233 * `Lookahead<>(`_`stepSize, k, maxIterations, tolerance, decayPolicy, exactObjective`_`)` 1234 * `Lookahead<>(`_`baseOptimizer, stepSize, k, maxIterations, tolerance, decayPolicy, exactObjective`_`)` 1235 1236Note that `Lookahead<>` is based on the templated type 1237`LookaheadType<`_`BaseOptimizerType, DecayPolicyType`_`>` with _`BaseOptimizerType`_` = Adam` and _`DecayPolicyType`_` = NoDecay`. 1238 1239Any optimizer that implements the differentiable separable functions interface 1240can be paired with the `Lookahead` optimizer. 1241 1242#### Attributes 1243 1244| **type** | **name** | **description** | **default** | 1245|----------|----------|-----------------|-------------| 1246| `BaseOptimizerType` | **`baseOptimizer`** | Optimizer for the forward step. | Adam | 1247| `double` | **`stepSize`** | Step size for each iteration. | `0.5` | 1248| `size_t` | **`k`** | The synchronization period. | `5` | 1249| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` | 1250| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` | 1251| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` | 1252| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` | 1253 1254The attributes of the optimizer may also be modified via the member methods 1255`BaseOptimizer()`, `StepSize()`, `K()`, `MaxIterations()`, 1256`Tolerance()`, `DecayPolicy()` and `ExactObjective()`. 1257 1258#### Examples 1259 1260<details open> 1261<summary>Click to collapse/expand example code. 
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Lookahead<> optimizer(0.5, 5, 100000, 1e-5);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Lookahead Optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610)
 * [Differentiable separable functions](#differentiable-separable-functions)

## LRSDP (low-rank SDP solver)

*An optimizer for [semidefinite programs](#semidefinite-programs).*

LRSDP is the implementation of Monteiro and Burer's formulation of low-rank
semidefinite programs (LR-SDP). This solver uses the augmented Lagrangian
optimizer to solve low-rank semidefinite programs.

The assumption here is that the solution matrix for the SDP is low-rank. If
this assumption is not true, the algorithm should not be expected to converge.

#### Constructors

 * `LRSDP<`_`SDPType`_`>()`

The _`SDPType`_ template parameter specifies the type of SDP to solve. The
`SDP<arma::mat>` and `SDP<arma::sp_mat>` classes are available for use; these
represent SDPs with dense and sparse `C` matrices, respectively. The `SDP<>`
class is detailed in the [semidefinite program
documentation](#semidefinite-programs).

Once the `LRSDP<>` object is constructed, the SDP may be specified by calling
the `SDP()` member method, which returns a reference to the _`SDPType`_.

#### Attributes

The attributes of the LRSDP optimizer may only be accessed via member methods.

| **type** | **method name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`MaxIterations()`** | Maximum number of iterations before termination. | `1000` |
| `AugLagrangian` | **`AugLag()`** | The internally-held Augmented Lagrangian optimizer. | **n/a** |

#### See also:

 * [A Nonlinear Programming Algorithm for Solving Semidefinite Programs via Low-rank Factorization](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.682.1520&rep=rep1&type=pdf)
 * [Semidefinite programming on Wikipedia](https://en.wikipedia.org/wiki/Semidefinite_programming)
 * [Semidefinite programs](#semidefinite-programs) (includes example usage of `PrimalDualSolver`)

## Momentum SGD

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Stochastic Gradient Descent is a technique for minimizing a function which
can be expressed as a sum of other functions. This is an SGD variant that uses
momentum for its updates. Using momentum updates for parameter learning can
accelerate the rate of convergence, specifically in cases where the surface
curves much more steeply in one direction than in another (a steep, hilly
terrain with high curvature).

#### Constructors

 * `MomentumSGD()`
 * `MomentumSGD(`_`stepSize, batchSize`_`)`
 * `MomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle`_`)`
 * `MomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy, decayPolicy, resetPolicy, exactObjective`_`)`

Note that `MomentumSGD` is based on the templated type
`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
MomentumUpdate` and _`DecayPolicyType`_` = NoDecay`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `MomentumUpdate` | **`updatePolicy`** | An instantiated `MomentumUpdate`. | `MomentumUpdate()` |
| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`,
`UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and `ExactObjective()`.

Note that the `MomentumUpdate` class has the constructor
`MomentumUpdate(`_`momentum`_`)` with a default value of `0.5` for the momentum.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Standard SGD](#standard-sgd)
 * [Nesterov Momentum SGD](#nesterov-momentum-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Nadam

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Nadam is a variant of Adam based on NAG (Nesterov accelerated gradient). It
uses Nesterov momentum for faster convergence.

#### Constructors

 * `Nadam()`
 * `Nadam(`_`stepSize, batchSize`_`)`
 * `Nadam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
 * `Nadam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy`_`)`

Note that the `Nadam` class is based on the `AdamType<`_`UpdateRule`_`>` class
with _`UpdateRule`_` = NadamUpdate`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, and `ResetPolicy()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Nadam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
 * [Differentiable separable functions](#differentiable-separable-functions)

## NadaMax

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

NadaMax is a variant of AdaMax based on NAG (Nesterov accelerated gradient). It
uses Nesterov momentum for faster convergence.

#### Constructors

 * `NadaMax()`
 * `NadaMax(`_`stepSize, batchSize`_`)`
 * `NadaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
 * `NadaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy`_`)`

Note that the `NadaMax` class is based on the `AdamType<`_`UpdateRule`_`>` class
with _`UpdateRule`_` = NadaMaxUpdate`.
1459 1460#### Attributes 1461 1462| **type** | **name** | **description** | **default** | 1463|----------|----------|-----------------|-------------| 1464| `double` | **`stepSize`** | Step size for each iteration. | `0.001` | 1465| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` | 1466| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` | 1467| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` | 1468| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` | 1469| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` | 1470| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` | 1471| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` | 1472| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` | 1473 1474The attributes of the optimizer may also be modified via the member methods 1475`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`, 1476`Tolerance()`, `Shuffle()`, and `ResetPolicy()`. 1477 1478#### Examples 1479 1480<details open> 1481<summary>Click to collapse/expand example code. 
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

NadaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Nesterov Momentum SGD

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Stochastic Gradient Descent is a technique for minimizing a function which
can be expressed as a sum of other functions. This is an SGD variant that uses
Nesterov momentum for its updates. Applying Nesterov momentum can accelerate
the rate of convergence to O(1/k^2).

#### Constructors

 * `NesterovMomentumSGD()`
 * `NesterovMomentumSGD(`_`stepSize, batchSize`_`)`
 * `NesterovMomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle`_`)`
 * `NesterovMomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy, decayPolicy, resetPolicy, exactObjective`_`)`

Note that `NesterovMomentumSGD` is based on the templated type
`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
NesterovMomentumUpdate` and _`DecayPolicyType`_` = NoDecay`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `NesterovMomentumUpdate` | **`updatePolicy`** | An instantiated `NesterovMomentumUpdate`. | `NesterovMomentumUpdate()` |
| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`,
`UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and `ExactObjective()`.

Note that the `NesterovMomentumUpdate` class has the constructor
`NesterovMomentumUpdate(`_`momentum`_`)` with a default value of `0.5` for the
momentum.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

NesterovMomentumSGD optimizer(0.01, 32, 100000, 1e-5, true,
                              NesterovMomentumUpdate(0.5));
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Standard SGD](#standard-sgd)
 * [Momentum SGD](#momentum-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## MOEA/D-DE

*An optimizer for arbitrary multi-objective functions.*

MOEA/D-DE (Multi Objective Evolutionary Algorithm based on Decomposition - Differential Evolution) is a
multi-objective optimization algorithm. It works by decomposing the problem into a number of scalar
optimization subproblems which are solved simultaneously per generation. MOEA/D in itself is a framework;
this particular algorithm uses Differential Crossover followed by Polynomial Mutation to create offspring,
which are then evaluated against the decomposed single-objective subproblems. A diversity-preserving
mechanism is also employed which encourages a varied set of solutions.

#### Constructors

* `MOEAD<`_`InitPolicyType, DecompPolicyType`_`>()`
* `MOEAD<`_`InitPolicyType, DecompPolicyType`_`>(`_`populationSize, maxGenerations, crossoverProb, neighborProb, neighborSize, distributionIndex, differentialWeight, maxReplace, epsilon, lowerBound, upperBound`_`)`

The _`InitPolicyType`_ template parameter refers to the strategy used to
initialize the reference directions.

The following types are available:

 * **`Uniform`**
 * **`BayesianBootstrap`**
 * **`Dirichlet`**

The _`DecompPolicyType`_ template parameter refers to the strategy used to
decompose the weight vectors to form a scalar objective function.

The following types are available:

 * **`Tchebycheff`**
 * **`WeightedAverage`**
 * **`PenaltyBoundaryIntersection`**

For convenience the following types can be used:

 * **`DefaultMOEAD`** (equivalent to `MOEAD<Uniform, Tchebycheff>`): utilizes Uniform method for weight initialization
 and Tchebycheff for weight decomposition.

 * **`BBSMOEAD`** (equivalent to `MOEAD<BayesianBootstrap, Tchebycheff>`): utilizes Bayesian Bootstrap method for weight initialization and Tchebycheff for weight decomposition.

 * **`DirichletMOEAD`** (equivalent to `MOEAD<Dirichlet, Tchebycheff>`): utilizes Dirichlet sampling for weight initialization
 and Tchebycheff for weight decomposition.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`populationSize`** | The number of candidates in the population. | `150` |
| `size_t` | **`maxGenerations`** | The maximum number of generations allowed. | `300` |
| `double` | **`crossoverProb`** | Probability that a crossover will occur. | `1.0` |
| `double` | **`neighborProb`** | The probability of sampling from a neighbor. | `0.9` |
| `size_t` | **`neighborSize`** | The number of nearest-neighbours to consider per weight vector. | `20` |
| `double` | **`distributionIndex`** | The crowding degree of the mutation. | `20` |
| `double` | **`differentialWeight`** | Amplification factor of the differentiation. | `0.5` |
| `size_t` | **`maxReplace`** | The limit of solutions allowed to be replaced by a child. | `2` |
| `double` | **`epsilon`** | Handles numerical stability after weight initialization. | `1E-10` |
| `double`, `arma::vec` | **`lowerBound`** | Lower bound of the coordinates of the whole population during the search process. | `0` |
| `double`, `arma::vec` | **`upperBound`** | Upper bound of the coordinates of the whole population during the search process. | `1` |
| `InitPolicyType` | **`initPolicy`** | Instantiated init policy used to initialize weights. | `InitPolicyType()` |
| `DecompPolicyType` | **`decompPolicy`** | Instantiated decomposition policy used to create the scalar objective problem. | `DecompPolicyType()` |

Attributes of the optimizer may also be changed via the member methods
`PopulationSize()`, `MaxGenerations()`, `CrossoverRate()`, `NeighborProb()`, `NeighborSize()`, `DistributionIndex()`,
`DifferentialWeight()`, `MaxReplace()`, `Epsilon()`, `LowerBound()`, `UpperBound()`, `InitPolicy()` and `DecompPolicy()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
SchafferFunctionN1<arma::mat> SCH;
arma::vec lowerBound("-10 -10");
arma::vec upperBound("10 10");
DefaultMOEAD opt(300, 300, 1.0, 0.9, 20, 20, 0.5, 2, 1E-10, lowerBound, upperBound);
typedef decltype(SCH.objectiveA) ObjectiveTypeA;
typedef decltype(SCH.objectiveB) ObjectiveTypeB;
arma::mat coords = SCH.GetInitialPoint();
std::tuple<ObjectiveTypeA, ObjectiveTypeB> objectives = SCH.GetObjectives();
// obj will contain the minimum sum of objectiveA and objectiveB found on the best front.
double obj = opt.Optimize(objectives, coords);
// Now obtain the best front.
arma::cube bestFront = opt.ParetoFront();
```

</details>

#### See also

* [MOEA/D-DE Algorithm](https://ieeexplore.ieee.org/document/4633340)
* [Multi-objective Functions in Wikipedia](https://en.wikipedia.org/wiki/Test_functions_for_optimization#Test_functions_for_multi-objective_optimization)
* [Multi-objective functions](#multi-objective-functions)

## NSGA2

*An optimizer for arbitrary multi-objective functions.*

NSGA2 (Non-dominated Sorting Genetic Algorithm - II) is a multi-objective
optimization algorithm. The algorithm works by generating a candidate population
from a fixed starting point. At each stage of optimization, a new population of
children is generated. This new population along with its predecessor is sorted
using non-domination as the metric. Following this, the population is further
segregated into fronts. A new population is generated from these fronts having
size equal to that of the starting population.

#### Constructors

 * `NSGA2()`
 * `NSGA2(`_`populationSize, maxGenerations, crossoverProb, mutationProb, mutationStrength, epsilon, lowerBound, upperBound`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`populationSize`** | The number of candidates in the population. This should be at least 4 and a multiple of 4. | `100` |
| `size_t` | **`maxGenerations`** | The maximum number of generations allowed for NSGA2. | `2000` |
| `double` | **`crossoverProb`** | Probability that a crossover will occur. | `0.6` |
| `double` | **`mutationProb`** | Probability that a weight will get mutated. | `0.3` |
| `double` | **`mutationStrength`** | The range of mutation noise to be added. This range is between 0 and mutationStrength. | `0.001` |
| `double` | **`epsilon`** | The value used internally to evaluate approximate equality in crowding distance based sorting. | `1e-6` |
| `double`, `arma::vec` | **`lowerBound`** | Lower bound of the coordinates of the whole population during the search process. | `0` |
| `double`, `arma::vec` | **`upperBound`** | Upper bound of the coordinates of the whole population during the search process. | `1` |

Note that the parameters `lowerBound` and `upperBound` are overloaded. Data types of `double` or `arma::vec` may be used. If they are initialized as single values of `double`, then the same value of the bound applies to all the axes, resulting in an initialization following a uniform distribution in a hypercube. If they are initialized as vectors of `arma::vec`, then the value of `lowerBound[i]` applies to axis `[i]`; similarly, for values in `upperBound`. This results in an initialization following a uniform distribution in a hyperrectangle within the specified bounds.

Attributes of the optimizer may also be changed via the member methods
`PopulationSize()`, `MaxGenerations()`, `CrossoverRate()`, `MutationProbability()`, `MutationStrength()`, `Epsilon()`, `LowerBound()` and `UpperBound()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
SchafferFunctionN1<arma::mat> SCH;
arma::vec lowerBound("-1000 -1000");
arma::vec upperBound("1000 1000");
NSGA2 opt(20, 5000, 0.5, 0.5, 1e-3, 1e-6, lowerBound, upperBound);

typedef decltype(SCH.objectiveA) ObjectiveTypeA;
typedef decltype(SCH.objectiveB) ObjectiveTypeB;

arma::mat coords = SCH.GetInitialPoint();
std::tuple<ObjectiveTypeA, ObjectiveTypeB> objectives = SCH.GetObjectives();

// obj will contain the minimum sum of objectiveA and objectiveB found on the best front.
double obj = opt.Optimize(objectives, coords);
// Now obtain the best front.
arma::cube bestFront = opt.Front();
```

</details>

#### See also:

 * [NSGA-II Algorithm](https://www.iitk.ac.in/kangal/Deb_NSGA-II.pdf)
 * [Multi-objective Functions in Wikipedia](https://en.wikipedia.org/wiki/Test_functions_for_optimization#Test_functions_for_multi-objective_optimization)
 * [Multi-objective functions](#multi-objective-functions)

## OptimisticAdam

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

OptimisticAdam is an optimizer which implements the Optimistic Adam algorithm,
which uses Optimistic Mirror Descent (OMD) with the Adam optimizer. It addresses
the problem of limit cycling while training GANs (generative adversarial
networks). It uses OMD to achieve faster regret rates in solving the zero-sum
game of training a GAN. It consistently achieves a smaller KL divergence with
respect to the true underlying data distribution. The implementation here can be
used with any differentiable separable function, not just GAN training.

#### Constructors

 * `OptimisticAdam()`
 * `OptimisticAdam(`_`stepSize, batchSize`_`)`
 * `OptimisticAdam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
 * `OptimisticAdam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy`_`)`

Note that the `OptimisticAdam` class is based on the
`AdamType<`_`UpdateRule`_`>` class with _`UpdateRule`_` = OptimisticAdamUpdate`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, and `ResetPolicy()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

OptimisticAdam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Training GANs with Optimism](https://arxiv.org/pdf/1711.00141.pdf)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Padam

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Padam is a variant of Adam with a partially adaptive momentum estimation method.
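
The idea can be sketched for a single scalar parameter: Padam raises the second-moment normalizer to a fractional power `partial`, interpolating between SGD with momentum (as `partial` approaches 0) and an AMSGrad-style fully adaptive step (`partial = 0.5`). The helper below is an illustrative sketch under those assumptions (the name `PadamStep` is hypothetical), not ensmallen's internal update rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// One Padam-style step for a scalar parameter.  The second-moment running
// maximum vMax is raised to the fractional power `partial` instead of 0.5,
// which is the "partially adaptive" part of the method.
// Illustrative sketch only; not ensmallen's Padam implementation.
double PadamStep(double theta, double grad, double& m, double& v,
                 double& vMax, double stepSize, double beta1, double beta2,
                 double partial, double epsilon, int t)
{
  m = beta1 * m + (1.0 - beta1) * grad;          // first moment estimate
  v = beta2 * v + (1.0 - beta2) * grad * grad;   // second moment estimate
  vMax = std::max(vMax, v);                      // AMSGrad-style maximum
  const double mHat = m / (1.0 - std::pow(beta1, t));  // bias correction
  return theta - stepSize * mHat / (std::pow(vMax, partial) + epsilon);
}
```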
1791 1792#### Constructors 1793 1794 * `Padam()` 1795 * `Padam(`_`stepSize, batchSize`_`)` 1796 * `Padam(`_`stepSize, batchSize, beta1, beta2, partial, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)` 1797 1798#### Attributes 1799 1800| **type** | **name** | **description** | **default** | 1801|----------|----------|-----------------|-------------| 1802| `double` | **`stepSize`** | Step size for each iteration. | `0.001` | 1803| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` | 1804| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` | 1805| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` | 1806| `double` | **`partial`** | Partially adaptive parameter. | `0.25` | 1807| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` | 1808| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` | 1809| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` | 1810| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` | 1811| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` | 1812| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` | 1813 1814The attributes of the optimizer may also be modified via the member methods 1815`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Partial()`, `Epsilon()`, 1816`MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`. 1817 1818#### Examples 1819 1820<details open> 1821<summary>Click to collapse/expand example code. 
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Padam optimizer(0.001, 32, 0.9, 0.999, 0.25, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks](https://arxiv.org/abs/1806.06763)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
 * [Differentiable separable functions](#differentiable-separable-functions)

## PSO

*An optimizer for [arbitrary functions](#arbitrary-functions).*

PSO (Particle Swarm Optimization) is an evolutionary approach to optimization
that is inspired by flocks of birds and schools of fish. The fundamental analogy
is that every creature (a particle in the swarm) is at a measurable position of
goodness or fitness, and this information can be shared amongst the creatures in
the flock, so that iteratively, the entire flock can get close to the global
optimum.
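
That sharing mechanism can be sketched as the canonical velocity/position update for one particle: each step pulls the particle toward its own best-seen position and toward the best position in its neighborhood, each scaled by a random factor. The helper name `PsoStep` is hypothetical; this shows the textbook rule, not necessarily the exact `LBestUpdate` implementation in ensmallen.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One velocity/position update for a single particle.  exploitationFactor
// weights the pull toward the particle's personal best; explorationFactor
// weights the pull toward the best position in its neighborhood.
// Illustrative sketch of the canonical PSO rule only.
void PsoStep(std::vector<double>& position, std::vector<double>& velocity,
             const std::vector<double>& personalBest,
             const std::vector<double>& neighborhoodBest,
             double exploitationFactor, double explorationFactor,
             std::mt19937& rng)
{
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  for (std::size_t i = 0; i < position.size(); ++i)
  {
    velocity[i] += exploitationFactor * unif(rng) * (personalBest[i] - position[i])
                 + explorationFactor * unif(rng) * (neighborhoodBest[i] - position[i]);
    position[i] += velocity[i];
  }
}
```

With both factors near 2.05 (the defaults below), the random scaling keeps the swarm oscillating around promising regions rather than converging immediately.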
1846 1847#### Constructors 1848 1849 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>()` 1850 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles`_`)` 1851 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound`_`)` 1852 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations`_`)` 1853 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations, horizonSize`_`)` 1854 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations, horizonSize, impTolerance`_`)` 1855 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations, horizonSize, impTolerance, exploitationFactor, explorationFactor`_`)` 1856 1857#### Attributes 1858 1859| **type** | **name** | **description** | **default** | 1860|----------|----------|-----------------|-------------| 1861| `size_t` | **`numParticles`** | numParticles Number of particles in the swarm. | `64` | 1862| `double`, `arma::mat` | **`lowerBound`** | Lower bound of the coordinates of the initial population. | `1` | 1863| `double`, `arma::mat` | **`upperBound`** | Upper bound of the coordinates of the initial population. | `1` | 1864| `size_t` | **`maxIterations`** | Maximum number of iterations allowed. | `3000` | 1865| `size_t` | **`horizonSize`** | Size of the lookback-horizon for computing improvement. | `350` | 1866| `double` | **`impTolerance`** | The final value of the objective function for termination. If set to negative value, tolerance is not considered. | `1e-5` | 1867| `double` | **`exploitationFactor`** | Influence of the personal best of the particle. | `2.05` | 1868| `double` | **`explorationFactor`** | Influence of the neighbours of the particle. | `2.05` | 1869 1870Note that the parameters `lowerBound` and `upperBound` are overloaded. Data types of `double` or `arma::mat` may be used. 
If they are initialized as single values of `double`, then the same value of the bound applies to all the axes, resulting in an initialization following a uniform distribution in a hypercube. If they are initialized as matrices of `arma::mat`, then the value of `lowerBound[i]` applies to axis `[i]`; similarly, for values in `upperBound`. This results in an initialization following a uniform distribution in a hyperrectangle within the specified bounds. 1871 1872Attributes of the optimizer may also be changed via the member methods 1873`NumParticles()`, `LowerBound()`, `UpperBound()`, `MaxIterations()`, 1874`HorizonSize()`, `ImpTolerance()`,`ExploitationFactor()`, and 1875`ExplorationFactor()`. 1876 1877At present, only the local-best variant of PSO is present in ensmallen. The optimizer may be initialized using the class type `LBestPSO`, which is an alias for `PSOType<LBestUpdate, DefaultInit>`. 1878 1879#### Examples: 1880 1881<details open> 1882<summary>Click to collapse/expand example code. 1883</summary> 1884 1885```c++ 1886SphereFunction f(4); 1887arma::vec coordinates = f.GetInitialPoint(); 1888 1889LBestPSO s; 1890const double result = s.Optimize(f, coordinates) 1891``` 1892 1893</details> 1894 1895<details open> 1896<summary>Click to collapse/expand example code. 1897</summary> 1898 1899```c++ 1900RosenbrockFunction f; 1901arma::vec coordinates = f.GetInitialPoint(); 1902 1903// Setting bounds for the initial swarm population of size 2. 1904arma::vec lowerBound("50 50"); 1905arma::vec upperBound("60 60"); 1906 1907LBestPSO s(200, lowerBound, upperBound, 3000, 600, 1e-30, 2.05, 2.05); 1908const double result = s.Optimize(f, coordinates) 1909``` 1910 1911</details> 1912 1913<details open> 1914<summary>Click to collapse/expand example code. 1915</summary> 1916 1917```c++ 1918RosenbrockFunction f; 1919arma::vec coordinates = f.GetInitialPoint(); 1920 1921// Setting bounds for the initial swarm population as type double. 
double lowerBound = 50;
double upperBound = 60;

LBestPSO s(64, lowerBound, upperBound, 3000, 400, 1e-30, 2.05, 2.05);
const double result = s.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Particle Swarm Optimization](http://www.swarmintelligence.org/)
 * [Arbitrary functions](#arbitrary-functions)

## Primal-dual SDP Solver

*An optimizer for [semidefinite programs](#semidefinite-programs).*

A primal-dual interior point method solver. This can solve semidefinite
programs.

#### Constructors

 * `PrimalDualSolver<>(`_`maxIterations`_`)`
 * `PrimalDualSolver<>(`_`maxIterations, tau, normXzTol, primalInfeasTol, dualInfeasTol`_`)`

#### Attributes

The `PrimalDualSolver<>` class has several attributes that are only modifiable
as member methods.

| **type** | **method name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`Tau()`** | Value of tau used to compute alpha\_hat. | `0.99` |
| `double` | **`NormXZTol()`** | Tolerance for the norm of X\*Z. | `1e-7` |
| `double` | **`PrimalInfeasTol()`** | Tolerance for primal infeasibility. | `1e-7` |
| `double` | **`DualInfeasTol()`** | Tolerance for dual infeasibility. | `1e-7` |
| `size_t` | **`MaxIterations()`** | Maximum number of iterations before convergence. | `1000` |

#### Optimization

The `PrimalDualSolver<>` class offers two overloads of `Optimize()` that
optionally return the converged values for the dual variables.

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
/**
 * Invoke the optimization procedure, returning the converged values for the
 * primal and dual variables.
 */
template<typename SDPType>
double Optimize(SDPType& s,
                arma::mat& X,
                arma::vec& ySparse,
                arma::vec& yDense,
                arma::mat& Z);

/**
 * Invoke the optimization procedure, and only return the primal variable.
 */
template<typename SDPType>
double Optimize(SDPType& s, arma::mat& X);
```

</details>

The _`SDPType`_ template parameter specifies the type of SDP to solve. The
`SDP<arma::mat>` and `SDP<arma::sp_mat>` classes are available for use; these
represent SDPs with dense and sparse `C` matrices, respectively. The `SDP<>`
class is detailed in the
[semidefinite program documentation](#semidefinite-programs). _`SDPType`_ is
automatically inferred when `Optimize()` is called with an SDP.

#### See also:

 * [Primal-dual interior-point methods for semidefinite programming](http://www.dtic.mil/dtic/tr/fulltext/u2/1020236.pdf)
 * [Semidefinite programming on Wikipedia](https://en.wikipedia.org/wiki/Semidefinite_programming)
 * [Semidefinite programs](#semidefinite-programs) (includes example usage of `PrimalDualSolver`)

## Quasi-Hyperbolic Momentum Update SGD (QHSGD)

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Quasi-hyperbolic momentum update SGD (QHSGD) is an SGD-like optimizer with
momentum where quasi-hyperbolic terms are added to the parametrization. The
update rule for this optimizer is a weighted average of momentum SGD and vanilla
SGD.

#### Constructors

 * `QHSGD()`
 * `QHSGD(`_`stepSize, batchSize`_`)`
 * `QHSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, exactObjective`_`)`

Note that `QHSGD` is based on the templated type
`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
QHUpdate` and _`DecayPolicyType`_` = NoDecay`.
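
That weighted average can be sketched for a single scalar parameter as below. `QhsgdStep` is a hypothetical helper illustrating the rule from the quasi-hyperbolic momentum paper, not ensmallen's `QHUpdate` implementation: the update direction blends the momentum buffer (weight `v`) with the raw gradient (weight `1 - v`), so `v = 0` recovers vanilla SGD and `v = 1` recovers momentum SGD.

```cpp
#include <cassert>

// One quasi-hyperbolic momentum (QH) step for a scalar parameter.
// Illustrative sketch only; not ensmallen's QHUpdate implementation.
double QhsgdStep(double theta, double grad, double& momentumBuffer,
                 double stepSize, double v, double momentum)
{
  // Normalized exponential moving average of gradients.
  momentumBuffer = momentum * momentumBuffer + (1.0 - momentum) * grad;
  // Weighted average of the plain gradient and the momentum buffer.
  const double direction = (1.0 - v) * grad + v * momentumBuffer;
  return theta - stepSize * direction;
}
```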
2024 2025#### Attributes 2026 2027 | **type** | **name** | **description** | **default** | 2028 |----------|----------|-----------------|-------------| 2029 | `double` | **`stepSize`** | Step size for each iteration. | `0.01` | 2030 | `size_t` | **`batchSize`** | Batch size to use for each step. | `32` | 2031 | `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` | 2032 | `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` | 2033 | `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` | 2034 | `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` | 2035 2036 Attributes of the optimizer may also be modified via the member methods 2037 `StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, and `ExactObjective()`. 2038 2039 Note that the `QHUpdate` class has the constructor `QHUpdate(`_`v, 2040momentum`_`)` with a default value of `0.7` for the quasi-hyperbolic term `v` 2041and `0.999` for the momentum term. 2042 2043#### Examples 2044 2045<details open> 2046<summary>Click to collapse/expand example code. 
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

QHSGD optimizer(0.01, 32, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Quasi-Hyperbolic Momentum and Adam For Deep Learning](https://arxiv.org/pdf/1810.06801.pdf)
 * [SGD](#standard-sgd)
 * [Momentum SGD](#momentum-sgd)
 * [Nesterov Momentum SGD](#nesterov-momentum-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## QHAdam

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

QHAdam is an optimizer that uses quasi-hyperbolic descent with the Adam
optimizer. This replaces the moment estimators of Adam with quasi-hyperbolic
terms, and various values of the `v1` and `v2` parameters are equivalent to
the following other optimizers:

 * When `v1 = v2 = 1`, `QHAdam` is equivalent to `Adam`.

 * When `v1 = 0` and `v2 = 1`, `QHAdam` is equivalent to `RMSProp`.

 * When `v1 = beta1` and `v2 = 1`, `QHAdam` is equivalent to `Nadam`.

#### Constructors

 * `QHAdam()`
 * `QHAdam(`_`stepSize, batchSize`_`)`
 * `QHAdam(`_`stepSize, batchSize, v1, v2, beta1, beta2, eps, maxIterations`_`)`
 * `QHAdam(`_`stepSize, batchSize, v1, v2, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
| `double` | **`v1`** | The first quasi-hyperbolic term. | `0.7` |
| `double` | **`v2`** | The second quasi-hyperbolic term. | `1.00` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

The attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, `V1()`, `V2()`, `ResetPolicy()`, and `ExactObjective()`.
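
The equivalences listed above can be seen in a single-parameter sketch: each of Adam's bias-corrected moment estimates is blended with the raw gradient using the quasi-hyperbolic weights `v1` and `v2`. The helper name `QhAdamStep` is hypothetical and this is an illustration of the rule, not ensmallen's internal `QHAdamUpdate`.

```cpp
#include <cassert>
#include <cmath>

// One QHAdam step for a scalar parameter.  With v1 = v2 = 1 the
// quasi-hyperbolic blending disappears and this reduces to a plain Adam
// step.  Illustrative sketch only.
double QhAdamStep(double theta, double grad, double& m, double& v,
                  double stepSize, double v1, double v2,
                  double beta1, double beta2, double eps, int t)
{
  m = beta1 * m + (1.0 - beta1) * grad;          // first moment estimate
  v = beta2 * v + (1.0 - beta2) * grad * grad;   // second moment estimate
  const double mHat = m / (1.0 - std::pow(beta1, t));  // bias corrections
  const double vHat = v / (1.0 - std::pow(beta2, t));
  // Quasi-hyperbolic blending of raw gradient and moment estimates.
  const double numerator = (1.0 - v1) * grad + v1 * mHat;
  const double denominator =
      std::sqrt((1.0 - v2) * grad * grad + v2 * vHat) + eps;
  return theta - stepSize * numerator / denominator;
}
```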
#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

QHAdam optimizer(0.001, 32, 0.7, 0.9, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Quasi-Hyperbolic Momentum and Adam For Deep Learning](https://arxiv.org/pdf/1810.06801.pdf)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [SGD](#standard-sgd)
 * [Adam](#adam)
 * [RMSprop](#rmsprop)
 * [Nadam](#nadam)
 * [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
 * [Differentiable separable functions](#differentiable-separable-functions)

## RMSProp

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

RMSProp utilizes the magnitude of recent gradients to normalize the gradients.

#### Constructors

 * `RMSProp()`
 * `RMSProp(`_`stepSize, batchSize`_`)`
 * `RMSProp(`_`stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle`_`)`
 * `RMSProp(`_`stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Number of points to process in each step. | `32` |
| `double` | **`alpha`** | Smoothing constant, similar to that used in AdaDelta and momentum methods. | `0.99` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer can also be modified via the member methods
`StepSize()`, `BatchSize()`, `Alpha()`, `Epsilon()`, `MaxIterations()`,
`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

RMSProp optimizer(1e-3, 1, 0.99, 1e-8, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Divide the gradient by a running average of its recent magnitude](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Simulated Annealing (SA)

*An optimizer for [arbitrary functions](#arbitrary-functions).*

Simulated Annealing is a stochastic optimization algorithm which is able to
deliver near-optimal results quickly without knowing the gradient of the
function being optimized. It has a unique hill climbing capability that makes
it less vulnerable to local minima.
This implementation uses an exponential cooling schedule and feedback move
control by default, but the cooling schedule can be changed via a template
parameter.

#### Constructors

 * `SA<`_`CoolingScheduleType`_`>(`_`coolingSchedule`_`)`
 * `SA<`_`CoolingScheduleType`_`>(`_`coolingSchedule, maxIterations`_`)`
 * `SA<`_`CoolingScheduleType`_`>(`_`coolingSchedule, maxIterations, initT, initMoves, moveCtrlSweep, tolerance, maxToleranceSweep, maxMoveCoef, initMoveCoef, gain`_`)`

The _`CoolingScheduleType`_ template parameter implements a policy to update the
temperature. The `ExponentialSchedule` class is available for use; it has a
constructor `ExponentialSchedule(`_`lambda`_`)` where _`lambda`_ is the cooling
speed (default `0.001`). Custom schedules may be created by implementing a
class with at least the single member method below:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
// Return the next temperature given the current system status.
double NextTemperature(const double currentTemperature,
                       const double currentEnergy);
```

</details>

For convenience, the default cooling schedule is `ExponentialSchedule`, so the
shorter type `SA<>` may be used instead of the equivalent
`SA<ExponentialSchedule>`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `CoolingScheduleType` | **`coolingSchedule`** | Instantiated cooling schedule (default `ExponentialSchedule`). | `CoolingScheduleType()` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 indicates no limit). | `1000000` |
| `double` | **`initT`** | Initial temperature. | `10000.0` |
| `size_t` | **`initMoves`** | Number of initial iterations without changing temperature. | `1000` |
| `size_t` | **`moveCtrlSweep`** | Sweeps per feedback move control. | `100` |
| `double` | **`tolerance`** | Tolerance to consider system frozen. | `1e-5` |
| `size_t` | **`maxToleranceSweep`** | Maximum sweeps below tolerance to consider system frozen. | `3` |
| `double` | **`maxMoveCoef`** | Maximum move size. | `20` |
| `double` | **`initMoveCoef`** | Initial move size. | `0.3` |
| `double` | **`gain`** | Proportional control in feedback move control. | `0.3` |

Attributes of the optimizer may also be changed via the member methods
`CoolingSchedule()`, `MaxIterations()`, `InitT()`, `InitMoves()`,
`MoveCtrlSweep()`, `Tolerance()`, `MaxToleranceSweep()`, `MaxMoveCoef()`,
`InitMoveCoef()`, and `Gain()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SA<> optimizer(ExponentialSchedule(), 1000000, 1000., 1000, 100, 1e-10, 3, 1.5,
               0.5, 0.3);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Simulated annealing on Wikipedia](https://en.wikipedia.org/wiki/Simulated_annealing)
 * [Arbitrary functions](#arbitrary-functions)

## Simultaneous Perturbation Stochastic Approximation (SPSA)

*An optimizer for [arbitrary functions](#arbitrary-functions).*

The SPSA algorithm approximates the gradient of the function by finite
differences along stochastic directions.

#### Constructors

 * `SPSA(`_`alpha, gamma, stepSize, evaluationStepSize, maxIterations, tolerance`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`alpha`** | Scaling exponent for the step size. | `0.602` |
| `double` | **`gamma`** | Scaling exponent for evaluation step size. | `0.101` |
| `double` | **`stepSize`** | Scaling parameter for step size (named as 'a' in the paper). | `0.16` |
| `double` | **`evaluationStepSize`** | Scaling parameter for evaluation step size (named as 'c' in the paper). | `0.3` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |

Attributes of the optimizer may also be changed via the member methods
`Alpha()`, `Gamma()`, `StepSize()`, `EvaluationStepSize()`, and `MaxIterations()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
SphereFunction f(2);
arma::mat coordinates = f.GetInitialPoint();

SPSA optimizer(0.1, 0.102, 0.16, 0.3, 100000, 1e-5);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [An Overview of the Simultaneous Perturbation Method for Efficient Optimization](https://pdfs.semanticscholar.org/bf67/0fb6b1bd319938c6a879570fa744cf36b240.pdf)
 * [SPSA on Wikipedia](https://en.wikipedia.org/wiki/Simultaneous_perturbation_stochastic_approximation)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Arbitrary functions](#arbitrary-functions)

## Stochastic Recursive Gradient Algorithm (SARAH/SARAH+)

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

The StochAstic Recursive gRadient algoritHm (SARAH) is a variance-reducing
stochastic recursive gradient algorithm, which employs the stochastic recursive
gradient to solve empirical loss minimization problems, including the case of
nonconvex losses.
#### Constructors

 * `SARAHType<`_`UpdatePolicyType`_`>()`
 * `SARAHType<`_`UpdatePolicyType`_`>(`_`stepSize, batchSize`_`)`
 * `SARAHType<`_`UpdatePolicyType`_`>(`_`stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, exactObjective`_`)`

The _`UpdatePolicyType`_ template parameter specifies the update step used for
the optimizer. The `SARAHUpdate` and `SARAHPlusUpdate` classes are available
for use, and implement the standard SARAH update and SARAH+ update,
respectively. A custom update rule can be used by implementing a class with the
same method signatures.

For convenience the following typedefs have been defined:

 * `SARAH` (equivalent to `SARAHType<SARAHUpdate>`): the standard SARAH optimizer
 * `SARAH_Plus` (equivalent to `SARAHType<SARAHPlusUpdate>`): the SARAH+ optimizer

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
| `size_t` | **`innerIterations`** | The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. | `0` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be changed via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `InnerIterations()`,
`Tolerance()`, `Shuffle()`, `UpdatePolicy()`, and `ExactObjective()`.

Note that the default value for `updatePolicy` is the default constructor for
the `UpdatePolicyType`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Standard SARAH optimizer.
SARAH optimizer(0.01, 1, 5000, 0, 1e-5, true);
optimizer.Optimize(f, coordinates);

// SARAH+ optimizer.
SARAH_Plus optimizerPlus(0.01, 1, 5000, 0, 1e-5, true);
optimizerPlus.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Stochastic Recursive Gradient Algorithm for Nonconvex Optimization](https://arxiv.org/abs/1705.07261)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Standard SGD

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Stochastic Gradient Descent is a technique for minimizing a function which
can be expressed as a sum of other functions. It's likely better to use any of
the other variants of SGD than this class; however, this standard SGD
implementation may still be useful in some situations.
#### Constructors

 * `StandardSGD()`
 * `StandardSGD(`_`stepSize, batchSize`_`)`
 * `StandardSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy, exactObjective`_`)`

Note that `StandardSGD` is based on the templated type
`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
VanillaUpdate` and _`DecayPolicyType`_` = NoDecay`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`,
`UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and `ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

StandardSGD optimizer(0.01, 32, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Momentum SGD](#momentum-sgd)
 * [Nesterov Momentum SGD](#nesterov-momentum-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Stochastic Coordinate Descent (SCD)

*An optimizer for [partially differentiable functions](#partially-differentiable-functions).*

Stochastic Coordinate Descent is a technique for minimizing a function by
doing a line search along a single direction at the current point in the
iteration. The direction (or "coordinate") can be chosen cyclically, randomly,
or in a greedy fashion.

#### Constructors

 * `SCD<`_`DescentPolicyType`_`>()`
 * `SCD<`_`DescentPolicyType`_`>(`_`stepSize, maxIterations`_`)`
 * `SCD<`_`DescentPolicyType`_`>(`_`stepSize, maxIterations, tolerance, updateInterval`_`)`
 * `SCD<`_`DescentPolicyType`_`>(`_`stepSize, maxIterations, tolerance, updateInterval, descentPolicy`_`)`

The _`DescentPolicyType`_ template parameter specifies the behavior of SCD when
selecting the next coordinate to descend with. The `RandomDescent`,
`GreedyDescent`, and `CyclicDescent` classes are available for use. Custom
behavior can be achieved by implementing a class with the same method
signatures.
For convenience, the following typedefs have been defined:

 * `RandomSCD` (equivalent to `SCD<RandomDescent>`): selects coordinates randomly
 * `GreedySCD` (equivalent to `SCD<GreedyDescent>`): selects the coordinate with the maximum guaranteed descent according to the Gauss-Southwell rule
 * `CyclicSCD` (equivalent to `SCD<CyclicDescent>`): selects coordinates sequentially

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate the algorithm. | `1e-5` |
| `size_t` | **`updateInterval`** | The interval at which the objective is to be reported and checked for convergence. | `1e3` |
| `DescentPolicyType` | **`descentPolicy`** | The policy to use for selecting the coordinate to descend on. | `DescentPolicyType()` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `MaxIterations()`, `Tolerance()`, `UpdateInterval()`, and
`DescentPolicy()`.

Note that the default value for `descentPolicy` is the default constructor for
_`DescentPolicyType`_.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
SparseTestFunction f;
arma::mat coordinates = f.GetInitialPoint();

RandomSCD randomscd(0.01, 100000, 1e-5, 1e3);
randomscd.Optimize(f, coordinates);

GreedySCD greedyscd(0.01, 100000, 1e-5, 1e3);
greedyscd.Optimize(f, coordinates);

CyclicSCD cyclicscd(0.01, 100000, 1e-5, 1e3);
cyclicscd.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Coordinate descent on Wikipedia](https://en.wikipedia.org/wiki/Coordinate_descent)
 * [Stochastic Methods for L1-Regularized Loss Minimization](https://www.jmlr.org/papers/volume12/shalev-shwartz11a/shalev-shwartz11a.pdf)
 * [Partially differentiable functions](#partially-differentiable-functions)

## Stochastic Gradient Descent with Restarts (SGDR)

*An optimizer for [differentiable separable
functions](#differentiable-separable-functions).*

SGDR is based on the mini-batch Stochastic Gradient Descent class and simulates
a new warm-started run/restart once a number of epochs are performed.

#### Constructors

 * `SGDR<`_`UpdatePolicyType`_`>()`
 * `SGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize`_`)`
 * `SGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy`_`)`
 * `SGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy, resetPolicy, exactObjective`_`)`

The _`UpdatePolicyType`_ template parameter controls the update policy used
during the iterative update process. The `MomentumUpdate` class is available
for use, and custom behavior can be achieved by implementing a class with the
same method signatures as `MomentumUpdate`.
For convenience, the default type of _`UpdatePolicyType`_ is `MomentumUpdate`,
so the shorter type `SGDR<>` can be used instead of the equivalent
`SGDR<MomentumUpdate>`.

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`epochRestart`** | Initial epoch where decay is applied. | `50` |
| `double` | **`multFactor`** | Batch size multiplication factor. | `2.0` |
| `size_t` | **`batchSize`** | Size of each mini-batch. | `1000` |
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer can also be modified via the member methods
`EpochRestart()`, `MultFactor()`, `BatchSize()`, `StepSize()`,
`MaxIterations()`, `Tolerance()`, `Shuffle()`, `UpdatePolicy()`,
`ResetPolicy()`, and `ExactObjective()`.

Note that the default value for `updatePolicy` is the default constructor for
the `UpdatePolicyType`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Snapshot Stochastic Gradient Descent with Restarts (SnapshotSGDR)

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

SnapshotSGDR simulates a new warm-started run/restart once a number of epochs
are performed using the Snapshot Ensembles technique.

#### Constructors

 * `SnapshotSGDR<`_`UpdatePolicyType`_`>()`
 * `SnapshotSGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize`_`)`
 * `SnapshotSGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy`_`)`
 * `SnapshotSGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy, resetPolicy, exactObjective`_`)`

The _`UpdatePolicyType`_ template parameter controls the update policy used
during the iterative update process. The `MomentumUpdate` class is available
for use, and custom behavior can be achieved by implementing a class with the
same method signatures as `MomentumUpdate`.

For convenience, the default type of _`UpdatePolicyType`_ is `MomentumUpdate`,
so the shorter type `SnapshotSGDR<>` can be used instead of the equivalent
`SnapshotSGDR<MomentumUpdate>`.
#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `size_t` | **`epochRestart`** | Initial epoch where decay is applied. | `50` |
| `double` | **`multFactor`** | Batch size multiplication factor. | `2.0` |
| `size_t` | **`batchSize`** | Size of each mini-batch. | `1000` |
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
| `size_t` | **`snapshots`** | Maximum number of snapshots. | `5` |
| `bool` | **`accumulate`** | Accumulate the snapshot parameter. | `true` |
| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer can also be modified via the member methods
`EpochRestart()`, `MultFactor()`, `BatchSize()`, `StepSize()`,
`MaxIterations()`, `Tolerance()`, `Shuffle()`, `Snapshots()`, `Accumulate()`,
`UpdatePolicy()`, `ResetPolicy()`, and `ExactObjective()`.

The `Snapshots()` function returns a `std::vector<arma::mat>&` (a vector of
snapshots of the parameters), not a `size_t` representing the maximum number of
snapshots.

Note that the default value for `updatePolicy` is the default constructor for
the `UpdatePolicyType`.
#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SnapshotSGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Snapshot ensembles: Train 1, get m for free](https://arxiv.org/abs/1704.00109)
 * [SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## SMORMS3

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

SMORMS3 is a hybrid of RMSProp that attempts to estimate a safe and optimal
step distance based on curvature, or, failing that, simply to normalize the
step size in the parameter space.

#### Constructors

 * `SMORMS3()`
 * `SMORMS3(`_`stepSize, batchSize`_`)`
 * `SMORMS3(`_`stepSize, batchSize, epsilon, maxIterations, tolerance`_`)`
 * `SMORMS3(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process at each step. | `32` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-16` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer can also be modified via the member methods
`StepSize()`, `BatchSize()`, `Epsilon()`, `MaxIterations()`, `Tolerance()`,
`Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SMORMS3 optimizer(0.001, 1, 1e-16, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [RMSprop loses to SMORMS3 - Beware the Epsilon!](https://sifter.org/simon/journal/20150420.html)
 * [RMSProp](#rmsprop)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## Standard stochastic variance reduced gradient (SVRG)

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

Stochastic Variance Reduced Gradient is a technique for minimizing smooth and
strongly convex problems.
#### Constructors

 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>()`
 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>(`_`stepSize`_`)`
 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, innerIterations`_`)`
 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy, exactObjective`_`)`

The _`UpdatePolicyType`_ template parameter controls the update step used by
SVRG during the optimization. The `SVRGUpdate` class is available for use and
custom update behavior can be achieved by implementing a class with the same
method signatures as `SVRGUpdate`.

The _`DecayPolicyType`_ template parameter controls the decay policy used to
adjust the step size during the optimization. The `BarzilaiBorweinDecay` and
`NoDecay` classes are available for use. Custom decay functionality can be
achieved by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

 * `SVRG` (equivalent to `SVRGType<SVRGUpdate, NoDecay>`): the standard SVRG technique
 * `SVRG_BB` (equivalent to `SVRGType<SVRGUpdate, BarzilaiBorweinDecay>`): SVRG with the Barzilai-Borwein decay policy

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Initial batch size. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
| `size_t` | **`innerIterations`** | The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. | `0` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `InnerIterations()`,
`Tolerance()`, `Shuffle()`, `UpdatePolicy()`, `DecayPolicy()`,
`ResetPolicy()`, and `ExactObjective()`.

Note that the default values for the `updatePolicy` and `decayPolicy` parameters
are simply the default constructors of the _`UpdatePolicyType`_ and
_`DecayPolicyType`_ classes.

#### Examples:

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Standard stochastic variance reduced gradient.
SVRG optimizer(0.005, 1, 300, 0, 1e-10, true);
optimizer.Optimize(f, coordinates);

// Stochastic variance reduced gradient with Barzilai-Borwein.
SVRG_BB bbOptimizer(0.005, 1, 300, 0, 1e-10, true, SVRGUpdate(),
                    BarzilaiBorweinDecay(0.1));
bbOptimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Accelerating Stochastic Gradient Descent using Predictive Variance Reduction](https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf)
 * [SGD](#standard-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## SPALeRA Stochastic Gradient Descent (SPALeRASGD)

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

SPALeRA involves two components: a learning rate adaptation scheme, which
ensures that the learning system goes as fast as it can; and a catastrophic
event manager, which is in charge of detecting undesirable behaviors and
getting the system back on track.

#### Constructors

 * `SPALeRASGD<`_`DecayPolicyType`_`>()`
 * `SPALeRASGD<`_`DecayPolicyType`_`>(`_`stepSize, batchSize`_`)`
 * `SPALeRASGD<`_`DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, tolerance`_`)`
 * `SPALeRASGD<`_`DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, tolerance, lambda, alpha, epsilon, adaptRate, shuffle, decayPolicy, resetPolicy, exactObjective`_`)`

The _`DecayPolicyType`_ template parameter controls the decay in the step size
during the course of the optimization. The `NoDecay` class is available for
use; custom behavior can be achieved by implementing a class with the same
method signatures.

By default, _`DecayPolicyType`_ is set to `NoDecay`, so the shorter type
`SPALeRASGD<>` can be used instead of the equivalent `SPALeRASGD<NoDecay>`.
#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
| `size_t` | **`batchSize`** | Initial batch size. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `double` | **`lambda`** | Page-Hinkley update parameter. | `0.01` |
| `double` | **`alpha`** | Memory parameter of the agnostic learning rate adaptation. | `0.001` |
| `double` | **`epsilon`** | Numerical stability parameter. | `1e-6` |
| `double` | **`adaptRate`** | Agnostic learning rate update rate. | `3.10e-8` |
| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Lambda()`,
`Alpha()`, `Epsilon()`, `AdaptRate()`, `Shuffle()`, `DecayPolicy()`,
`ResetPolicy()`, and `ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SPALeRASGD<> optimizer(0.05, 1, 10000, 1e-4);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Stochastic Gradient Descent: Going As Fast As Possible But Not Faster](https://arxiv.org/abs/1709.01427)
 * [SGD](#standard-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## SWATS

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

SWATS is an optimizer that uses a simple strategy to switch from Adam to
standard SGD when a triggering condition is satisfied. The condition relates
to the projection of Adam steps on the gradient subspace.

#### Constructors

 * `SWATS()`
 * `SWATS(`_`stepSize, batchSize`_`)`
 * `SWATS(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance`_`)`
 * `SWATS(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
| `size_t` | **`batchSize`** | Number of points to process at each step. | `32` |
| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-16` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer can also be modified via the member methods
`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Epsilon()`,
`MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and
`ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SWATS optimizer(0.001, 1, 0.9, 0.999, 1e-16, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [Improving generalization performance by switching from Adam to SGD](https://arxiv.org/abs/1712.07628)
 * [Adam](#adam)
 * [Standard SGD](#standard-sgd)
 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)

## WNGrad

*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*

WNGrad is a general nonlinear update rule for the learning rate. WNGrad has
near-optimal convergence rates in both the batch and stochastic settings.
#### Constructors

 * `WNGrad()`
 * `WNGrad(`_`stepSize, batchSize`_`)`
 * `WNGrad(`_`stepSize, batchSize, maxIterations, tolerance, shuffle`_`)`
 * `WNGrad(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`

#### Attributes

| **type** | **name** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `double` | **`stepSize`** | Step size for each iteration. | `0.562` |
| `size_t` | **`batchSize`** | Initial batch size. | `32` |
| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |

Attributes of the optimizer may also be modified via the member methods
`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`,
`ResetPolicy()`, and `ExactObjective()`.

#### Examples

<details open>
<summary>Click to collapse/expand example code.
</summary>

```c++
RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

WNGrad optimizer(0.562, 1, 10000, 1e-4);
optimizer.Optimize(f, coordinates);
```

</details>

#### See also:

 * [WNGrad: Learn the Learning Rate in Gradient Descent](https://arxiv.org/abs/1803.02865)
 * [SGD](#standard-sgd)
 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
 * [Differentiable separable functions](#differentiable-separable-functions)