1## AdaBound
2
3*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
4
5AdaBound is a variant of Adam which employs dynamic bounds on learning rates.
6
7#### Constructors
8
9 * `AdaBound()`
10 * `AdaBound(`_`stepSize, batchSize`_`)`
11 * `AdaBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle`_`)`
12 * `AdaBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
13
14Note that the `AdaBound` class is based on the `AdaBoundType<`_`UpdateRule`_`>`
15class with _`UpdateRule`_` = AdaBoundUpdate`.
16
17#### Attributes
18
19| **type** | **name** | **description** | **default** |
20|----------|----------|-----------------|-------------|
21| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
22| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
23| `double` | **`finalLr`** | The final (SGD) learning rate. | `0.1` |
24| `double` | **`gamma`** | The convergence speed of the bound functions. | `0.001` |
25| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
26| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
27| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
28| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
29| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
30| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
31| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
32| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
33
34The attributes of the optimizer may also be modified via the member methods
35`FinalLr()`, `Gamma()`, `StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`,
36`Eps()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and
37`ExactObjective()`.
38
39#### Examples
40
41<details open>
42<summary>Click to collapse/expand example code.
43</summary>
44
45```c++
46SphereFunction f(2);
47arma::mat coordinates = f.GetInitialPoint();
48
49AdaBound optimizer(0.001, 2, 0.1, 1e-3, 0.9, 0.999, 1e-8, 500000, 1e-3);
50optimizer.Optimize(f, coordinates);
51```
52
53</details>
54
55#### See also:
56
57 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
58 * [SGD](#standard-sgd)
59 * [Adaptive Gradient Methods with Dynamic Bound of Learning Rate](https://arxiv.org/abs/1902.09843)
60 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
61 * [Differentiable separable functions](#differentiable-separable-functions)
62
63## AdaDelta
64
65*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
66
67AdaDelta is an extension of [AdaGrad](#adagrad) that adapts learning rates
68based on a moving window of gradient updates, instead of accumulating all past
69gradients. Instead of accumulating all past squared gradients, the sum of
70gradients is recursively defined as a decaying average of all past squared
71gradients.
72
73#### Constructors
74
75 * `AdaDelta()`
76 * `AdaDelta(`_`stepSize`_`)`
77 * `AdaDelta(`_`stepSize, batchSize`_`)`
78 * `AdaDelta(`_`stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle`_`)`
79 * `AdaDelta(`_`stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
80
81#### Attributes
82
83| **type** | **name** | **description** | **default** |
84|----------|----------|-----------------|-------------|
85| `double` | **`stepSize`** | Step size for each iteration. | `1.0` |
86| `size_t` | **`batchSize`**| Number of points to process in one step. | `32` |
87| `double` | **`rho`** | Smoothing constant. Corresponding to fraction of gradient to keep at each time step. | `0.95` |
88| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-6` |
89| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
90| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
91| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
92| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
93| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
94
95Attributes of the optimizer may also be changed via the member methods
96`StepSize()`, `BatchSize()`, `Rho()`, `Epsilon()`, `MaxIterations()`,
97`Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
98
99#### Examples:
100
101<details open>
102<summary>Click to collapse/expand example code.
103</summary>
104
105```c++
106AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);
107
108RosenbrockFunction f;
109arma::mat coordinates = f.GetInitialPoint();
110optimizer.Optimize(f, coordinates);
111```
112
113</details>
114
115#### See also:
116
117 * [Adadelta - an adaptive learning rate method](https://arxiv.org/abs/1212.5701)
118 * [AdaGrad](#adagrad)
119 * [Differentiable separable functions](#differentiable-separable-functions)
120
121## Adagrad
122
123*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
124
125AdaGrad is an optimizer with parameter-specific learning rates, which are
126adapted relative to how frequently a parameter gets updated during training.
127Larger updates for more sparse parameters and smaller updates for less sparse
128parameters.
129
130#### Constructors
131
132 - `AdaGrad()`
133 - `AdaGrad(`_`stepSize`_`)`
134 - `AdaGrad(`_`stepSize, batchSize`_`)`
135 - `AdaGrad(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle`_`)`
136 - `AdaGrad(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
137
138#### Attributes
139
140| **type** | **name** | **description** | **default** |
141|----------|----------|-----------------|-------------|
142| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
143| `size_t` | **`batchSize`** | Number of points to process in one step. | `32` |
144| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
145| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
146| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `tolerance` |
147| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
148| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
149| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
150
151Attributes of the optimizer may also be changed via the member methods
152`StepSize()`, `BatchSize()`, `Epsilon()`, `MaxIterations()`, `Tolerance()`,
153`Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
154
155#### Examples:
156
157<details open>
158<summary>Click to collapse/expand example code.
159</summary>
160
161```c++
162AdaGrad optimizer(1.0, 1, 1e-8, 1000, 1e-9, true);
163
164RosenbrockFunction f;
165arma::mat coordinates = f.GetInitialPoint();
166optimizer.Optimize(f, coordinates);
167```
168
169</details>
170
171#### See also:
172
173 * [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
174 * [AdaGrad in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad)
175 * [AdaDelta](#adadelta)
176 * [Differentiable separable functions](#differentiable-separable-functions)
177
178## Adam
179
180*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
181
182Adam is an algorithm for first-order gradient-based optimization of
183stochastic objective functions, based on adaptive estimates of lower-order
184moments.
185
186#### Constructors
187
188 * `Adam()`
189 * `Adam(`_`stepSize, batchSize`_`)`
190 * `Adam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
191 * `Adam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
192
193Note that the `Adam` class is based on the `AdamType<`_`UpdateRule`_`>` class
194with _`UpdateRule`_` = AdamUpdate`.
195
196#### Attributes
197
198| **type** | **name** | **description** | **default** |
199|----------|----------|-----------------|-------------|
200| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
201| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
202| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
203| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
204| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
205| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
206| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
207| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
208| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
209| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
210
211The attributes of the optimizer may also be modified via the member methods
212`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
213`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
214
215#### Examples
216
217<details open>
218<summary>Click to collapse/expand example code.
219</summary>
220
221```c++
222RosenbrockFunction f;
223arma::mat coordinates = f.GetInitialPoint();
224
225Adam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
226optimizer.Optimize(f, coordinates);
227```
228
229</details>
230
231#### See also:
232
233 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
234 * [SGD](#standard-sgd)
235 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
236 * [Differentiable separable functions](#differentiable-separable-functions)
237
238## AdaMax
239
240*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
241
242AdaMax is simply a variant of Adam based on the infinity norm.
243
244#### Constructors
245
246 * `AdaMax()`
247 * `AdaMax(`_`stepSize, batchSize`_`)`
248 * `AdaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
249 * `AdaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, exactObjective, resetPolicy`_`)`
250
251Note that the `AdaMax` class is based on the `AdamType<`_`UpdateRule`_`>` class
252with _`UpdateRule`_` = AdaMaxUpdate`.
253
254#### Attributes
255
256| **type** | **name** | **description** | **default** |
257|----------|----------|-----------------|-------------|
258| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
259| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
260| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
261| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
262| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
263| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
264| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
265| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
266| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
267| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
268
269The attributes of the optimizer may also be modified via the member methods
270`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
271`Tolerance()`, `Shuffle()`, `ExactObjective()`, and `ResetPolicy()`.
272
273#### Examples
274
275<details open>
276<summary>Click to collapse/expand example code.
277</summary>
278
279```c++
280RosenbrockFunction f;
281arma::mat coordinates = f.GetInitialPoint();
282
283AdaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
284optimizer.Optimize(f, coordinates);
285```
286
287</details>
288
289#### See also:
290
291 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
292 * [SGD](#standard-sgd)
293 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980) (see section 7)
294 * [Differentiable separable functions](#differentiable-separable-functions)
295
296## AMSBound
297
298*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
299
300AMSBound is a variant of Adam which employs dynamic bounds on learning rates.
301
302#### Constructors
303
304 * `AMSBound()`
305 * `AMSBound(`_`stepSize, batchSize`_`)`
306 * `AMSBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle`_`)`
307 * `AMSBound(`_`stepSize, batchSize, finalLr, gamma, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
308
309Note that the `AMSBound` class is based on the `AdaBoundType<`_`UpdateRule`_`>`
310class with _`UpdateRule`_` = AdaBoundUpdate`.
311
312#### Attributes
313
314| **type** | **name** | **description** | **default** |
315|----------|----------|-----------------|-------------|
316| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
317| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
318| `double` | **`finalLr`** | The final (SGD) learning rate. | `0.1` |
319| `double` | **`gamma`** | The convergence speed of the bound functions. | `0.001` |
320| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
321| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
322| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
323| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
324| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
325| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
326| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
327| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
328
329The attributes of the optimizer may also be modified via the member methods
330`FinalLr()`, `Gamma()`, `StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`,
331`Eps()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and
332`ExactObjective()`.
333
334#### Examples
335
336<details open>
337<summary>Click to collapse/expand example code.
338</summary>
339
340```c++
341SphereFunction f(2);
342arma::mat coordinates = f.GetInitialPoint();
343
344AMSBound optimizer(0.001, 2, 0.1, 1e-3, 0.9, 0.999, 1e-8, 500000, 1e-3);
345optimizer.Optimize(f, coordinates);
346```
347
348</details>
349
350#### See also:
351
352 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
353 * [SGD](#standard-sgd)
354 * [Adaptive Gradient Methods with Dynamic Bound of Learning Rate](https://arxiv.org/abs/1902.09843)
355 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
356 * [Differentiable separable functions](#differentiable-separable-functions)
357
358## AMSGrad
359
360*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
361
362AMSGrad is a variant of Adam with guaranteed convergence.
363
364#### Constructors
365
366 * `AMSGrad()`
367 * `AMSGrad(`_`stepSize, batchSize`_`)`
368 * `AMSGrad(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
369 * `AMSGrad(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, exactObjective, resetPolicy`_`)`
370
371Note that the `AMSGrad` class is based on the `AdamType<`_`UpdateRule`_`>` class
372with _`UpdateRule`_` = AMSGradUpdate`.
373
374#### Attributes
375
376| **type** | **name** | **description** | **default** |
377|----------|----------|-----------------|-------------|
378| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
379| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
380| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
381| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
382| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
383| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
384| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
385| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
386| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
387| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
388
389The attributes of the optimizer may also be modified via the member methods
390`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
391`Tolerance()`, `Shuffle()`, `ExactObjective()`, and `ResetPolicy()`.
392
393#### Examples
394
395<details open>
396<summary>Click to collapse/expand example code.
397</summary>
398
399```c++
400RosenbrockFunction f;
401arma::mat coordinates = f.GetInitialPoint();
402
403AMSGrad optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
404optimizer.Optimize(f, coordinates);
405```
406
407</details>
408
409#### See also:
410
411 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
412 * [SGD](#standard-sgd)
413 * [On the Convergence of Adam and Beyond](https://openreview.net/forum?id=ryQu7f-RZ)
414 * [Differentiable separable functions](#differentiable-separable-functions)
415
416## Augmented Lagrangian
417
418*An optimizer for [differentiable constrained functions](#constrained-functions).*
419
420The `AugLagrangian` class implements the Augmented Lagrangian method of
421optimization.  In this scheme, a penalty term is added to the Lagrangian.
422This method is also called the "method of multipliers".  Internally, the
423optimizer uses [L-BFGS](#l-bfgs).
424
425#### Constructors
426
427 * `AugLagrangian(`_`maxIterations, penaltyThresholdFactor, sigmaUpdateFactor`_`)`
428
429#### Attributes
430
431| **type** | **name** | **description** | **default** |
432|----------|----------|-----------------|-------------|
433| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
434| `double` | **`penaltyThresholdFactor`** | When penalty threshold is updated, set it to this multiplied by the penalty. | `0.25` |
435| `double` | **`sigmaUpdateFactor`** | When sigma is updated, multiply it by this. | `10.0` |
436| `L_BFGS&` | **`lbfgs`** | Internal l-bfgs optimizer. | `L_BFGS()` |
437
438The attributes of the optimizer may also be modified via the member methods
439`MaxIterations()`, `PenaltyThresholdFactor()`, `SigmaUpdateFactor()` and `LBFGS()`.
440
441<details open>
442<summary>Click to collapse/expand example code.
443</summary>
444
445```c++
446/**
447 * Optimize the function.  The value '1' is used for the initial value of each
448 * Lagrange multiplier.  To set the Lagrange multipliers yourself, use the
449 * other overload of Optimize().
450 *
451 * @tparam LagrangianFunctionType Function which can be optimized by this
452 *     class.
453 * @param function The function to optimize.
454 * @param coordinates Output matrix to store the optimized coordinates in.
455 */
456template<typename LagrangianFunctionType>
457bool Optimize(LagrangianFunctionType& function,
458              arma::mat& coordinates);
459
460/**
461 * Optimize the function, giving initial estimates for the Lagrange
462 * multipliers.  The vector of Lagrange multipliers will be modified to
463 * contain the Lagrange multipliers of the final solution (if one is found).
464 *
465 * @tparam LagrangianFunctionType Function which can be optimized by this
466 *      class.
467 * @param function The function to optimize.
468 * @param coordinates Output matrix to store the optimized coordinates in.
469 * @param initLambda Vector of initial Lagrange multipliers.  Should have
470 *     length equal to the number of constraints.
471 * @param initSigma Initial penalty parameter.
472 */
473template<typename LagrangianFunctionType>
474bool Optimize(LagrangianFunctionType& function,
475              arma::mat& coordinates,
476              const arma::vec& initLambda,
477              const double initSigma);
478```
479
480</details>
481
482#### Examples
483
484<details open>
485<summary>Click to collapse/expand example code.
486</summary>
487
488```c++
489GockenbachFunction f;
490arma::mat coordinates = f.GetInitialPoint();
491
492AugLagrangian optimizer;
493optimizer.Optimize(f, coords);
494```
495
496</details>
497
498#### See also:
499
500 * [Augmented Lagrangian method on Wikipedia](https://en.wikipedia.org/wiki/Augmented_Lagrangian_method)
501 * [L-BFGS](#l-bfgs)
502 * [Constrained functions](#constrained-functions)
503
504## Big Batch SGD
505
506*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
507
508Big-batch stochastic gradient descent adaptively grows the batch size over time
509to maintain a nearly constant signal-to-noise ratio in the gradient
510approximation, so the Big Batch SGD optimizer is able to adaptively adjust batch
511sizes without user oversight.
512
513#### Constructors
514
515 * `BigBatchSGD<`_`UpdatePolicy`_`>()`
516 * `BigBatchSGD<`_`UpdatePolicy`_`>(`_`stepSize`_`)`
517 * `BigBatchSGD<`_`UpdatePolicy`_`>(`_`stepSize, batchSize`_`)`
518 * `BigBatchSGD<`_`UpdatePolicy`_`>(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, exactObjective`_`)`
519
520The _`UpdatePolicy`_ template parameter refers to the way that a new step size
521is computed.  The `AdaptiveStepsize` and `BacktrackingLineSearch` classes are
522available for use; custom behavior can be achieved by implementing a class
523with the same method signatures.
524
525For convenience the following typedefs have been defined:
526
527 * `BBS_Armijo = BigBatchSGD<BacktrackingLineSearch>`
528 * `BBS_BB = BigBatchSGD<AdaptiveStepsize>`
529
530#### Attributes
531
532| **type** | **name** | **description** | **default** |
533|----------|----------|-----------------|-------------|
534| `size_t` | **`batchSize`** | Initial batch size. | `1000` |
535| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
536| `double` | **`batchDelta`** | Factor for the batch update step. | `0.1` |
537| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
538| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
539| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
540| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
541
542Attributes of the optimizer may also be changed via the member methods
543`BatchSize()`, `StepSize()`, `BatchDelta()`, `MaxIterations()`, `Tolerance()`,
544`Shuffle()`, and `ExactObjective()`.
545
546#### Examples:
547
548<details open>
549<summary>Click to collapse/expand example code.
550</summary>
551
552```c++
553RosenbrockFunction f;
554arma::mat coordinates = f.GetInitialPoint();
555
556// Big-Batch SGD with the adaptive stepsize policy.
557BBS_BB optimizer(10, 0.01, 0.1, 8000, 1e-4);
558optimizer.Optimize(f, coordinates);
559
560// Big-Batch SGD with backtracking line search.
561BBS_Armijo optimizer2(10, 0.01, 0.1, 8000, 1e-4);
562optimizer2.Optimize(f, coordinates);
563```
564
565</details>
566
567#### See also:
568
569 * [Big Batch SGD: Automated Inference using Adaptive Batch Sizes](https://arxiv.org/pdf/1610.05792.pdf)
570 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
571 * [SGD](#standard-sgd)
572
573## CMAES
574
575*An optimizer for [separable functions](#separable-functions).*
576
577CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a stochastic search
578algorithm. CMA-ES is a second order approach estimating a positive definite
579matrix within an iterative procedure using the covariance matrix.
580
581#### Constructors
582
583 * `CMAES<`_`SelectionPolicyType`_`>()`
584 * `CMAES<`_`SelectionPolicyType`_`>(`_`lambda, lowerBound, upperBound`_`)`
585 * `CMAES<`_`SelectionPolicyType`_`>(`_`lambda, lowerBound, upperBound, batchSize`_`)`
586 * `CMAES<`_`SelectionPolicyType`_`>(`_`lambda, lowerBound, upperBound, batchSize, maxIterations, tolerance, selectionPolicy`_`)`
587
588The _`SelectionPolicyType`_ template parameter refers to the strategy used to
589compute the (approximate) objective function.  The `FullSelection` and
590`RandomSelection` classes are available for use; custom behavior can be achieved
591by implementing a class with the same method signatures.
592
593For convenience the following types can be used:
594
595 * **`CMAES<>`** (equivalent to `CMAES<FullSelection>`): uses all separable functions to compute objective
596 * **`ApproxCMAES`** (equivalent to `CMAES<RandomSelection>`): uses a small amount of separable functions to compute approximate objective
597
598#### Attributes
599
600| **type** | **name** | **description** | **default** |
601|----------|----------|-----------------|-------------|
602| `size_t` | **`lambda`** | The population size (0 uses a default size). | `0` |
603| `double` | **`lowerBound`** | Lower bound of decision variables. | `-10.0` |
604| `double` | **`upperBound`** | Upper bound of decision variables. | `10.0` |
605| `size_t` | **`batchSize`** | Batch size to use for the objective calculation. | `32` |
606| `size_t` | **`maxIterations`** | Maximum number of iterations. | `1000` |
607| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
608| `SelectionPolicyType` | **`selectionPolicy`** | Instantiated selection policy used to calculate the objective. | `SelectionPolicyType()` |
609
610Attributes of the optimizer may also be changed via the member methods
611`Lambda()`, `LowerBound()`, `UpperBound()`, `BatchSize()`, `MaxIterations()`,
612`Tolerance()`, and `SelectionPolicy()`.
613
614The `selectionPolicy` attribute allows an instantiated `SelectionPolicyType` to
615be given.  The `FullSelection` policy has no need to be instantiated and thus
616the option is not relevant when the `CMAES<>` optimizer type is being used; the
617`RandomSelection` policy has the constructor `RandomSelection(`_`fraction`_`)`
618where _`fraction`_ specifies the percentage of separable functions to use to
619estimate the objective function.
620
621#### Examples:
622
623<details open>
624<summary>Click to collapse/expand example code.
625</summary>
626
627```c++
628RosenbrockFunction f;
629arma::mat coordinates = f.GetInitialPoint();
630
631// CMAES with the FullSelection policy.
632CMAES<> optimizer(0, -1, 1, 32, 200, 1e-4);
633optimizer.Optimize(f, coordinates);
634
635// CMAES with the RandomSelection policy.
636ApproxCMAES<> approxOptimizer(0, -1, 1. 32, 200, 1e-4);
637approxOptimizer.Optimize(f, coordinates);
638```
639
640</details>
641
642#### See also:
643
644 * [Completely Derandomized Self-Adaptation in Evolution Strategies](http://www.cmap.polytechnique.fr/~nikolaus.hansen/cmaartic.pdf)
645 * [CMA-ES in Wikipedia](https://en.wikipedia.org/wiki/CMA-ES)
646 * [Evolution strategy in Wikipedia](https://en.wikipedia.org/wiki/Evolution_strategy)
647
648## CNE
649
650*An optimizer for [arbitrary functions](#arbitrary-functions).*
651
652Conventional Neural Evolution is an optimizer that works like biological evolution which selects best candidates based on their fitness scores and creates new generation by mutation and crossover of population. The initial population is generated based on a random normal distribution centered at the given starting point.
653
654#### Constructors
655
656 * `CNE()`
657 * `CNE(`_`populationSize, maxGenerations`_`)`
658 * `CNE(`_`populationSize, maxGenerations, mutationProb, mutationSize`_`)`
659 * `CNE(`_`populationSize, maxGenerations, mutationProb, mutationSize, selectPercent, tolerance`_`)`
660
661#### Attributes
662
663| **type** | **name** | **description** | **default** |
664|----------|----------|-----------------|-------------|
665| `size_t` | **`populationSize`** | The number of candidates in the population. This should be at least 4 in size. | `500` |
666| `size_t` | **`maxGenerations`** | The maximum number of generations allowed for CNE. | `5000` |
667| `double` | **`mutationProb`** | Probability that a weight will get mutated. | `0.1` |
668| `double` | **`mutationSize`** | The range of mutation noise to be added. This range is between 0 and mutationSize. | `0.02` |
669| `double` | **`selectPercent`** | The percentage of candidates to select to become the the next generation. | `0.2` |
670| `double` | **`tolerance`** | The final value of the objective function for termination. If set to negative value, tolerance is not considered. | `1e-5` |
671
672Attributes of the optimizer may also be changed via the member methods
673`PopulationSize()`, `MaxGenerations()`, `MutationProb()`, `SelectPercent()`
674and `Tolerance()`.
675
676#### Examples:
677
678<details open>
679<summary>Click to collapse/expand example code.
680</summary>
681
682```c++
683RosenbrockFunction f;
684arma::mat coordinates = f.GetInitialPoint();
685
686CNE optimizer(200, 10000, 0.2, 0.2, 0.3, 1e-5);
687optimizer.Optimize(f, coordinates);
688```
689
690</details>
691
692#### See also:
693
694 * [Neuroevolution in Wikipedia](https://en.wikipedia.org/wiki/Neuroevolution)
695 * [Arbitrary functions](#arbitrary-functions)
696
697## DE
698
699*An optimizer for [arbitrary functions](#arbitrary-functions).*
700
701Differential Evolution is an evolutionary optimization algorithm which selects best candidates based on their fitness scores and creates new generation by mutation and crossover of population.
702
703#### Constructors
704
705* `DE()`
706* `DE(`_`populationSize, maxGenerations`_`)`
707* `DE(`_`populationSize, maxGenerations, crossoverRate`_`)`
708* `DE(`_`populationSize, maxGenerations, crossoverRate, differentialWeight`_`)`
709* `DE(`_`populationSize, maxGenerations, crossoverRate, differentialWeight, tolerance`_`)`
710
711#### Attributes
712
713| **type** | **name** | **description** | **default** |
714|----------|----------|-----------------|-------------|
715| `size_t` | **`populationSize`** | The number of candidates in the population. This should be at least 3 in size. | `100` |
716| `size_t` | **`maxGenerations`** | The maximum number of generations allowed for DE. | `2000` |
717| `double` | **`crossoverRate`** | Probability that a candidate will undergo crossover. | `0.6` |
718| `double` | **`differentialWeight`** | Amplification factor for differentiation. | `0.8` |
719| `double` | **`tolerance`** | The final value of the objective function for termination. If set to negative value, tolerance is not considered. | `1e-5` |
720
721Attributes of the optimizer may also be changed via the member methods
722`PopulationSize()`, `MaxGenerations()`, `CrossoverRate()`, `DifferentialWeight()`
723and `Tolerance()`.
724
725#### Examples:
726
727<details open>
728<summary>Click to collapse/expand example code.
729</summary>
730
731```c++
732RosenbrockFunction f;
733arma::mat coordinates = f.GetInitialPoint();
734
735DE optimizer(200, 1000, 0.6, 0.8, 1e-5);
736optimizer.Optimize(f, coordinates);
737```
738
739</details>
740
741#### See also:
742
743 * [Differential Evolution - A simple and efficient adaptive scheme for global optimization over continuous spaces](http://www1.icsi.berkeley.edu/~storn/TR-95-012.pdf)
744 * [Differential Evolution in Wikipedia](https://en.wikipedia.org/wiki/Differential_Evolution)
745 * [Arbitrary functions](#arbitrary-functions)
746
747## Eve
748
749*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
750
751Eve is a stochastic gradient based optimization method with locally and globally adaptive learning rates.
752
753#### Constructors
754
755 * `Eve()`
756 * `Eve(`_`stepSize, batchSize`_`)`
757 * `Eve(`_`stepSize, batchSize, beta1, beta2, beta3, epsilon, clip, maxIterations, tolerance, shuffle`_`)`
758
759#### Attributes
760
761| **type** | **name** | **description** | **default** |
762|----------|----------|-----------------|-------------|
763| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
764| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
765| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
766| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
767| `double` | **`beta3`** | Exponential decay rate for relative change. | `0.999` |
768| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
769| `double` | **`clip`** | Clipping range to avoid extreme values. | `10` |
770| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
771| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
772| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
773| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
774
775The attributes of the optimizer may also be modified via the member methods
776`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Beta3()`, `Epsilon()`, `Clip()`, `MaxIterations()`,
777`Tolerance()`, `Shuffle()`, and `ExactObjective()`.
778
779#### Examples
780
781<details open>
782<summary>Click to collapse/expand example code.
783</summary>
784
785```c++
786RosenbrockFunction f;
787arma::mat coordinates = f.GetInitialPoint();
788
789Eve optimizer(0.001, 32, 0.9, 0.999, 0.999, 10, 1e-8, 100000, 1e-5, true);
790optimizer.Optimize(f, coordinates);
791```
792
793</details>
794
795#### See also:
796
797 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
798 * [SGD](#standard-sgd)
799 * [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://arxiv.org/pdf/1611.01505.pdf)
800 * [Differentiable separable functions](#differentiable-separable-functions)
801
802## Frank-Wolfe
803
804*An optimizer for [differentiable functions](#differentiable-functions) that may also be constrained.*
805
806Frank-Wolfe is a technique to minimize a continuously differentiable convex function f over a compact convex subset D of a vector space. It is also known as conditional gradient method.
807
808#### Constructors
809
810 * `FrankWolfe<`_`LinearConstrSolverType, UpdateRuleType`_`>(`_`linearConstrSolver, updateRule`_`)`
811 * `FrankWolfe<`_`LinearConstrSolverType, UpdateRuleType`_`>(`_`linearConstrSolver, updateRule, maxIterations, tolerance`_`)`
812
813The _`LinearConstrSolverType`_ template parameter specifies the constraint
814domain D for the problem.  The `ConstrLpBallSolver` and
815`ConstrStructGroupSolver<GroupLpBall>` classes are available for use; the former
816restricts D to the unit ball of the specified l-p norm.  Other constraint types
817may be implemented as a class with the same method signatures as either of the
818existing classes.
819
820The _`UpdateRuleType`_ template parameter specifies the update rule used by the
821optimizer.  The `UpdateClassic` and `UpdateLineSearch` classes are available for
822use and represent a simple update step rule and a line search based update rule,
823respectively.  The `UpdateSpan` and `UpdateFulLCorrection` classes are also
824available and may be used with the `FuncSq` function class (which is a squared
825matrix loss).
826
827For convenience the following typedefs have been defined:
828
829 * `OMP` (equivalent to `FrankWolfe<ConstrLpBallSolver, UpdateSpan>`): a solver for the orthogonal matching pursuit problem
830 * `StandardFrankWolfe` (equivalent to `FrankWolfe<ConstrLpBallSolver, ClassicUpdate>`): the standard Frank-Wolfe algorithm with the solution restricted to lie within the unit ball
831
832#### Attributes
833
834| **type** | **name** | **description** | **default** |
835|----------|----------|-----------------|-------------|
836| `LinearConstrSolverType` | **`linearConstrSolver`** | Solver for linear constrained problem. | **n/a** |
837| `UpdateRuleType` | **`updateRule`** | Rule for updating solution in each iteration. | **n/a** |
838| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
839| `size_t` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-10` |
840
841Attributes of the optimizer may also be changed via the member methods
842`LinearConstrSolver()`, `UpdateRule()`, `MaxIterations()`, and `Tolerance()`.
843
844#### Examples:
845
846TODO
847
848#### See also:
849
850 * [An algorithm for quadratic programming](https://pdfs.semanticscholar.org/3a24/54478a94f1e66a3fc5d209e69217087acbc0.pdf)
851 * [Frank-Wolfe in Wikipedia](https://en.wikipedia.org/wiki/Frank%E2%80%93Wolfe_algorithm)
852 * [Differentiable functions](#differentiable-functions)
853
854## FTML (Follow the Moving Leader)
855
856*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
857
858Follow the Moving Leader (FTML) is an optimizer where recent samples are
859weighted more heavily in each iteration, so FTML can adapt more quickly to
860changes.
861
862#### Constructors
863
864 * `FTML()`
865 * `FTML(`_`stepSize, batchSize`_`)`
866 * `FTML(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle`_`)`
867 * `FTML(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
868
869#### Attributes
870
871| **type** | **name** | **description** | **default** |
872|----------|----------|-----------------|-------------|
873| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
874| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
875| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
876| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
877| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
878| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
879| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
880| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
881| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
882| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
883
884The attributes of the optimizer may also be modified via the member methods
885`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Epsilon()`, `MaxIterations()`,
886`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
887
888#### Examples
889
890<details open>
891<summary>Click to collapse/expand example code.
892</summary>
893
894```c++
895RosenbrockFunction f;
896arma::mat coordinates = f.GetInitialPoint();
897
898FTML optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
899optimizer.Optimize(f, coordinates);
900```
901
902</details>
903
904#### See also:
905 * [Follow the Moving Leader in Deep Learning](http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf)
906 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
907 * [SGD](#standard-sgd)
908 * [Differentiable separable functions](#differentiable-separable-functions)
909
910## Gradient Descent
911
912*An optimizer for [differentiable functions](#differentiable-functions).*
913
914Gradient Descent is a technique to minimize a function. To find a local minimum
915of a function using gradient descent, one takes steps proportional to the
916negative of the gradient of the function at the current point.
917
918#### Constructors
919
920 * `GradientDescent()`
921 * `GradientDescent(`_`stepSize`_`)`
922 * `GradientDescent(`_`stepSize, maxIterations, tolerance`_`)`
923
924#### Attributes
925
926| **type** | **name** | **description** | **default** |
927|----------|----------|-----------------|-------------|
928| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
929| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
930| `size_t` | **`tolerance`**  | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
931
932Attributes of the optimizer may also be changed via the member methods
933`StepSize()`, `MaxIterations()`, and `Tolerance()`.
934
935#### Examples:
936
937<details open>
938<summary>Click to collapse/expand example code.
939</summary>
940
941```c++
942RosenbrockFunction f;
943arma::mat coordinates = f.GetInitialPoint();
944
945GradientDescent optimizer(0.001, 0, 1e-15);
946optimizer.Optimize(f, coordinates);
947```
948
949</details>
950
951#### See also:
952
953 * [Gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Gradient_descent)
954 * [Differentiable functions](#differentiable-functions)
955
956## Grid Search
957
958*An optimizer for [categorical functions](#categorical-functions).*
959
960An optimizer that finds the minimum of a given function by iterating through
961points on a multidimensional grid.
962
963#### Constructors
964
965 * `GridSearch()`
966
967#### Attributes
968
969The `GridSearch` class has no configurable attributes.
970
971**Note**: the `GridSearch` class can only optimize categorical functions where
972*every* parameter is categorical.
973
974#### See also:
975
976 * [Categorical functions](#categorical-functions) (includes an example for `GridSearch`)
977 * [Grid search on Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search)
978
979## Hogwild! (Parallel SGD)
980
981*An optimizer for [sparse differentiable separable functions](#differentiable-separable-functions).*
982
983An implementation of parallel stochastic gradient descent using the lock-free
984HOGWILD! approach.  This implementation requires OpenMP to be enabled during
985compilation (i.e., `-fopenmp` specified as a compiler flag).
986
987Note that the requirements for Hogwild! are slightly different than for most
988[differentiable separable functions](#differentiable-separable-functions) but it
989is often possible to use Hogwild! by implementing `Gradient()` with a template
990parameter.  See the [sparse differentiable separable
991functions](#sparse-differentiable-separable-functions) documentation for more
992details.
993
994#### Constructors
995
996 * `ParallelSGD<`_`DecayPolicyType`_`>(`_`maxIterations, threadShareSize`_`)`
997 * `ParallelSGD<`_`DecayPolicyType`_`>(`_`maxIterations, threadShareSize, tolerance, shuffle, decayPolicy`_`)`
998
999The _`DecayPolicyType`_ template parameter specifies the policy used to update
1000the step size after each iteration.  The `ConstantStep` class is available for
1001use.  Custom behavior can be achieved by implementing a class with the same
1002method signatures.
1003
1004The default type for _`DecayPolicyType`_ is `ConstantStep`, so the shorter type
1005`ParallelSGD<>` can be used instead of the equivalent
1006`ParallelSGD<ConstantStep>`.
1007
1008#### Attributes
1009
1010| **type** | **name** | **description** | **default** |
1011|----------|----------|-----------------|-------------|
1012| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | **n/a** |
1013| `size_t` | **`threadShareSize`** | Number of datapoints to be processed in one iteration by each thread. | **n/a** |
1014| `double` | **`tolerance`** | Maximum absolute tolerance to terminate the algorithm. | `1e-5` |
1015| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1016| `DecayPolicyType` | **`decayPolicy`** | An instantiated step size update policy to use. | `DecayPolicyType()` |
1017
1018Attributes of the optimizer may also be modified via the member methods
1019`MaxIterations()`, `ThreadShareSize()`, `Tolerance()`, `Shuffle()`, and
1020`DecayPolicy()`.
1021
1022Note that the default value for `decayPolicy` is the default constructor for the
1023`DecayPolicyType`.
1024
1025#### Examples
1026
1027<details open>
1028<summary>Click to collapse/expand example code.
1029</summary>
1030
1031```c++
1032GeneralizedRosenbrockFunction f(50); // 50-dimensional Rosenbrock function.
1033arma::mat coordinates = f.GetInitialPoint();
1034
1035ParallelSGD<> optimizer(100000, f.NumFunctions(), 1e-5, true);
1036optimizer.Optimize(f, coordinates);
1037```
1038
1039</details>
1040
1041#### See also:
1042
1043 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1044 * [SGD](#standard-sgd)
1045 * [HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](https://arxiv.org/abs/1106.5730)
1046 * [Sparse differentiable separable functions](#sparse-differentiable-separable-functions)
1047
1048## IQN
1049
1050*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1051
1052The Incremental Quasi-Newton belongs to the family of stochastic and incremental
1053methods that have a cost per iteration independent of n. IQN iterations are a
1054stochastic version of BFGS iterations that use memory to reduce the variance of
1055stochastic approximations.
1056
1057#### Constructors
1058
1059 * `IQN()`
1060 * `IQN(`_`stepSize`_`)`
1061 * `IQN(`_`stepSize, batchSize, maxIterations, tolerance`_`)`
1062
1063#### Attributes
1064
1065| **type** | **name** | **description** | **default** |
1066|----------|----------|-----------------|-------------|
1067| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
1068| `size_t` | **`batchSize`** | Size of each batch. | `10` |
1069| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1070| `size_t` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1071
1072Attributes of the optimizer may also be changed via the member methods
1073`StepSize()`, `BatchSize()`, `MaxIterations()`, and `Tolerance()`.
1074
1075#### Examples:
1076
1077<details open>
1078<summary>Click to collapse/expand example code.
1079</summary>
1080
1081```c++
1082RosenbrockFunction f;
1083arma::mat coordinates = f.GetInitialPoint();
1084
1085IQN optimizer(0.01, 1, 5000, 1e-5);
1086optimizer.Optimize(f, coordinates);
1087```
1088
1089</details>
1090
1091#### See also:
1092
1093 * [IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate](https://arxiv.org/abs/1702.00709)
1094 * [A Stochastic Quasi-Newton Method for Large-Scale Optimization](https://arxiv.org/abs/1401.7020)
1095 * [Differentiable functions](#differentiable-functions)
1096
1097## Katyusha
1098
1099*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1100
1101Katyusha is a direct, primal-only stochastic gradient method which uses a
1102"negative momentum" on top of Nesterov's momentum.  Two types are
1103available---one that uses a proximal update step, and one that uses the standard
1104update step.
1105
1106#### Constructors
1107
1108 * `KatyushaType<`_`proximal`_`>()`
1109 * `KatyushaType<`_`proximal`_`>(`_`convexity, lipschitz`_`)`
1110 * `KatyushaType<`_`proximal`_`>(`_`convexity, lipschitz, batchSize`_`)`
1111 * `KatyushaType<`_`proximal`_`>(`_`convexity, lipschitz, batchSize, maxIterations, innerIterations, tolerance, shuffle, exactObjective`_`)`
1112
1113The _`proximal`_ template parameter is a boolean value (`true` or `false`) that
1114specifies whether or not the proximal update should be used.
1115
1116For convenience the following typedefs have been defined:
1117
1118 * `Katyusha` (equivalent to `KatyushaType<false>`): Katyusha with the standard update step
1119 * `KatyushaProximal` (equivalent to `KatyushaType<true>`): Katyusha with the proximal update step
1120
1121#### Attributes
1122
1123| **type** | **name** | **description** | **default** |
1124|----------|----------|-----------------|-------------|
1125| `double` | **`convexity`** | The regularization parameter. | `1.0` |
1126| `double` | **`lipschitz`** | The Lipschitz constant. | `10.0` |
1127| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
1128| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
1129| `size_t` | **`innerIterations`** | The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. | `0` |
1130| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1131| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1132| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
1133
1134
1135Attributes of the optimizer may also be changed via the member methods
1136`Convexity()`, `Lipschitz()`, `BatchSize()`, `MaxIterations()`,
1137`InnerIterations()`, `Tolerance()`, `Shuffle()`, and `ExactObjective()`.
1138
1139#### Examples:
1140
1141<details open>
1142<summary>Click to collapse/expand example code.
1143</summary>
1144
1145```c++
1146RosenbrockFunction f;
1147arma::mat coordinates = f.GetInitialPoint();
1148
1149// Without proximal update.
1150Katyusha optimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
1151optimizer.Optimize(f, coordinates);
1152
1153// With proximal update.
1154KatyushaProximal proximalOptimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
1155proximalOptimizer.Optimize(f, coordinates);
1156```
1157
1158</details>
1159
1160#### See also:
1161
1162 * [Katyusha: The First Direct Acceleration of Stochastic Gradient Methods](https://arxiv.org/abs/1603.05953)
1163 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1164 * [Differentiable separable functions](#differentiable-separable-functions)
1165
1166## L-BFGS
1167
1168*An optimizer for [differentiable functions](#differentiable-functions)*
1169
1170L-BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of computer memory.
1171
1172#### Constructors
1173
1174 * `L_BFGS()`
1175 * `L_BFGS(`_`numBasis, maxIterations`_`)`
1176 * `L_BFGS(`_`numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials`_`)`
1177 * `L_BFGS(`_`numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials, minStep, maxStep`_`)`
1178
1179#### Attributes
1180
1181| **type** | **name** | **description** | **default** |
1182|----------|----------|-----------------|-------------|
1183| `size_t` | **`numBasis`** | Number of memory points to be stored (default 10). | `10` |
1184| `size_t` | **`maxIterations`** | Maximum number of iterations for the optimization (0 means no limit and may run indefinitely). | `10000` |
1185| `double` | **`armijoConstant`** | Controls the accuracy of the line search routine for determining the Armijo condition. | `1e-4` |
1186| `double` | **`wolfe`** | Parameter for detecting the Wolfe condition. | `0.9` |
1187| `double` | **`minGradientNorm`** | Minimum gradient norm required to continue the optimization. | `1e-6` |
1188| `double` | **`factr`** | Minimum relative function value decrease to continue the optimization. | `1e-15` |
1189| `size_t` | **`maxLineSearchTrials`** | The maximum number of trials for the line search (before giving up). | `50` |
1190| `double` | **`minStep`** | The minimum step of the line search. | `1e-20` |
1191| `double` | **`maxStep`** | The maximum step of the line search. | `1e20` |
1192
1193Attributes of the optimizer may also be changed via the member methods
1194`NumBasis()`, `MaxIterations()`, `ArmijoConstant()`, `Wolfe()`,
1195`MinGradientNorm()`, `Factr()`, `MaxLineSearchTrials()`, `MinStep()`, and
1196`MaxStep()`.
1197
1198#### Examples:
1199
1200<details open>
1201<summary>Click to collapse/expand example code.
1202</summary>
1203
1204```c++
1205RosenbrockFunction f;
1206arma::mat coordinates = f.GetInitialPoint();
1207
1208L_BFGS optimizer(20);
1209optimizer.Optimize(f, coordinates);
1210```
1211
1212</details>
1213
1214#### See also:
1215
1216 * [The solution of non linear finite element equations](https://onlinelibrary.wiley.com/doi/full/10.1002/nme.1620141104)
1217 * [Updating Quasi-Newton Matrices with Limited Storage](https://www.jstor.org/stable/2006193)
1218 * [Limited-memory BFGS in Wikipedia](https://en.wikipedia.org/wiki/Limited-memory_BFGS)
1219 * [Differentiable functions](#differentiable-functions)
1220
1221## Lookahead
1222
1223*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1224
1225Lookahead is a stochastic gradient based optimization method which chooses a
1226search direction by looking ahead at the sequence of "fast weights" generated
1227by another optimizer.
1228
1229#### Constructors
1230 * `Lookahead<>()`
1231 * `Lookahead<>(`_`stepSize`_`)`
1232 * `Lookahead<>(`_`stepSize, k`_`)`
1233 * `Lookahead<>(`_`stepSize, k, maxIterations, tolerance, decayPolicy, exactObjective`_`)`
1234 * `Lookahead<>(`_`baseOptimizer, stepSize, k, maxIterations, tolerance, decayPolicy, exactObjective`_`)`
1235
1236Note that `Lookahead<>` is based on the templated type
1237`LookaheadType<`_`BaseOptimizerType, DecayPolicyType`_`>` with _`BaseOptimizerType`_` = Adam` and _`DecayPolicyType`_` = NoDecay`.
1238
1239Any optimizer that implements the differentiable separable functions interface
1240can be paired with the `Lookahead` optimizer.
1241
1242#### Attributes
1243
1244| **type** | **name** | **description** | **default** |
1245|----------|----------|-----------------|-------------|
1246| `BaseOptimizerType` | **`baseOptimizer`** |  Optimizer for the forward step. | Adam |
1247| `double` | **`stepSize`** | Step size for each iteration. | `0.5` |
1248| `size_t` | **`k`** | The synchronization period. | `5` |
1249| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1250| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1251| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
1252| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
1253
1254The attributes of the optimizer may also be modified via the member methods
1255`BaseOptimizer()`, `StepSize()`, `K()`, `MaxIterations()`,
1256`Tolerance()`, `DecayPolicy()` and `ExactObjective()`.
1257
1258#### Examples
1259
1260<details open>
1261<summary>Click to collapse/expand example code.
1262</summary>
1263
1264```c++
1265RosenbrockFunction f;
1266arma::mat coordinates = f.GetInitialPoint();
1267
1268Lookahead<> optimizer(0.5, 5, 100000, 1e-5);
1269optimizer.Optimize(f, coordinates);
1270```
1271
1272</details>
1273
1274#### See also:
1275
1276 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1277 * [SGD](#standard-sgd)
1278 * [Lookahead Optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610)
1279 * [Differentiable separable functions](#differentiable-separable-functions)
1280
1281## LRSDP (low-rank SDP solver)
1282
1283*An optimizer for [semidefinite programs](#semidefinite-programs).*
1284
1285LRSDP is the implementation of Monteiro and Burer's formulation of low-rank
1286semidefinite programs (LR-SDP).  This solver uses the augmented Lagrangian
1287optimizer to solve low-rank semidefinite programs.
1288
1289The assumption here is that the solution matrix for the SDP is low-rank.  If
1290this assumption is not true, the algorithm should not be expected to converge.
1291
1292#### Constructors
1293
1294 * `LRSDP<`_`SDPType`_`>()`
1295
1296The _`SDPType`_ template parameter specifies the type of SDP to solve.  The
1297`SDP<arma::mat>` and `SDP<arma::sp_mat>` classes are available for use; these
1298represent SDPs with dense and sparse `C` matrices, respectively.  The `SDP<>`
1299class is detailed in the [semidefinite program
1300documentation](#semidefinite-programs).
1301
1302Once the `LRSDP<>` object is constructed, the SDP may be specified by calling
1303the `SDP()` member method, which returns a reference to the _`SDPType`_.
1304
1305#### Attributes
1306
1307The attributes of the LRSDP optimizer may only be accessed via member methods.
1308
1309| **type** | **method name** | **description** | **default** |
1310|----------|----------|-----------------|-------------|
1311| `size_t` | **`MaxIterations()`** | Maximum number of iterations before termination. | `1000` |
1312| `AugLagrangian` | **`AugLag()`** | The internally-held Augmented Lagrangian optimizer. | **n/a** |
1313
1314#### See also:
1315
1316 * [A Nonlinear Programming Algorithm for Solving Semidefinite Programs via Low-rank Factorization](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.682.1520&rep=rep1&type=pdf)
1317 * [Semidefinite programming on Wikipedia](https://en.wikipedia.org/wiki/Semidefinite_programming)
1318 * [Semidefinite programs](#semidefinite-programs) (includes example usage of `PrimalDualSolver`)
1319
1320## Momentum SGD
1321
1322*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1323
1324Stochastic Gradient Descent is a technique for minimizing a function which
1325can be expressed as a sum of other functions.  This is an SGD variant that uses
1326momentum for its updates.  Using momentum updates for parameter learning can
1327accelerate the rate of convergence, specifically in the cases where the surface
1328curves much more steeply (a steep hilly terrain with high curvature).
1329
1330#### Constructors
1331
1332 * `MomentumSGD()`
1333 * `MomentumSGD(`_`stepSize, batchSize`_`)`
1334 * `MomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle`_`)`
1335 * `MomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy, decayPolicy, resetPolicy, exactObjective`_`)`
1336
1337Note that `MomentumSGD` is based on the templated type
1338`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
1339MomentumUpdate` and _`DecayPolicyType`_` = NoDecay`.
1340
1341#### Attributes
1342
1343| **type** | **name** | **description** | **default** |
1344|----------|----------|-----------------|-------------|
1345| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
1346| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
1347| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1348| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1349| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1350| `MomentumUpdate` | **`updatePolicy`** | An instantiated `MomentumUpdate`. | `MomentumUpdate()` |
1351| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
1352| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
1353| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
1354
1355Attributes of the optimizer may also be modified via the member methods
1356`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and
1357`ExactObjective()`.
1358
1359Note that the `MomentumUpdate` class has the constructor
1360`MomentumUpdate(`_`momentum`_`)` with a default value of `0.5` for the momentum.
1361
1362#### Examples
1363
1364<details open>
1365<summary>Click to collapse/expand example code.
1366</summary>
1367
1368```c++
1369RosenbrockFunction f;
1370arma::mat coordinates = f.GetInitialPoint();
1371
1372MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));
1373optimizer.Optimize(f, coordinates);
1374```
1375
1376</details>
1377
1378#### See also:
1379
1380 * [Standard SGD](#standard-sgd)
1381 * [Nesterov Momentum SGD](#nesterov-momentum-sgd)
1382 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1383 * [Differentiable separable functions](#differentiable-separable-functions)
1384
1385## Nadam
1386
1387*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1388
1389Nadam is a variant of Adam based on NAG (Nesterov accelerated gradient).  It
1390uses Nesterov momentum for faster convergence.
1391
1392#### Constructors
1393
1394 * `Nadam()`
1395 * `Nadam(`_`stepSize, batchSize`_`)`
1396 * `Nadam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
1397 * `Nadam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy`_`)`
1398
1399Note that the `Nadam` class is based on the `AdamType<`_`UpdateRule`_`>` class
1400with _`UpdateRule`_` = NadamUpdate`.
1401
1402#### Attributes
1403
1404| **type** | **name** | **description** | **default** |
1405|----------|----------|-----------------|-------------|
1406| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
1407| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
1408| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
1409| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
1410| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
1411| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1412| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1413| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1414| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
1415
1416The attributes of the optimizer may also be modified via the member methods
1417`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
1418`Tolerance()`, `Shuffle()`, and `ResetPolicy()`.
1419
1420#### Examples
1421
1422<details open>
1423<summary>Click to collapse/expand example code.
1424</summary>
1425
1426```c++
1427RosenbrockFunction f;
1428arma::mat coordinates = f.GetInitialPoint();
1429
1430Nadam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
1431optimizer.Optimize(f, coordinates);
1432```
1433
1434</details>
1435
1436#### See also:
1437
1438 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1439 * [SGD](#standard-sgd)
1440 * [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
1441 * [Differentiable separable functions](#differentiable-separable-functions)
1442
1443## NadaMax
1444
1445*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1446
1447NadaMax is a variant of AdaMax based on NAG (Nesterov accelerated gradient).  It
1448uses Nesterov momentum for faster convergence.
1449
1450#### Constructors
1451
1452 * `NadaMax()`
1453 * `NadaMax(`_`stepSize, batchSize`_`)`
1454 * `NadaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
1455 * `NadaMax(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy`_`)`
1456
1457Note that the `NadaMax` class is based on the `AdamType<`_`UpdateRule`_`>` class
1458with _`UpdateRule`_` = NadaMaxUpdate`.
1459
1460#### Attributes
1461
1462| **type** | **name** | **description** | **default** |
1463|----------|----------|-----------------|-------------|
1464| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
1465| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
1466| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
1467| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
1468| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
1469| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1470| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1471| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1472| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
1473
1474The attributes of the optimizer may also be modified via the member methods
1475`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
1476`Tolerance()`, `Shuffle()`, and `ResetPolicy()`.
1477
1478#### Examples
1479
1480<details open>
1481<summary>Click to collapse/expand example code.
1482</summary>
1483
1484```c++
1485RosenbrockFunction f;
1486arma::mat coordinates = f.GetInitialPoint();
1487
1488NadaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
1489optimizer.Optimize(f, coordinates);
1490```
1491
1492</details>
1493
1494#### See also:
1495
1496 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1497 * [SGD](#standard-sgd)
1498 * [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
1499 * [Differentiable separable functions](#differentiable-separable-functions)
1500
1501## Nesterov Momentum SGD
1502
1503*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1504
1505Stochastic Gradient Descent is a technique for minimizing a function which
1506can be expressed as a sum of other functions.  This is an SGD variant that uses
1507Nesterov momentum for its updates.  Nesterov Momentum application can accelerate
1508the rate of convergence to O(1/k^2).
1509
1510#### Constructors
1511
1512 * `NesterovMomentumSGD()`
1513 * `NesterovMomentumSGD(`_`stepSize, batchSize`_`)`
1514 * `NesterovMomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle`_`)`
1515 * `NesterovMomentumSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy, decayPolicy, resetPolicy, exactObjective`_`)`
1516
1517Note that `MomentumSGD` is based on the templated type
1518`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
1519NesterovMomentumUpdate` and _`DecayPolicyType`_` = NoDecay`.
1520
1521#### Attributes
1522
1523| **type** | **name** | **description** | **default** |
1524|----------|----------|-----------------|-------------|
1525| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
1526| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
1527| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1528| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1529| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1530| `NesterovMomentumUpdate` | **`updatePolicy`** | An instantiated `MomentumUpdate`. | `NesterovMomentumUpdate()` |
1531| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
1532| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
1533| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
1534
1535Attributes of the optimizer may also be modified via the member methods
1536`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and
1537`ExactObjective()`.
1538
1539Note that the `NesterovMomentumUpdate` class has the constructor
1540`MomentumUpdate(`_`momentum`_`)` with a default value of `0.5` for the momentum.
1541
1542#### Examples
1543
1544<details open>
1545<summary>Click to collapse/expand example code.
1546</summary>
1547
1548```c++
1549RosenbrockFunction f;
1550arma::mat coordinates = f.GetInitialPoint();
1551
1552NesterovMomentumSGD optimizer(0.01, 32, 100000, 1e-5, true,
1553    MomentumUpdate(0.5));
1554optimizer.Optimize(f, coordinates);
1555```
1556
1557</details>
1558
1559#### See also:
1560
1561 * [Standard SGD](#standard-sgd)
1562 * [Momentum SGD](#momentum-sgd)
1563 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1564 * [Differentiable separable functions](#differentiable-separable-functions)
1565
1566## MOEA/D-DE
1567*An optimizer for arbitrary multi-objective functions.*
1568MOEA/D-DE (Multi Objective Evolutionary Algorithm based on Decomposition - Differential Evolution) is a multi
1569objective optimization algorithm. It works by decomposing the problem into a number of scalar optimization
1570subproblems which are solved simultaneously per generation. MOEA/D in itself is a framework, this particular
1571algorithm uses Differential Crossover followed by Polynomial Mutation to create offsprings which are then
1572decomposed to form a Single Objective Problem. A diversity preserving mechanism is also employed which encourages
1573a varied set of solution.
1574
1575#### Constructors
1576* `MOEAD<`_`InitPolicyType, DecompPolicyType`_`>()`
1577* `MOEAD<`_`InitPolicyType, DecompPolicyType`_`>(`_`populationSize, maxGenerations, crossoverProb,  neighborProb, neighborSize, distributionIndex, differentialWeight, maxReplace, epsilon, lowerBound, upperBound`_`)`
1578
1579The _`InitPolicyType`_ template parameter refers to the strategy used to
1580initialize the reference directions.
1581
1582The following types are available:
1583
1584 * **`Uniform`**
1585 * **`BayesianBootstrap`**
1586 * **`Dirichlet`**
1587
1588The _`DecompPolicyType`_ template parameter refers to the strategy used to
1589decompose the weight vectors to form a scalar objective function.
1590
1591The following types are available:
1592
1593 * **`Tchebycheff`**
1594 * **`WeightedAverage`**
1595 * **`PenaltyBoundaryIntersection`**
1596
1597For convenience the following types can be used:
1598
1599 * **`DefaultMOEAD`** (equivalent to `MOEAD<Uniform, Tchebycheff>`): utilizes Uniform method for weight initialization
1600 and Tchebycheff for weight decomposition.
1601
1602 * **`BBSMOEAD`** (equivalent to `MOEAD<BayesianBootstrap, Tchebycheff>`): utilizes Bayesian Bootstrap method for weight initialization and Tchebycheff for weight decomposition.
1603
1604 * **`DirichletMOEAD`** (equivalent to `MOEAD<Dirichlet, Tchebycheff>`): utilizes Dirichlet sampling for weight init
1605 and Tchebycheff for weight decomposition.
1606
1607#### Attributes
1608
1609| **type** | **name** | **description** | **default** |
1610|----------|----------|-----------------|-------------|
1611| `size_t` | **`populationSize`** | The number of candidates in the population. | `150` |
1612| `size_t` | **`maxGenerations`** | The maximum number of generations allowed. | `300` |
1613| `double` | **`crossoverProb`** | Probability that a crossover will occur. | `1.0` |
1614| `double` | **`neighborProb`** | The probability of sampling from neighbor. | `0.9` |
1615| `size_t` | **`neighborSize`** | The number of nearest-neighbours to consider per weight vector.  | `20` |
1616| `double` | **`distributionIndex`** | The crowding degree of the mutation. | `20` |
1617| `double` | **`differentialWeight`** | Amplification factor of the differentiation. | `0.5` |
1618| `size_t` | **`maxReplace`** | The limit of solutions allowed to be replaced by a child. | `2`|
1619| `double` | **`epsilon`** | Handles numerical stability after weight initialization. | `1E-10`|
1620| `double`, `arma::vec` | **`lowerBound`** | Lower bound of the coordinates on the coordinates of the whole population during the search process. | `0` |
1621| `double`, `arma::vec` | **`upperBound`** | Lower bound of the coordinates on the coordinates of the whole population during the search process. | `1` |
1622| `InitPolicyType` | **`initPolicy`** | Instantiated init policy used to initialize weights. | `InitPolicyType()` |
1623| `DecompPolicyType` | **`decompPolicy`** | Instantiated decomposition policy used to create scalar objective problem. | `DecompPolicyType()` |
1624
1625Attributes of the optimizer may also be changed via the member methods
1626`PopulationSize()`, `MaxGenerations()`, `CrossoverRate()`, `NeighborProb()`, `NeighborSize()`, `DistributionIndex()`,
1627`DifferentialWeight()`, `MaxReplace()`, `Epsilon()`, `LowerBound()`, `UpperBound()`, `InitPolicy()` and `DecompPolicy()`.
1628
1629#### Examples:
1630
1631<details open>
1632<summary>Click to collapse/expand example code.
1633</summary>
1634
1635```c++
1636SchafferFunctionN1<arma::mat> SCH;
1637arma::vec lowerBound("-10 -10");
1638arma::vec upperBound("10 10");
1639DefaultMOEAD opt(300, 300, 1.0, 0.9, 20, 20, 0.5, 2, 1E-10, lowerBound, upperBound);
1640typedef decltype(SCH.objectiveA) ObjectiveTypeA;
1641typedef decltype(SCH.objectiveB) ObjectiveTypeB;
1642arma::mat coords = SCH.GetInitialPoint();
1643std::tuple<ObjectiveTypeA, ObjectiveTypeB> objectives = SCH.GetObjectives();
1644// obj will contain the minimum sum of objectiveA and objectiveB found on the best front.
1645double obj = opt.Optimize(objectives, coords);
1646// Now obtain the best front.
1647arma::cube bestFront = opt.ParetoFront();
1648```
1649</details>
1650
1651#### See also
1652* [MOEA/D-DE Algorithm](https://ieeexplore.ieee.org/document/4633340)
1653* [Multi-objective Functions in Wikipedia](https://en.wikipedia.org/wiki/Test_functions_for_optimization#Test_functions_for_multi-objective_optimization)
1654* [Multi-objective functions](#multi-objective-functions)
1655
1656## NSGA2
1657
1658*An optimizer for arbitrary multi-objective functions.*
1659
1660NSGA2 (Non-dominated Sorting Genetic Algorithm - II) is a multi-objective
1661optimization algorithm. The algorithm works by generating a candidate population
1662from a fixed starting point. At each stage of optimization, a new population of
1663children is generated. This new population along with its predecessor is sorted
1664using non-domination as the metric. Following this, the population is further
1665segregated into fronts. A new population is generated from these fronts having
1666size equal to that of the starting population.
1667
1668#### Constructors
1669
1670 * `NSGA2()`
1671 * `NSGA2(`_`populationSize, maxGenerations, crossoverProb, mutationProb, mutationStrength, epsilon, lowerBound, upperBound`_`)`
1672
1673#### Attributes
1674
1675| **type** | **name** | **description** | **default** |
1676|----------|----------|-----------------|-------------|
1677| `size_t` | **`populationSize`** | The number of candidates in the population. This should be at least 4 in size and a multiple of 4. | `100` |
1678| `size_t` | **`maxGenerations`** | The maximum number of generations allowed for NSGA2. | `2000` |
1679| `double` | **`crossoverProb`** | Probability that a crossover will occur. | `0.6` |
1680| `double` | **`mutationProb`** | Probability that a weight will get mutated. | `0.3` |
1681| `double` | **`mutationStrength`** | The range of mutation noise to be added. This range is between 0 and mutationStrength. | `0.001` |
1682| `double` | **`epsilon`** | The value used internally to evaluate approximate equality in crowding distance based sorting. | `1e-6` |
1683| `double`, `arma::vec` | **`lowerBound`** | Lower bound of the coordinates on the coordinates of the whole population during the search process. | `0` |
1684| `double`, `arma::vec` | **`upperBound`** | Lower bound of the coordinates on the coordinates of the whole population during the search process. | `1` |
1685
1686Note that the parameters `lowerBound` and `upperBound` are overloaded. Data types of `double` or `arma::mat` may be used. If they are initialized as single values of `double`, then the same value of the bound applies to all the axes, resulting in an initialization following a uniform distribution in a hypercube. If they are initialized as matrices of `arma::mat`, then the value of `lowerBound[i]` applies to axis `[i]`; similarly, for values in `upperBound`. This results in an initialization following a uniform distribution in a hyperrectangle within the specified bounds.
1687
1688Attributes of the optimizer may also be changed via the member methods
1689`PopulationSize()`, `MaxGenerations()`, `CrossoverRate()`, `MutationProbability()`, `MutationStrength()`, `Epsilon()`, `LowerBound()` and `UpperBound()`.
1690
1691#### Examples:
1692
1693<details open>
1694<summary>Click to collapse/expand example code.
1695</summary>
1696
1697```c++
1698SchafferFunctionN1<arma::mat> SCH;
1699arma::vec lowerBound("-1000 -1000");
1700arma::vec upperBound("1000 1000");
1701NSGA2 opt(20, 5000, 0.5, 0.5, 1e-3, 1e-6, lowerBound, upperBound);
1702
1703typedef decltype(SCH.objectiveA) ObjectiveTypeA;
1704typedef decltype(SCH.objectiveB) ObjectiveTypeB;
1705
1706arma::mat coords = SCH.GetInitialPoint();
1707std::tuple<ObjectiveTypeA, ObjectiveTypeB> objectives = SCH.GetObjectives();
1708
1709// obj will contain the minimum sum of objectiveA and objectiveB found on the best front.
1710double obj = opt.Optimize(objectives, coords);
1711// Now obtain the best front.
1712arma::cube bestFront = opt.Front();
1713```
1714
1715</details>
1716
1717#### See also:
1718
1719 * [NSGA-II Algorithm](https://www.iitk.ac.in/kangal/Deb_NSGA-II.pdf)
1720 * [Multi-objective Functions in Wikipedia](https://en.wikipedia.org/wiki/Test_functions_for_optimization#Test_functions_for_multi-objective_optimization)
1721  * [Multi-objective functions](#multi-objective-functions)
1722
1723## OptimisticAdam
1724
1725*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1726
1727OptimisticAdam is an optimizer which implements the Optimistic Adam algorithm
1728which uses Optmistic Mirror Descent with the Adam Optimizer.  It addresses the
1729problem of limit cycling while training GANs (generative adversarial networks).
1730It uses OMD to achieve faster regret rates in solving the zero sum game of
1731training a GAN. It consistently achieves a smaller KL divergence with~ respect to
1732the true underlying data distribution.  The implementation here can be used with
1733any differentiable separable function, not just GAN training.
1734
1735#### Constructors
1736
1737 * `OptimisticAdam()`
1738 * `OptimisticAdam(`_`stepSize, batchSize`_`)`
1739 * `OptimisticAdam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle`_`)`
1740 * `OptimisticAdam(`_`stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy`_`)`
1741
1742Note that the `OptimisticAdam` class is based on the
1743`AdamType<`_`UpdateRule`_`>` class with _`UpdateRule`_` = OptimisticAdamUpdate`.
1744
1745#### Attributes
1746
1747| **type** | **name** | **description** | **default** |
1748|----------|----------|-----------------|-------------|
1749| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
1750| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
1751| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
1752| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
1753| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
1754| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1755| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1756| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1757| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
1758
1759The attributes of the optimizer may also be modified via the member methods
1760`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
1761`Tolerance()`, `Shuffle()`, and `ResetPolicy()`.
1762
1763#### Examples
1764
1765<details open>
1766<summary>Click to collapse/expand example code.
1767</summary>
1768
1769```c++
1770RosenbrockFunction f;
1771arma::mat coordinates = f.GetInitialPoint();
1772
1773OptimisticAdam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
1774optimizer.Optimize(f, coordinates);
1775```
1776
1777</details>
1778
1779#### See also:
1780
1781 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1782 * [SGD](#standard-sgd)
1783 * [Training GANs with Optimism](https://arxiv.org/pdf/1711.00141.pdf)
1784 * [Differentiable separable functions](#differentiable-separable-functions)
1785
1786## Padam
1787
1788*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
1789
1790Padam is a variant of Adam with a partially adaptive momentum estimation method.
1791
1792#### Constructors
1793
1794 * `Padam()`
1795 * `Padam(`_`stepSize, batchSize`_`)`
1796 * `Padam(`_`stepSize, batchSize, beta1, beta2, partial, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
1797
1798#### Attributes
1799
1800| **type** | **name** | **description** | **default** |
1801|----------|----------|-----------------|-------------|
1802| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
1803| `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
1804| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
1805| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
1806| `double` | **`partial`** | Partially adaptive parameter. | `0.25` |
1807| `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
1808| `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
1809| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
1810| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
1811| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
1812| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
1813
1814The attributes of the optimizer may also be modified via the member methods
1815`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Partial()`, `Epsilon()`,
1816`MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
1817
1818#### Examples
1819
1820<details open>
1821<summary>Click to collapse/expand example code.
1822</summary>
1823
1824```c++
1825RosenbrockFunction f;
1826arma::mat coordinates = f.GetInitialPoint();
1827
1828Padam optimizer(0.001, 32, 0.9, 0.999, 0.25, 1e-8, 100000, 1e-5, true);
1829optimizer.Optimize(f, coordinates);
1830```
1831
1832</details>
1833
1834#### See also:
1835 * [Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks](https://arxiv.org/abs/1806.06763)
1836 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
1837 * [SGD](#standard-sgd)
1838 * [Adam: A Method for Stochastic Optimization](http://arxiv.org/abs/1412.6980)
1839 * [Differentiable separable functions](#differentiable-separable-functions)
1840
1841## PSO
1842
1843*An optimizer for [arbitrary functions](#arbitrary-functions).*
1844
1845PSO is an evolutionary approach to optimization that is inspired by flocks or birds or fishes. The fundamental analogy is that every creature (particle in a swarm) is at a measurable position of goodness or fitness, and this information can be shared amongst the creatures in the flock, so that iteratively, the entire flock can get close to the global optimum.
1846
1847#### Constructors
1848
1849 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>()`
1850 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles`_`)`
1851 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound`_`)`
1852 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations`_`)`
1853 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations, horizonSize`_`)`
1854 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations, horizonSize, impTolerance`_`)`
1855 * `PSOType<`_`VelocityUpdatePolicy, InitPolicy`_`>(`_`numParticles, lowerBound, upperBound, maxIterations, horizonSize, impTolerance, exploitationFactor, explorationFactor`_`)`
1856
1857#### Attributes
1858
1859| **type** | **name** | **description** | **default** |
1860|----------|----------|-----------------|-------------|
1861| `size_t` | **`numParticles`** |  numParticles Number of particles in the swarm. | `64` |
1862| `double`, `arma::mat` | **`lowerBound`** | Lower bound of the coordinates of the initial population. | `1` |
1863| `double`, `arma::mat` | **`upperBound`** |  Upper bound of the coordinates of the initial population. | `1` |
1864| `size_t` | **`maxIterations`** | Maximum number of iterations allowed. | `3000` |
1865| `size_t` | **`horizonSize`** | Size of the lookback-horizon for computing improvement. | `350` |
1866| `double` | **`impTolerance`** | The final value of the objective function for termination. If set to negative value, tolerance is not considered. | `1e-5` |
1867| `double` | **`exploitationFactor`** | Influence of the personal best of the particle. | `2.05` |
1868| `double` | **`explorationFactor`** | Influence of the neighbours of the particle. | `2.05` |
1869
1870Note that the parameters `lowerBound` and `upperBound` are overloaded. Data types of `double` or `arma::mat` may be used. If they are initialized as single values of `double`, then the same value of the bound applies to all the axes, resulting in an initialization following a uniform distribution in a hypercube. If they are initialized as matrices of `arma::mat`, then the value of `lowerBound[i]` applies to axis `[i]`; similarly, for values in `upperBound`. This results in an initialization following a uniform distribution in a hyperrectangle within the specified bounds.
1871
1872Attributes of the optimizer may also be changed via the member methods
1873`NumParticles()`, `LowerBound()`, `UpperBound()`, `MaxIterations()`,
1874`HorizonSize()`, `ImpTolerance()`,`ExploitationFactor()`, and
1875`ExplorationFactor()`.
1876
1877At present, only the local-best variant of PSO is present in ensmallen. The optimizer may be initialized using the class type `LBestPSO`, which is an alias for `PSOType<LBestUpdate, DefaultInit>`.
1878
1879#### Examples:
1880
1881<details open>
1882<summary>Click to collapse/expand example code.
1883</summary>
1884
1885```c++
1886SphereFunction f(4);
1887arma::vec coordinates = f.GetInitialPoint();
1888
1889LBestPSO s;
1890const double result = s.Optimize(f, coordinates)
1891```
1892
1893</details>
1894
1895<details open>
1896<summary>Click to collapse/expand example code.
1897</summary>
1898
1899```c++
1900RosenbrockFunction f;
1901arma::vec coordinates = f.GetInitialPoint();
1902
1903// Setting bounds for the initial swarm population of size 2.
1904arma::vec lowerBound("50 50");
1905arma::vec upperBound("60 60");
1906
1907LBestPSO s(200, lowerBound, upperBound, 3000, 600, 1e-30, 2.05, 2.05);
1908const double result = s.Optimize(f, coordinates)
1909```
1910
1911</details>
1912
1913<details open>
1914<summary>Click to collapse/expand example code.
1915</summary>
1916
1917```c++
1918RosenbrockFunction f;
1919arma::vec coordinates = f.GetInitialPoint();
1920
1921// Setting bounds for the initial swarm population as type double.
1922double lowerBound = 50;
1923double upperBound = 60;
1924
1925LBestPSO s(64, lowerBound, upperBound, 3000, 400, 1e-30, 2.05, 2.05);
1926const double result = s.Optimize(f, coordinates)
1927```
1928
1929</details>
1930
1931#### See also:
1932
1933 * [Particle Swarm Optimization](http://www.swarmintelligence.org/)
1934 * [Arbitrary functions](#arbitrary-functions)
1935
1936
1937## Primal-dual SDP Solver
1938
1939*An optimizer for [semidefinite programs](#semidefinite-programs).*
1940
1941A primal-dual interior point method solver.  This can solve semidefinite
1942programs.
1943
1944#### Constructors
1945
1946 * `PrimalDualSolver<>(`_`maxIterations`_`)`
1947 * `PrimalDualSolver<>(`_`maxIterations, tau, normXzTol, primalInfeasTol, dualInfeasTol`_`)`
1948
1949#### Attributes
1950
1951The `PrimalDualSolver<>` class has several attributes that are only modifiable
1952as member methods.
1953
1954| **type** | **method name** | **description** | **default** |
1955|----------|----------|-----------------|-------------|
1956| `double` | **`Tau()`** | Value of tau used to compute alpha\_hat. | `0.99` |
1957| `double` | **`NormXZTol()`** | Tolerance for the norm of X\*Z. | `1e-7` |
1958| `double` | **`PrimalInfeasTol()`** | Tolerance for primal infeasibility. | `1e-7` |
1959| `double` | **`DualInfeasTol()`** | Tolerance for dual infeasibility. | `1e-7` |
1960| `size_t` | **`MaxIterations()`** | Maximum number of iterations before convergence. | `1000` |
1961
1962#### Optimization
1963
1964The `PrimalDualSolver<>` class offers two overloads of `Optimize()` that
1965optionally return the converged values for the dual variables.
1966
1967<details open>
1968<summary>Click to collapse/expand example code.
1969</summary>
1970
1971```c++
1972/**
1973 * Invoke the optimization procedure, returning the converged values for the
1974 * primal and dual variables.
1975 */
1976template<typename SDPType>
1977double Optimize(SDPType& s,
1978                arma::mat& X,
1979                arma::vec& ySparse,
1980                arma::vec& yDense,
1981                arma::mat& Z);
1982
1983/**
1984 * Invoke the optimization procedure, and only return the primal variable.
1985 */
1986template<typename SDPType>
1987double Optimize(SDPType& s, arma::mat& X);
1988```
1989
1990</details>
1991
1992The _`SDPType`_ template parameter specifies the type of SDP to solve.  The
1993`SDP<arma::mat>` and `SDP<arma::sp_mat>` classes are available for use; these
1994represent SDPs with dense and sparse `C` matrices, respectively.  The `SDP<>`
1995class is detailed in the [semidefinite program
1996documentation](#semidefinite-programs).  _`SDPType`_ is automatically inferred
1997when `Optimize()` is called with an SDP.
1998
1999#### See also:
2000
2001 * [Primal-dual interior-point methods for semidefinite programming](http://www.dtic.mil/dtic/tr/fulltext/u2/1020236.pdf)
2002 * [Semidefinite programming on Wikipedia](https://en.wikipedia.org/wiki/Semidefinite_programming)
2003 * [Semidefinite programs](#semidefinite-programs) (includes example usage of `PrimalDualSolver`)
2004
2005## Quasi-Hyperbolic Momentum Update SGD (QHSGD)
2006
2007*An optimizer for [differentiable separable
2008functions](#differentiable-separable-functions).*
2009
2010Quasi-hyperbolic momentum update SGD (QHSGD) is an SGD-like optimizer with
2011momentum where quasi-hyperbolic terms are added to the parametrization.  The
2012update rule for this optimizer is a weighted average of momentum SGD and vanilla
2013SGD.
2014
2015#### Constructors
2016
2017  * `QHSGD()`
2018  * `QHSGD(`_`stepSize, batchSize`_`)`
2019  * `QHSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, exactObjective`_`)`
2020
2021 Note that `QHSGD` is based on the templated type
2022 `SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
2023 QHUpdate` and _`DecayPolicyType`_` = NoDecay`.
2024
2025#### Attributes
2026
2027 | **type** | **name** | **description** | **default** |
2028 |----------|----------|-----------------|-------------|
2029 | `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2030 | `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
2031 | `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2032 | `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2033 | `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
2034 | `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2035
2036 Attributes of the optimizer may also be modified via the member methods
2037 `StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`,  and `ExactObjective()`.
2038
2039 Note that the `QHUpdate` class has the constructor  `QHUpdate(`_`v,
2040momentum`_`)` with a default value of `0.7` for the quasi-hyperbolic term `v`
2041and `0.999` for the momentum term.
2042
2043#### Examples
2044
2045<details open>
2046<summary>Click to collapse/expand example code.
2047</summary>
2048
2049```c++
2050RosenbrockFunction f;
2051arma::mat coordinates = f.GetInitialPoint();
2052
2053QHSGD optimizer(0.01, 32, 100000, 1e-5, true);
2054optimizer.Optimize(f, coordinates);
2055```
2056
2057</details>
2058
2059#### See also:
2060
2061  * [Quasi-Hyperbolic Momentum and Adam For Deep Learning](https://arxiv.org/pdf/1810.06801.pdf)
2062  * [SGD](#sgd)
2063  * [Momentum SGD](#momentum-sgd)
2064  * [Nesterov Momentum SGD](#nesterov-momentum-sgd)
2065  * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2066  * [Differentiable separable functions](#differentiable-separable-functions)
2067
2068## QHAdam
2069
2070*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2071
2072QHAdam is an optimizer that uses quasi-hyperbolic descent with the Adam
2073optimizer.  This replaces the moment estimators of Adam with quasi-hyperbolic
2074terms, and various values of the `v1` and `v2` parameters are equivalent to
2075the following other optimizers:
2076
2077 * When `v1 = v2 = 1`, `QHAdam` is equivalent to `Adam`.
2078
2079 * When `v1 = 0` and `v2 = 1`, `QHAdam` is equivalent to `RMSProp`.
2080
2081 * When `v1 = beta1` and `v2 = 1`, `QHAdam` is equivalent to `Nadam`.
2082
2083#### Constructors
2084
2085  * `QHAdam()`
2086  * `QHAdam(`_`stepSize, batchSize`_`)`
2087  * `QHAdam(`_`stepSize, batchSize, v1, v2, beta1, beta2, eps, maxIterations`_`)`
2088  * `QHAdam(`_`stepSize, batchSize, v1, v2, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
2089
2090#### Attributes
2091
2092 | **type** | **name** | **description** | **default** |
2093 |----------|----------|-----------------|-------------|
2094 | `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
2095 | `size_t` | **`batchSize`** | Number of points to process in a single step. | `32` |
2096 | `double` | **`v1`** | The First Quasi Hyperbolic Term. | `0.7` |
2097 | `double` | **`v2`** | The Second Quasi Hyperbolic Term. | `1.00` |
2098 | `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
2099 | `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
2100 | `double` | **`eps`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
2101 | `size_t` | **`max_iterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2102 | `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2103 | `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
2104 | `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2105 | `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2106
2107 The attributes of the optimizer may also be modified via the member methods
2108 `StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Eps()`, `MaxIterations()`,
2109 `Tolerance()`, `Shuffle()`, `V1()`, `V2()`, `ResetPolicy()`, and `ExactObjective()`.
2110
2111#### Examples
2112
2113 ```c++
2114 RosenbrockFunction f;
2115 arma::mat coordinates = f.GetInitialPoint();
2116
2117 QHAdam optimizer(0.001, 32, 0.7, 0.9, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
2118 optimizer.Optimize(f, coordinates);
2119 ```
2120
2121#### See also:
2122
2123  * [Quasi-Hyperbolic Momentum and Adam For Deep Learning](https://arxiv.org/pdf/1810.06801.pdf)
2124  * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2125  * [SGD](#standard-sgd)
2126  * [Adam](#adam)
2127  * [RMSprop](#rmsprop)
2128  * [Nadam](#nadam)
2129  * [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
2130  * [Differentiable separable functions](#differentiable-separable-functions)
2131
2132## RMSProp
2133
2134*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2135
2136RMSProp utilizes the magnitude of recent gradients to normalize the gradients.
2137
2138#### Constructors
2139
2140 * `RMSProp()`
2141 * `RMSProp(`_`stepSize, batchSize`_`)`
2142 * `RMSProp(`_`stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle`_`)`
2143 * `RMSProp(`_`stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
2144
2145#### Attributes
2146
2147| **type** | **name** | **description** | **default** |
2148|----------|----------|-----------------|-------------|
2149| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2150| `size_t` | **`batchSize`** | Number of points to process in each step. | `32` |
2151| `double` | **`alpha`** | Smoothing constant, similar to that used in AdaDelta and momentum methods. | `0.99` |
2152| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-8` |
2153| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2154| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. |
2155| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
2156| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2157| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2158
2159Attributes of the optimizer can also be modified via the member methods
2160`StepSize()`, `BatchSize()`, `Alpha()`, `Epsilon()`, `MaxIterations()`,
2161`Tolerance()`, `Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
2162
2163#### Examples:
2164
2165<details open>
2166<summary>Click to collapse/expand example code.
2167</summary>
2168
2169```c++
2170RosenbrockFunction f;
2171arma::mat coordinates = f.GetInitialPoint();
2172
2173RMSProp optimizer(1e-3, 1, 0.99, 1e-8, 5000000, 1e-9, true);
2174optimizer.Optimize(f, coordinates);
2175```
2176
2177</details>
2178
2179#### See also:
2180
2181 * [Divide the gradient by a running average of its recent magnitude](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
2182 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp)
2183 * [Differentiable separable functions](#differentiable-separable-functions)
2184
2185## Simulated Annealing (SA)
2186
2187*An optimizer for [arbitrary functions](#arbitrary-functions).*
2188
2189Simulated Annealing is an stochastic optimization algorithm which is able to
2190deliver near-optimal results quickly without knowing the gradient of the
2191function being optimized.  It has a unique hill climbing capability that makes
2192it less vulnerable to local minima. This implementation uses exponential cooling
2193schedule and feedback move control by default, but the cooling schedule can be
2194changed via a template parameter.
2195
2196#### Constructors
2197
2198 * `SA<`_`CoolingScheduleType`_`>(`_`coolingSchedule`_`)`
2199 * `SA<`_`CoolingScheduleType`_`>(`_`coolingSchedule, maxIterations`_`)`
2200 * `SA<`_`CoolingScheduleType`_`>(`_`coolingSchedule, maxIterations, initT, initMoves, moveCtrlSweep, tolerance, maxToleranceSweep, maxMoveCoef, initMoveCoef, gain`_`)`
2201
2202The _`CoolingScheduleType`_ template parameter implements a policy to update the
2203temperature.  The `ExponentialSchedule` class is available for use; it has a
2204constructor `ExponentialSchedule(`_`lambda`_`)` where _`lambda`_ is the cooling
2205speed (default `0.001`).  Custom schedules may be created by implementing a
2206class with at least the single member method below:
2207
2208<details open>
2209<summary>Click to collapse/expand example code.
2210</summary>
2211
2212```c++
2213// Return the next temperature given the current system status.
2214double NextTemperature(const double currentTemperature,
2215                       const double currentEnergy);
2216```
2217
2218</details>
2219
2220For convenience, the default cooling schedule is `ExponentialSchedule`, so the
2221shorter type `SA<>` may be used instead of the equivalent
2222`SA<ExponentialSchedule>`.
2223
2224#### Attributes
2225
2226| **type** | **name** | **description** | **default** |
2227|----------|----------|-----------------|-------------|
2228| `CoolingScheduleType` | **`coolingSchedule`** | Instantiated cooling schedule (default ExponentialSchedule). | **CoolingScheduleType()** |
2229| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 indicates no limit). | `1000000` |
2230| `double` | **`initT`** | Initial temperature. | `10000.0` |
2231| `size_t` | **`initMoves`** | Number of initial iterations without changing temperature. | `1000` |
2232| `size_t` | **`moveCtrlSweep`** | Sweeps per feedback move control. | `100` |
2233| `double` | **`tolerance`** | Tolerance to consider system frozen. | `1e-5` |
2234| `size_t` | **`maxToleranceSweep`** | Maximum sweeps below tolerance to consider system frozen. | `3` |
2235| `double` | **`maxMoveCoef`** | Maximum move size. | `20` |
2236| `double` | **`initMoveCoef`** | Initial move size. | `0.3` |
2237| `double` | **`gain`** | Proportional control in feedback move control. | `0.3` |
2238
2239Attributes of the optimizer may also be changed via the member methods
2240`CoolingSchedule()`, `MaxIterations()`, `InitT()`, `InitMoves()`,
2241`MoveCtrlSweep()`, `Tolerance()`, `MaxToleranceSweep()`, `MaxMoveCoef()`,
2242`InitMoveCoef()`, and `Gain()`.
2243
2244#### Examples:
2245
2246<details open>
2247<summary>Click to collapse/expand example code.
2248</summary>
2249
2250```c++
2251RosenbrockFunction f;
2252arma::mat coordinates = f.GetInitialPoint();
2253
2254SA<> optimizer(ExponentialSchedule(), 1000000, 1000., 1000, 100, 1e-10, 3, 1.5,
2255    0.5, 0.3);
2256optimizer.Optimize(f, coordinates);
2257```
2258
2259</details>
2260
2261#### See also:
2262
2263 * [Simulated annealing on Wikipedia](https://en.wikipedia.org/wiki/Simulated_annealing)
2264 * [Arbitrary functions](#arbitrary-functions)
2265
2266## Simultaneous Perturbation Stochastic Approximation (SPSA)
2267
2268*An optimizer for [arbitrary functions](#arbitrary-functions).*
2269
2270The SPSA algorithm approximates the gradient of the function by finite
2271differences along stochastic directions.
2272
2273#### Constructors
2274
2275 * `SPSA(`_`alpha, gamma, stepSize, evaluationStepSize, maxIterations, tolerance`_`)`
2276
2277#### Attributes
2278
2279| **type** | **name** | **description** | **default** |
2280|----------|----------|-----------------|-------------|
2281| `double` | **`alpha`** | Scaling exponent for the step size. | `0.602` |
2282| `double` | **`gamma`** | Scaling exponent for evaluation step size. | `0.101` |
2283| `double` | **`stepSize`** | Scaling parameter for step size (named as 'a' in the paper). | `0.16` |
2284| `double` | **`evaluationStepSize`** | Scaling parameter for evaluation step size (named as 'c' in the paper). | `0.3` |
2285| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2286| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2287
2288Attributes of the optimizer may also be changed via the member methods
2289`Alpha()`, `Gamma()`, `StepSize()`, `EvaluationStepSize()`, and `MaxIterations()`.
2290
2291#### Examples:
2292
2293<details open>
2294<summary>Click to collapse/expand example code.
2295</summary>
2296
2297```c++
2298SphereFunction f(2);
2299arma::mat coordinates = f.GetInitialPoint();
2300
2301SPSA optimizer(0.1, 0.102, 0.16, 0.3, 100000, 1e-5);
2302optimizer.Optimize(f, coordinates);
2303```
2304
2305</details>
2306
2307#### See also:
2308
2309 * [An Overview of the Simultaneous Perturbation Method for Efficient Optimization](https://pdfs.semanticscholar.org/bf67/0fb6b1bd319938c6a879570fa744cf36b240.pdf)
2310 * [SPSA on Wikipedia](https://en.wikipedia.org/wiki/Simultaneous_perturbation_stochastic_approximation)
2311 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2312 * [Differentiable separable functions](#differentiable-separable-functions)
2313
2314## Stochastic Recursive Gradient Algorithm (SARAH/SARAH+)
2315
2316*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2317
2318StochAstic Recusive gRadient algoritHm (SARAH), is a variance reducing
2319stochastic recursive gradient algorithm which employs the stochastic recursive
2320gradient, for solving empirical loss minimization for the case of nonconvex
2321losses.
2322
2323#### Constructors
2324
2325 * `SARAHType<`_`UpdatePolicyType`_`>()`
2326 * `SARAHType<`_`UpdatePolicyType`_`>(`_`stepSize, batchSize`_`)`
2327 * `SARAHType<`_`UpdatePolicyType`_`>(`_`stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, exactObjective`_`)`
2328
2329The _`UpdatePolicyType`_ template parameter specifies the update step used for
2330the optimizer.  The `SARAHUpdate` and `SARAHPlusUpdate` classes are available
2331for use, and implement the standard SARAH update and SARAH+ update,
2332respectively.  A custom update rule can be used by implementing a class with the
2333same method signatures.
2334
2335For convenience the following typedefs have been defined:
2336
2337 * `SARAH` (equivalent to `SARAHType<SARAHUpdate>`): the standard SARAH optimizer
2338 * `SARAH_Plus` (equivalent to `SARAHType<SARAHPlusUpdate>`): the SARAH+ optimizer
2339
2340#### Attributes
2341
2342| **type** | **name** | **description** | **default** |
2343|----------|----------|-----------------|-------------|
2344| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2345| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
2346| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
2347| `size_t` | **`innerIterations`** | The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. | `0` |
2348| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2349| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
2350| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
2351| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2352
2353Attributes of the optimizer may also be changed via the member methods
2354`StepSize()`, `BatchSize()`, `MaxIterations()`, `InnerIterations()`,
2355`Tolerance()`, `Shuffle()`, `UpdatePolicy()`, and `ExactObjective()`.
2356
2357Note that the default value for `updatePolicy` is the default constructor for
2358the `UpdatePolicyType`.
2359
2360#### Examples:
2361
2362<details open>
2363<summary>Click to collapse/expand example code.
2364</summary>
2365
2366```c++
2367RosenbrockFunction f;
2368arma::mat coordinates = f.GetInitialPoint();
2369
2370// Standard stochastic variance reduced gradient.
2371SARAH optimizer(0.01, 1, 5000, 0, 1e-5, true);
2372optimizer.Optimize(f, coordinates);
2373
2374// Stochastic variance reduced gradient with Barzilai-Borwein.
2375SARAH_Plus optimizerPlus(0.01, 1, 5000, 0, 1e-5, true);
2376optimizerPlus.Optimize(f, coordinates);
2377```
2378
2379</details>
2380
2381#### See also:
2382
2383 * [Stochastic Recursive Gradient Algorithm for Nonconvex Optimization](https://arxiv.org/abs/1705.07261)
2384 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2385 * [Differentiable separable functions](#differentiable-separable-functions)
2386
2387## Standard SGD
2388
2389*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2390
2391Stochastic Gradient Descent is a technique for minimizing a function which
2392can be expressed as a sum of other functions.  It's likely better to use any of
2393the other variants of SGD than this class; however, this standard SGD
2394implementation may still be useful in some situations.
2395
2396#### Constructors
2397
2398 * `StandardSGD()`
2399 * `StandardSGD(`_`stepSize, batchSize`_`)`
2400 * `StandardSGD(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy, exactObjective`_`)`
2401
2402Note that `StandardSGD` is based on the templated type
2403`SGD<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
2404VanillaUpdate` and _`DecayPolicyType`_` = NoDecay`.
2405
2406#### Attributes
2407
2408| **type** | **name** | **description** | **default** |
2409|----------|----------|-----------------|-------------|
2410| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2411| `size_t` | **`batchSize`** | Batch size to use for each step. | `32` |
2412| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2413| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2414| `bool` | **`shuffle`** | If true, the function order is shuffled; otherwise, each function is visited in linear order. | `true` |
2415| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
2416| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
2417| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
2418| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2419
2420Attributes of the optimizer may also be modified via the member methods
2421`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and
2422`ExactObjective()`.
2423
2424#### Examples
2425
2426<details open>
2427<summary>Click to collapse/expand example code.
2428</summary>
2429
2430```c++
2431RosenbrockFunction f;
2432arma::mat coordinates = f.GetInitialPoint();
2433
2434StandardSGD optimizer(0.01, 32, 100000, 1e-5, true);
2435optimizer.Optimize(f, coordinates);
2436```
2437
2438</details>
2439
2440#### See also:
2441
2442 * [Momentum SGD](#momentum-sgd)
2443 * [Nesterov Momentum SGD](#nesterov-momentum-sgd)
2444 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2445 * [Differentiable separable functions](#differentiable-separable-functions)
2446
2447## Stochastic Coordinate Descent (SCD)
2448
2449*An optimizer for [partially differentiable functions](#partially-differentiable-functions).*
2450
2451Stochastic Coordinate descent is a technique for minimizing a function by
2452doing a line search along a single direction at the current point in the
2453iteration. The direction (or "coordinate") can be chosen cyclically, randomly
2454or in a greedy fashion.
2455
2456#### Constructors
2457
2458 * `SCD<`_`DescentPolicyType`_`>()`
2459 * `SCD<`_`DescentPolicyType`_`>(`_`stepSize, maxIterations`_`)`
2460 * `SCD<`_`DescentPolicyType`_`>(`_`stepSize, maxIterations, tolerance, updateInterval`_`)`
2461 * `SCD<`_`DescentPolicyType`_`>(`_`stepSize, maxIterations, tolerance, updateInterval, descentPolicy`_`)`
2462
2463The _`DescentPolicyType`_ template parameter specifies the behavior of SCD when
2464selecting the next coordinate to descend with.  The `RandomDescent`,
2465`GreedyDescent`, and `CyclicDescent` classes are available for use.  Custom
2466behavior can be achieved by implementing a class with the same method
2467signatures.
2468
2469For convenience, the following typedefs have been defined:
2470
2471 * `RandomSCD` (equivalent to `SCD<RandomDescent>`): selects coordinates randomly
2472 * `GreedySCD` (equivalent to `SCD<GreedyDescent>`): selects the coordinate with the maximum guaranteed descent according to the Gauss-Southwell rule
2473 * `CyclicSCD` (equivalent to `SCD<CyclicDescent>`): selects coordinates sequentially
2474
2475#### Attributes
2476
2477| **type** | **name** | **description** | **default** |
2478|----------|----------|-----------------|-------------|
2479| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2480| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2481| `double` | **`tolerance`** | Maximum absolute tolerance to terminate the algorithm. | `1e-5` |
2482| `size_t` | **`updateInterval`** | The interval at which the objective is to be reported and checked for convergence. | `1e3` |
2483| `DescentPolicyType` | **`descentPolicy`** | The policy to use for selecting the coordinate to descend on. | `DescentPolicyType()` |
2484
2485Attributes of the optimizer may also be modified via the member methods
2486`StepSize()`, `MaxIterations()`, `Tolerance()`, `UpdateInterval()`, and
2487`DescentPolicy()`.
2488
2489Note that the default value for `descentPolicy` is the default constructor for
2490_`DescentPolicyType`_.
2491
2492#### Examples
2493
2494<details open>
2495<summary>Click to collapse/expand example code.
2496</summary>
2497
2498```c++
2499SparseTestFunction f;
2500arma::mat coordinates = f.GetInitialPoint();
2501
2502RandomSCD randomscd(0.01, 100000, 1e-5, 1e3);
2503randomscd.Optimize(f, coordinates);
2504
2505GreedySCD greedyscd(0.01, 100000, 1e-5, 1e3);
2506greedyscd.Optimize(f, coordinates);
2507
2508CyclicSCD cyclicscd(0.01, 100000, 1e-5, 1e3);
2509cyclicscd.Optimize(f, coordinates);
2510```
2511
2512</details>
2513
2514#### See also:
2515
2516 * [Coordinate descent on Wikipedia](https://en.wikipedia.org/wiki/Coordinate_descent)
2517 * [Stochastic Methods for L1-Regularized Loss Minimization](https://www.jmlr.org/papers/volume12/shalev-shwartz11a/shalev-shwartz11a.pdf)
2518 * [Partially differentiable functions](#partially-differentiable-functions)
2519
2520## Stochastic Gradient Descent with Restarts (SGDR)
2521
2522*An optimizer for [differentiable separable
2523functions](#differentiable-separable-functions).*
2524
2525SGDR is based on Mini-batch Stochastic Gradient Descent class and simulates a
2526new warm-started run/restart once a number of epochs are performed.
2527
2528#### Constructors
2529
2530 * `SGDR<`_`UpdatePolicyType`_`>()`
2531 * `SGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize`_`)`
2532 * `SGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy`_`)`
2533 * `SGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy, resetPolicy, exactObjective`_`)`
2534
2535The _`UpdatePolicyType`_ template parameter controls the update policy used
2536during the iterative update process.  The `MomentumUpdate` class is available
2537for use, and custom behavior can be achieved by implementing a class with the
2538same method signatures as `MomentumUpdate`.
2539
2540For convenience, the default type of _`UpdatePolicyType`_ is `MomentumUpdate`,
2541so the shorter type `SGDR<>` can be used instead of the equivalent
2542`SGDR<MomentumUpdate>`.
2543
2544#### Attributes
2545
2546| **type** | **name** | **description** | **default** |
2547|----------|----------|-----------------|-------------|
2548| `size_t` | **`epochRestart`** | Initial epoch where decay is applied. | `50` |
2549| `double` | **`multFactor`** | Batch size multiplication factor. | `2.0` |
2550| `size_t` | **`batchSize`** | Size of each mini-batch. | `1000` |
2551| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2552| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2553| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2554| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
2555| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
2556| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2557| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2558
2559Attributes of the optimizer can also be modified via the member methods
2560`EpochRestart()`, `MultFactor()`, `BatchSize()`, `StepSize()`,
2561`MaxIterations()`, `Tolerance()`, `Shuffle()`, `UpdatePolicy()`, `ResetPolicy()`, and
2562`ExactObjective()`.
2563
2564Note that the default value for `updatePolicy` is the default constructor for
2565the `UpdatePolicyType`.
2566
2567#### Examples:
2568
2569<details open>
2570<summary>Click to collapse/expand example code.
2571</summary>
2572
2573```c++
2574RosenbrockFunction f;
2575arma::mat coordinates = f.GetInitialPoint();
2576
2577SGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
2578optimizer.Optimize(f, coordinates);
2579```
2580
2581</details>
2582
2583#### See also:
2584
2585 * [SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983)
2586 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2587 * [Differentiable separable functions](#differentiable-separable-functions)
2588
2589## Snapshot Stochastic Gradient Descent with Restarts (SnapshotSGDR)
2590
2591*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2592
2593SnapshotSGDR simulates a new warm-started run/restart once a number of epochs
2594are performed using the Snapshot Ensembles technique.
2595
2596#### Constructors
2597
2598 * `SnapshotSGDR<`_`UpdatePolicyType`_`>()`
2599 * `SnapshotSGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize`_`)`
2600 * `SnapshotSGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy`_`)`
2601 * `SnapshotSGDR<`_`UpdatePolicyType`_`>(`_`epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy, resetPolicy, exactObjective`_`)`
2602
2603The _`UpdatePolicyType`_ template parameter controls the update policy used
2604during the iterative update process.  The `MomentumUpdate` class is available
2605for use, and custom behavior can be achieved by implementing a class with the
2606same method signatures as `MomentumUpdate`.
2607
2608For convenience, the default type of _`UpdatePolicyType`_ is `MomentumUpdate`,
2609so the shorter type `SnapshotSGDR<>` can be used instead of the equivalent
2610`SnapshotSGDR<MomentumUpdate>`.
2611
2612#### Attributes
2613
2614| **type** | **name** | **description** | **default** |
2615|----------|----------|-----------------|-------------|
2616| `size_t` | **`epochRestart`** | Initial epoch where decay is applied. | `50` |
2617| `double` | **`multFactor`** | Batch size multiplication factor. | `2.0` |
2618| `size_t` | **`batchSize`** | Size of each mini-batch. | `1000` |
2619| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2620| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2621| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2622| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
2623| `size_t` | **`snapshots`** | Maximum number of snapshots. | `5` |
2624| `bool` | **`accumulate`** | Accumulate the snapshot parameter. | `true` |
2625| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
2626| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2627| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2628
2629Attributes of the optimizer can also be modified via the member methods
2630`EpochRestart()`, `MultFactor()`, `BatchSize()`, `StepSize()`,
2631`MaxIterations()`, `Tolerance()`, `Shuffle()`, `Snapshots()`, `Accumulate()`,
2632`UpdatePolicy()`, `ResetPolicy()`, and `ExactObjective()`.
2633
2634The `Snapshots()` function returns a `std::vector<arma::mat>&` (a vector of
2635snapshots of the parameters), not a `size_t` representing the maximum number of
2636snapshots.
2637
2638Note that the default value for `updatePolicy` is the default constructor for
2639the `UpdatePolicyType`.
2640
2641#### Examples:
2642
2643<details open>
2644<summary>Click to collapse/expand example code.
2645</summary>
2646
2647```c++
2648RosenbrockFunction f;
2649arma::mat coordinates = f.GetInitialPoint();
2650
2651SnapshotSGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
2652optimizer.Optimize(f, coordinates);
2653```
2654
2655</details>
2656
2657#### See also:
2658
2659 * [Snapshot ensembles: Train 1, get m for free](https://arxiv.org/abs/1704.00109)
2660 * [SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983)
2661 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2662 * [Differentiable separable functions](#differentiable-separable-functions)
2663
2664## SMORMS3
2665
2666*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2667
2668SMORMS3 is a hybrid of RMSprop, which is trying to estimate a safe and optimal
2669distance based on curvature or perhaps just normalizing the step-size in the
2670parameter space.
2671
2672#### Constructors
2673
2674 * `SMORMS3()`
2675 * `SMORMS3(`_`stepSize, batchSize`_`)`
2676 * `SMORMS3(`_`stepSize, batchSize, epsilon, maxIterations, tolerance`_`)`
2677 * `SMORMS3(`_`stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
2678
2679#### Attributes
2680
2681| **type** | **name** | **description** | **default** |
2682|----------|----------|-----------------|-------------|
2683| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
2684| `size_t` | **`batchSize`** | Number of points to process at each step. | `32` |
2685| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-16` |
2686| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2687| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2688| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
2689| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2690| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2691
2692Attributes of the optimizer can also be modified via the member methods
2693`StepSize()`, `BatchSize()`, `Epsilon()`, `MaxIterations()`, `Tolerance()`,
2694`Shuffle()`, `ResetPolicy()`, and `ExactObjective()`.
2695
2696#### Examples:
2697
2698<details open>
2699<summary>Click to collapse/expand example code.
2700</summary>
2701
2702```c++
2703RosenbrockFunction f;
2704arma::mat coordinates = f.GetInitialPoint();
2705
2706SMORMS3 optimizer(0.001, 1, 1e-16, 5000000, 1e-9, true);
2707optimizer.Optimize(f, coordinates);
2708```
2709
2710</details>
2711
2712#### See also:
2713
2714 * [RMSprop loses to SMORMS3 - Beware the Epsilon!](https://sifter.org/simon/journal/20150420.html)
2715 * [RMSProp](#rmsprop)
2716 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2717 * [Differentiable separable functions](#differentiable-separable-functions)
2718
2719## Standard stochastic variance reduced gradient (SVRG)
2720
2721*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2722
2723Stochastic Variance Reduced Gradient is a technique for minimizing smooth and
2724strongly convex problems.
2725
2726#### Constructors
2727
2728 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>()`
2729 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>(`_`stepSize`_`)`
2730 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, innerIterations`_`)`
2731 * `SVRGType<`_`UpdatePolicyType, DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy, exactObjective`_`)`
2732
2733The _`UpdatePolicyType`_ template parameter controls the update step used by
2734SVRG during the optimization.  The `SVRGUpdate` class is available for use and
2735custom update behavior can be achieved by implementing a class with the same
2736method signatures as `SVRGUpdate`.
2737
2738The _`DecayPolicyType`_ template parameter controls the decay policy used to
2739adjust the step size during the optimization.  The `BarzilaiBorweinDecay` and
2740`NoDecay` classes are available for use.  Custom decay functionality can be
2741achieved by implementing a class with the same method signatures.
2742
2743For convenience the following typedefs have been defined:
2744
2745 * `SVRG` (equivalent to `SVRGType<SVRGUpdate, NoDecay>`): the standard SVRG technique
2746 * `SVRG_BB` (equivalent to `SVRGType<SVRGUpdate, BarzilaiBorweinDecay>`): SVRG with the Barzilai-Borwein decay policy
2747
2748#### Attributes
2749
2750| **type** | **name** | **description** | **default** |
2751|----------|----------|-----------------|-------------|
2752| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2753| `size_t` | **`batchSize`** | Initial batch size. | `32` |
2754| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `1000` |
2755| `size_t` | **`innerIterations`** | The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. | `0` |
2756| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2757| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
2758| `UpdatePolicyType` | **`updatePolicy`** | Instantiated update policy used to adjust the given parameters. | `UpdatePolicyType()` |
2759| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
2760| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
2761| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2762
2763Attributes of the optimizer may also be modified via the member methods
2764`StepSize()`, `BatchSize()`, `MaxIterations()`, `InnerIterations()`,
2765`Tolerance()`, `Shuffle()`, `UpdatePolicy()`, `DecayPolicy()`, `ResetPolicy()`, and
2766`ExactObjective()`.
2767
2768Note that the default values for the `updatePolicy` and `decayPolicy` parameters
2769are simply the default constructors of the _`UpdatePolicyType`_ and
2770_`DecayPolicyType`_ classes.
2771
2772#### Examples:
2773
2774<details open>
2775<summary>Click to collapse/expand example code.
2776</summary>
2777
2778```c++
2779RosenbrockFunction f;
2780arma::mat coordinates = f.GetInitialPoint();
2781
2782// Standard stochastic variance reduced gradient.
2783SVRG optimizer(0.005, 1, 300, 0, 1e-10, true);
2784optimizer.Optimize(f, coordinates);
2785
2786// Stochastic variance reduced gradient with Barzilai-Borwein.
2787SVRG_BB bbOptimizer(0.005, batchSize, 300, 0, 1e-10, true, SVRGUpdate(),
2788    BarzilaiBorweinDecay(0.1));
2789bbOptimizer.Optimize(f, coordinates);
2790```
2791
2792</details>
2793
2794#### See also:
2795
2796 * [Accelerating Stochastic Gradient Descent using Predictive Variance Reduction](https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf)
2797 * [SGD](#standard-sgd)
2798 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2799 * [Differentiable separable functions](#differentiable-separable-functions)
2800
2801## SPALeRA Stochastic Gradient Descent (SPALeRASGD)
2802
2803*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2804
2805SPALeRA involves two components: a learning rate adaptation scheme, which
2806ensures that the learning system goes as fast as it can; and a catastrophic
2807event manager, which is in charge of detecting undesirable behaviors and getting
2808the system back on track.
2809
2810#### Constructors
2811
2812 * `SPALeRASGD<`_`DecayPolicyType`_`>()`
2813 * `SPALeRASGD<`_`DecayPolicyType`_`>(`_`stepSize, batchSize`_`)`
2814 * `SPALeRASGD<`_`DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, tolerance`_`)`
2815 * `SPALeRASGD<`_`DecayPolicyType`_`>(`_`stepSize, batchSize, maxIterations, tolerance, lambda, alpha, epsilon, adaptRate, shuffle, decayPolicy, resetPolicy, exactObjective`_`)`
2816
2817The _`DecayPolicyType`_ template parameter controls the decay in the step size
2818during the course of the optimization.  The `NoDecay` class is available for
2819use; custom behavior can be achieved by implementing a class with the same
2820method signatures.
2821
2822By default, _`DecayPolicyType`_ is set to `NoDecay`, so the shorter type
2823`SPALeRASGD<>` can be used instead of the equivalent `SPALeRASGD<NoDecay>`.
2824
2825#### Attributes
2826
2827| **type** | **name** | **description** | **default** |
2828|----------|----------|-----------------|-------------|
2829| `double` | **`stepSize`** | Step size for each iteration. | `0.01` |
2830| `size_t` | **`batchSize`** | Initial batch size. | `32` |
2831| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2832| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2833| `double` | **`lambda`** | Page-Hinkley update parameter. | `0.01` |
2834| `double` | **`alpha`** | Memory parameter of the Agnostic Learning Rate adaptation. | `0.001` |
2835| `double` | **`epsilon`** | Numerical stability parameter. | `1e-6` |
2836| `double` | **`adaptRate`** | Agnostic learning rate update rate. | `3.10e-8` |
2837| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
2838| `DecayPolicyType` | **`decayPolicy`** | Instantiated decay policy used to adjust the step size. | `DecayPolicyType()` |
2839| `bool` | **`resetPolicy`** | Flag that determines whether update policy parameters are reset before every Optimize call. | `true` |
2840| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2841
2842Attributes of the optimizer may also be modified via the member methods
2843`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Lambda()`,
2844`Alpha()`, `Epsilon()`, `AdaptRate()`, `Shuffle()`, `DecayPolicy()`, `ResetPolicy()`, and `ExactObjective()`.
2845
2846#### Examples
2847
2848<details open>
2849<summary>Click to collapse/expand example code.
2850</summary>
2851
2852```c++
2853RosenbrockFunction f;
2854arma::mat coordinates = f.GetInitialPoint();
2855
2856SPALeRASGD<> optimizer(0.05, 1, 10000, 1e-4);
2857optimizer.Optimize(f, coordinates);
2858```
2859
2860</details>
2861
2862#### See also:
2863
2864 * [Stochastic Gradient Descent: Going As Fast As Possible But Not Faster](https://arxiv.org/abs/1709.01427)
2865 * [SGD](#standard-sgd)
2866 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2867 * [Differentiable separable functions](#differentiable-separable-functions)
2868
2869## SWATS
2870
2871*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2872
2873SWATS is an optimizer that uses a simple strategy to switch from Adam to
2874standard SGD when a triggering condition is satisfied.  The condition relates to
2875the projection of Adam steps on the gradient subspace.
2876
2877#### Constructors
2878
2879 * `SWATS()`
2880 * `SWATS(`_`stepSize, batchSize`_`)`
2881 * `SWATS(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance`_`)`
2882 * `SWATS(`_`stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
2883
2884#### Attributes
2885
2886| **type** | **name** | **description** | **default** |
2887|----------|----------|-----------------|-------------|
2888| `double` | **`stepSize`** | Step size for each iteration. | `0.001` |
2889| `size_t` | **`batchSize`** | Number of points to process at each step. | `32` |
2890| `double` | **`beta1`** | Exponential decay rate for the first moment estimates. | `0.9` |
2891| `double` | **`beta2`** | Exponential decay rate for the weighted infinity norm estimates. | `0.999` |
2892| `double` | **`epsilon`** | Value used to initialize the mean squared gradient parameter. | `1e-16` |
2893| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2894| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2895| `bool` | **`shuffle`** | If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. | `true` |
2896| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2897| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2898
2899Attributes of the optimizer can also be modified via the member methods
2900`StepSize()`, `BatchSize()`, `Beta1()`, `Beta2()`, `Epsilon()`,
2901`MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and  `ExactObjective()`.
2902
2903#### Examples:
2904
2905<details open>
2906<summary>Click to collapse/expand example code.
2907</summary>
2908
2909```c++
2910RosenbrockFunction f;
2911arma::mat coordinates = f.GetInitialPoint();
2912
2913SWATS optimizer(0.001, 1, 0.9, 0.999, 1e-16, 5000000, 1e-9, true);
2914optimizer.Optimize(f, coordinates);
2915```
2916
2917</details>
2918
2919#### See also:
2920
2921 * [Improving generalization performance by switching from Adam to SGD](https://arxiv.org/abs/1712.07628)
2922 * [Adam](#adam)
2923 * [Standard SGD](#standard-sgd)
2924 * [Stochastic gradient descent in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2925 * [Differentiable separable functions](#differentiable-separable-functions)
2926
2927## WNGrad
2928
2929*An optimizer for [differentiable separable functions](#differentiable-separable-functions).*
2930
2931WNGrad is a general nonlinear update rule for the learning rate. WNGrad has
2932near-optimal convergence rates in both the batch and stochastic settings.
2933
2934#### Constructors
2935
2936 * `WNGrad()`
2937 * `WNGrad(`_`stepSize, batchSize`_`)`
2938 * `WNGrad(`_`stepSize, batchSize, maxIterations, tolerance, shuffle`_`)`
2939 * `WNGrad(`_`stepSize, batchSize, maxIterations, tolerance, shuffle, resetPolicy, exactObjective`_`)`
2940
2941#### Attributes
2942
2943| **type** | **name** | **description** | **default** |
2944|----------|----------|-----------------|-------------|
2945| `double` | **`stepSize`** | Step size for each iteration. | `0.562` |
2946| `size_t` | **`batchSize`** | Initial batch size. | `32` |
2947| `size_t` | **`maxIterations`** | Maximum number of iterations allowed (0 means no limit). | `100000` |
2948| `double` | **`tolerance`** | Maximum absolute tolerance to terminate algorithm. | `1e-5` |
2949| `bool` | **`shuffle`** | If true, the batch order is shuffled; otherwise, each batch is visited in linear order. | `true` |
2950| `bool` | **`resetPolicy`** | If true, parameters are reset before every Optimize call; otherwise, their values are retained. | `true` |
2951| `bool` | **`exactObjective`** | Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). | `false` |
2952
2953Attributes of the optimizer may also be modified via the member methods
2954`StepSize()`, `BatchSize()`, `MaxIterations()`, `Tolerance()`, `Shuffle()`, `ResetPolicy()`, and
2955`ExactObjective()`.
2956
2957#### Examples
2958
2959<details open>
2960<summary>Click to collapse/expand example code.
2961</summary>
2962
2963```c++
2964RosenbrockFunction f;
2965arma::mat coordinates = f.GetInitialPoint();
2966
2967WNGrad<> optimizer(0.562, 1, 10000, 1e-4);
2968optimizer.Optimize(f, coordinates);
2969```
2970
2971</details>
2972
2973#### See also:
2974
2975 * [WNGrad: Learn the Learning Rate in Gradient Descent](https://arxiv.org/abs/1803.02865)
2976 * [SGD](#standard-sgd)
2977 * [SGD in Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
2978 * [Differentiable separable functions](#differentiable-separable-functions)
2979