Softmax {#dev_guide_softmax}
============================

>
> [API Reference](@ref dnnl_api_softmax)
>

## General

The softmax primitive performs softmax along a particular axis on data with
arbitrary dimensions. All other axes are treated as independent (batch).

### Forward

In general form, the operation is defined by the following formulas (the
variable names follow the standard @ref dev_guide_conventions):

\f[
    \dst(\overline{ou}, c, \overline{in}) =
        \frac
        {e^{\src(\overline{ou}, c, \overline{in}) - \nu(\overline{ou}, \overline{in})}}
        {
            \sum\limits_{ic}
                e^{\src(\overline{ou}, ic, \overline{in}) - \nu(\overline{ou}, \overline{in})}
        },
\f]

where

- \f$c\f$ is the axis over which the softmax is computed,
- \f$\overline{ou}\f$ is the outermost index (to the left of the softmax axis),
- \f$\overline{in}\f$ is the innermost index (to the right of the softmax axis), and
- \f$\nu\f$ is used to produce more accurate results and is defined as:

\f[
    \nu(\overline{ou}, \overline{in}) =
        \max\limits_{ic}
        \src(\overline{ou}, ic, \overline{in})
\f]
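
To make the formulas concrete, the snippet below is a minimal, plain C++
sketch of a numerically stable softmax over the innermost axis of a row-major
2D tensor (illustrative only; this is not oneDNN code). The subtraction of the
row maximum corresponds to \f$\nu\f$ above.

~~~cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax over the innermost axis of a row-major rows x cols tensor, so the
// outer index plays the role of ou and the inner index is the softmax axis c.
void softmax_rows(std::vector<float> &data, int rows, int cols) {
    for (int ou = 0; ou < rows; ++ou) {
        float *row = data.data() + ou * cols;

        // nu(ou) = max over the softmax axis; subtracting it keeps exp() from
        // overflowing without changing the result.
        float nu = *std::max_element(row, row + cols);

        float denom = 0.f;
        for (int c = 0; c < cols; ++c) {
            row[c] = std::exp(row[c] - nu);
            denom += row[c];
        }
        for (int c = 0; c < cols; ++c)
            row[c] /= denom;
    }
}

int main() {
    std::vector<float> x = {1.f, 2.f, 3.f, 1000.f, 1001.f, 1002.f};
    softmax_rows(x, /*rows=*/2, /*cols=*/3);
    // Both rows print roughly 0.090 0.245 0.665: the large constant offset in
    // the second row cancels out because of the max subtraction.
    for (float v : x) std::printf("%.3f ", v);
    std::printf("\n");
    return 0;
}
~~~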

#### Difference Between Forward Training and Forward Inference

There is no difference between the #dnnl_forward_training
and #dnnl_forward_inference propagation kinds.

### Backward

The backward propagation computes \f$\diffsrc(ou, c, in)\f$ based on
\f$\diffdst(ou, c, in)\f$ and \f$\dst(ou, c, in)\f$.
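
Mathematically this reduces to the standard softmax Jacobian-vector product,
shown here for reference (this form is not stated in the library documentation
itself; it is the conventional expression for the softmax gradient):

\f[
    \diffsrc(ou, c, in) =
        \dst(ou, c, in) \cdot
        \left(
            \diffdst(ou, c, in)
            - \sum\limits_{ic} \diffdst(ou, ic, in) \cdot \dst(ou, ic, in)
        \right)
\f]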

## Execution Arguments

When executed, the inputs and outputs should be mapped to an execution
argument index as specified by the following table.

| Primitive input/output | Execution argument index |
| ---                    | ---                      |
| \src                   | DNNL_ARG_SRC             |
| \dst                   | DNNL_ARG_DST             |
| \diffsrc               | DNNL_ARG_DIFF_SRC        |
| \diffdst               | DNNL_ARG_DIFF_DST        |
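
As an illustration of this mapping, the sketch below creates and executes a
forward softmax with the oneDNN C++ API. The shape, the axis, and the variable
names are arbitrary, and the descriptor-creation style follows the v2.x API,
which may differ in other library versions.

~~~cpp
#include <unordered_map>
#include "dnnl.hpp"

int main() {
    using namespace dnnl;

    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // 2D tensor {N, C}; softmax is computed over axis 1 (C).
    memory::desc data_md({16, 1000}, memory::data_type::f32,
            memory::format_tag::ab);

    // Logical operation descriptor -> primitive descriptor -> primitive.
    auto softmax_d = softmax_forward::desc(
            prop_kind::forward_inference, data_md, /*softmax_axis=*/1);
    auto softmax_pd = softmax_forward::primitive_desc(softmax_d, eng);

    memory src_mem(data_md, eng), dst_mem(data_md, eng);

    // The execution argument indices come from the table above. Passing the
    // same memory object for both DNNL_ARG_SRC and DNNL_ARG_DST would run the
    // primitive in place (see General Notes below).
    softmax_forward(softmax_pd)
            .execute(s, {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_DST, dst_mem}});
    s.wait();

    return 0;
}
~~~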

## Implementation Details

### General Notes

1. Both forward and backward propagation support in-place operations, meaning
   that `src` can be used as input and output for forward propagation, and
   `diff_dst` can be used as input and output for backward propagation. In case
   of an in-place operation, the original data will be overwritten.

### Post-ops and Attributes

The softmax primitive does not support any post-ops or attributes.

### Data Type Support

The softmax primitive supports the following combinations of data types:

| Propagation        | Source / Destination
| :--                | :--
| forward / backward | bf16, f32
| forward            | f16

### Data Representation

#### Source, Destination, and Their Gradients

The softmax primitive works with arbitrary data tensors. There is no special
meaning associated with any logical dimensions. However, the softmax axis is
typically referred to as channels (hence in formulas we use \f$c\f$).

## Implementation Limitations

1. No primitive-specific limitations. Refer to @ref dev_guide_data_types for
   limitations related to data type support.

## Performance Tips

1. Use in-place operations whenever possible.

2. Currently the softmax primitive is optimized for the cases where
   the dimension of the softmax axis is physically dense. For instance
   (see also the sketch after this list):
   - Optimized: 2D case, tensor \f$A \times B\f$,
                softmax axis 1 (B), format tag #dnnl_ab
   - Optimized: 4D case, tensor \f$A \times B \times C \times D\f$,
                softmax axis 3 (D), format tag #dnnl_abcd
   - Optimized: 4D case, tensor \f$A \times B \times C \times D\f$,
                softmax axis 1 (B), format tag #dnnl_abcd, and
                \f$C = D = 1\f$
   - Optimized: 4D case, tensor \f$A \times B \times C \times D\f$,
                softmax axis 1 (B), format tag #dnnl_acdb or #dnnl_aBcd16b, and
                \f$C \cdot D \ne 1\f$
   - Non-optimized: 2D case, tensor \f$A \times B\f$,
                    softmax axis 0 (A), format tag #dnnl_ab,
                    and \f$B \ne 1\f$
   - Non-optimized: 2D case, tensor \f$A \times B\f$,
                    softmax axis 1 (B), format tag #dnnl_ba,
                    and \f$A \ne 1\f$
   - Non-optimized: 4D case, tensor \f$A \times B \times C \times D\f$,
                    softmax axis 2 (C), format tag #dnnl_acdb,
                    and \f$D \cdot B \ne 1\f$
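
As a sketch of the difference between the dense and strided 2D cases above,
the two memory descriptors below use the same \f$A \times B\f$ shape: with
#dnnl_ab the softmax axis 1 (B) is the dense, innermost dimension (optimized),
while with #dnnl_ba the same axis becomes strided (one of the non-optimized
cases). The shape is illustrative only.

~~~cpp
#include "dnnl.hpp"

int main() {
    using namespace dnnl;

    // A x B, format tag ab: axis 1 (B) has stride 1, i.e. it is physically
    // dense, which matches the optimized 2D case above.
    memory::desc dense_axis_md({128, 512}, memory::data_type::f32,
            memory::format_tag::ab);

    // A x B, format tag ba: axis 1 (B) now has stride A, so a softmax over
    // axis 1 falls into the non-optimized category.
    memory::desc strided_axis_md({128, 512}, memory::data_type::f32,
            memory::format_tag::ba);

    (void)dense_axis_md;
    (void)strided_axis_md;
    return 0;
}
~~~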

## Examples

| Engine  | Name                     | Comments
| :--     | :--                      | :--
| CPU/GPU | @ref softmax_example_cpp | @copydetails softmax_example_cpp_short