Understanding Memory Formats {#dev_guide_understanding_memory_formats}
======================================================================

## Introduction

Most computations are about data: analyzing data, adjusting data, reading and
storing data, generating data, etc. The DNN domain is no exception. Images,
weights/filters, sound, and text require an efficient representation in
computer memory so that operations on them can be performed quickly and
conveniently.

This article is devoted to **data format** -- one form of data representation
that describes how multidimensional arrays (nD) are stored in a linear (1D)
memory address space and why this is important for oneDNN.

@note For the purpose of this article, data *format* and *layout* are used
interchangeably.

### Nomenclature used

- Channels are the same as feature maps
- Upper-case letters denote the dimensions (e.g. `N`)
- Lower-case letters denote the index (e.g. `n`, where `0 <= n < N`)
- The notation for the activations:

  *batch*         **N**,
  *channels*      **C**,
  *depth*         **D**,
  *height*        **H**,
  *width*         **W**

- The notation for the weights:

  *groups*            **G**,
  *output channels*   **O**,
  *input channels*    **I**,
  *depth*             **D**,
  *height*            **H**,
  *width*             **W**

## Data formats

Let's first focus on data formats for activations (images).

Activations consist of channels (also called feature maps) and a spatial
domain, which is 1D, 2D, or 3D. The spatial domain together with channels forms
an image. During the training phase, images are typically grouped together in
batches. Even if there is only one image, we still assume that there is a batch
with batch size equal to 1. Hence, the overall dimensionality of activations is
4D (**N**, **C**, **H**, and **W**) or 5D (**N**, **C**, **D**, **H**, and
**W**).

For the sake of simplicity, this article uses only the 2D spatial domain.

### Plain data formats

It is simplest to start with an example.

Consider 4D activations with a batch size of 2, 16 channels, and a 5 x 4
spatial domain. The logical representation is given in the picture below.
![Activations](mem_fmt_img1.png)

The value at position (n, c, h, w) is generated with the following formula:
~~~
    value(n, c, h, w) = n * CHW + c * HW + h * W + w
~~~

In order to define how the data in this 4D tensor is laid out in memory, we
need to define how to map it to a 1D tensor via an **offset** function that
takes a logical index (n, c, h, w) as an input and returns an address
displacement to the location of the value:
~~~
    offset : (int, int, int, int) --> int
~~~

#### NCHW

Let's describe the order in which the tensor values are laid out in memory for
one of the very popular formats, **NCHW**. The `[a:?]` marks refer to the jumps
shown in the picture below, which shows the 1D representation of an NCHW tensor
in memory.

- `[a:0]`: first within a line, from left to right
- `[a:1]`: then line by line, from top to bottom
- `[a:2]`: then from one plane to another (in depth)
- `[a:3]`: and finally from one image in the batch (**n** = 0) to the next one
  (**n** = 1)

Then the offset function is:
~~~
    offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w
~~~

We use `nchw` here to denote that `w` is the innermost dimension, meaning that
two elements adjacent in memory share the same indices of `n`, `c`, and `h`,
while their `w` indices differ by `1`. This is, of course, true only for
non-border elements. Conversely, `n` is the outermost dimension here, meaning
that to reach the same pixel `(c, h, w)` in the next image, you have to jump
over the whole image size `C*H*W`.

This data format is called **NCHW** and is used by default in BVLC\* Caffe.
TensorFlow\* also supports this data format.

@note It is just a coincidence that `offset_nchw()` is the same as `value()`
in this example.

One can create memory with the **NCHW** data layout using
#dnnl_nchw of the enum type #dnnl_format_tag_t defined in
[dnnl_types.h](https://github.com/oneapi-src/oneDNN/blob/master/include/oneapi/dnnl/dnnl_types.h)
for the C API, and dnnl::memory::format_tag::nchw defined in
[dnnl.hpp](https://github.com/oneapi-src/oneDNN/blob/master/include/oneapi/dnnl/dnnl.hpp)
for the C++ API.
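
As a minimal sketch of how this looks in the C++ API (assuming a CPU engine is
available; the dimensions match the example above):
~~~cpp
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    using namespace dnnl;

    engine eng(engine::kind::cpu, 0); // assumes a CPU engine is available

    const memory::dim N = 2, C = 16, H = 5, W = 4;

    // logical dims + data type + format tag fully define the plain layout
    memory::desc md({N, C, H, W}, memory::data_type::f32,
                    memory::format_tag::nchw);

    memory mem(md, eng); // allocates N*C*H*W floats laid out as NCHW
    return 0;
}
~~~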


#### NHWC

Another quite popular data format is **NHWC**, which uses the following offset
function:
~~~
    offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c
~~~

In this case, the innermost dimension is channels (`[b:0]`), which is followed
by width (`[b:1]`), height (`[b:2]`), and finally batch (`[b:3]`).

For a single image (**N** = 1), this format is very similar to how the
[BMP file format](https://en.wikipedia.org/wiki/BMP_file_format) works,
where the image is kept pixel by pixel and every pixel contains all
required information about colors (for instance, three channels for 24-bit
BMP).

The NHWC data format is the default one for
[TensorFlow](https://www.tensorflow.org/performance/performance_guide#data_formats).

This layout corresponds to #dnnl_nhwc or dnnl::memory::format_tag::nhwc.


#### CHWN

The last example of a plain data layout here is **CHWN**, which is used by
[Neon](https://neon.nervanasys.com/index.html/design.html#data-layout).
This layout might be very interesting from a vectorization perspective if
an appropriate batch size is used, but users cannot always choose a *good*
batch size (for example, in the case of real-time inference the batch size is
typically 1).

The dimension order is (from innermost to outermost):
batch (`[c:0]`), width (`[c:1]`), height (`[c:2]`), channels (`[c:3]`).

The offset function for the **CHWN** format is defined as:
~~~
    offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n
~~~

This layout corresponds to #dnnl_chwn or dnnl::memory::format_tag::chwn.

![Different plain layouts](mem_fmt_img2.png)
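
To make the three offset functions above concrete, here is an illustrative
translation into plain code (these helpers are not part of the library; the
tensor sizes are passed explicitly):
~~~cpp
// Illustrative only: direct translations of the offset functions above.
// Each returns the linear memory index of the element with logical
// coordinates (n, c, h, w) in a dense tensor of size N x C x H x W.
inline int offset_nchw(int n, int c, int h, int w, int C, int H, int W) {
    return ((n * C + c) * H + h) * W + w; // n*CHW + c*HW + h*W + w
}

inline int offset_nhwc(int n, int c, int h, int w, int C, int H, int W) {
    return ((n * H + h) * W + w) * C + c; // n*HWC + h*WC + w*C + c
}

inline int offset_chwn(int n, int c, int h, int w, int N, int H, int W) {
    return ((c * H + h) * W + w) * N + n; // c*HWN + h*WN + w*N + n
}
~~~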

#### Relevant reading

[TensorFlow Doc. Shapes and Layout](https://www.tensorflow.org/performance/xla/shapes)

### Generalization of the plain data layout

#### Strides

In the previous examples, the data was kept packed (dense), meaning that pixels
follow one another in memory. Sometimes it might be necessary to not keep data
contiguous in memory. For instance, one might need to work with a sub-tensor
within a bigger tensor. Sometimes it might be beneficial to artificially make
the data disjoint, as in the case of GEMM with a non-trivial leading dimension,
to get better performance
([see Tip 6](https://software.intel.com/content/www/us/en/develop/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function)).

The following picture shows a simplified case for a 2D matrix of size
`rows x columns` kept in row-major format where rows have some non-trivial
(that is, not equal to the number of columns) stride.

![Strides](strides.png)

In this case, the general offset function looks like:
~~~
    offset(n, c, h, w) = n * stride_n
                       + c * stride_c
                       + h * stride_h
                       + w * stride_w
~~~

Note that the **NCHW**, **NHWC**, and **CHWN** formats are just special cases of
the format with strides. For example, for **NCHW** we have:
~~~
    stride_n = CHW, stride_c = HW, stride_h = W, stride_w = 1
~~~

A user can initialize a memory descriptor with strides:
~~~cpp
    dnnl_dims_t dims = {N, C, H, W};
    dnnl_dims_t strides = {stride_n, stride_c, stride_h, stride_w};

    dnnl_memory_desc_t md;
    dnnl_memory_desc_init_by_strides(&md, 4, dims, dnnl_f32, strides);
~~~
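
The same mechanism describes the strided 2D submatrix from the picture above.
The following sketch is purely illustrative; `rows`, `columns`, and `ld` (the
leading dimension of the bigger matrix, `ld >= columns`) are assumed to be
defined elsewhere:
~~~cpp
    // offset(r, c) = r * ld + c, i.e. rows are `ld` elements apart even
    // though only `columns` of them belong to the submatrix
    dnnl_dims_t sub_dims = {rows, columns};
    dnnl_dims_t sub_strides = {ld, 1};

    dnnl_memory_desc_t sub_md;
    dnnl_memory_desc_init_by_strides(&sub_md, 2, sub_dims, dnnl_f32,
                                     sub_strides);
~~~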


Internally, oneDNN supports strides via the blocking structure. In pseudo-code,
the descriptor for a strided `{N, C, H, W}` tensor is filled in as follows:
~~~cpp
    memory_desc_t md; // memory descriptor object

    // logical description, layout independent
    md.ndims = 4;           // # of dimensions
    md.dims = {N, C, H, W}; // dimensions themselves

    // physical description
    md.format_kind = dnnl_blocked; // generic blocked format
    md.format_desc.blocking.strides = {
        stride_n, stride_c, stride_h, stride_w
    };
~~~
In particular, whenever a user creates memory with the #dnnl_nchw format,
oneDNN computes the strides and fills the structure on behalf of the
user.
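
A quick, purely illustrative way to see this with the C API is to build one
descriptor from the #dnnl_nchw tag and another from the explicit NCHW strides,
and check that they describe the same memory:
~~~cpp
    dnnl_dims_t dims = {N, C, H, W};

    // descriptor created from the format tag
    dnnl_memory_desc_t md_tag;
    dnnl_memory_desc_init_by_tag(&md_tag, 4, dims, dnnl_f32, dnnl_nchw);

    // descriptor created from the equivalent explicit strides
    dnnl_dims_t strides = {C * H * W, H * W, W, 1};
    dnnl_memory_desc_t md_strides;
    dnnl_memory_desc_init_by_strides(&md_strides, 4, dims, dnnl_f32, strides);

    // both describe exactly the same physical layout
    assert(dnnl_memory_desc_equal(&md_tag, &md_strides));
~~~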


## Blocked layout

Plain layouts give great flexibility and are very convenient to use. That is
why most of the frameworks and applications use either the **NCHW** or **NHWC**
layout. However, depending on the operation that is performed on the data, it
might turn out that those layouts are sub-optimal from the performance
perspective.

In order to achieve better vectorization and cache reuse, oneDNN
introduces a blocked layout that splits one or several dimensions into
blocks of a fixed size. The most popular oneDNN data format is
**nChw16c** on AVX512+ systems and **nChw8c** on SSE4.1+ systems. As one might
guess from the name, the only dimension that is blocked is channels, and the
block size is either 16 in the former case or 8 in the latter case.

Precisely, the offset function for **nChw8c** is:
~~~
    offset_nChw8c(n, c, h, w) = n * CHW
                              + (c / 8) * HW*8
                              + h * W*8
                              + w * 8
                              + (c % 8)
~~~

Note that blocks of 8 channels are kept contiguously in memory. Pixel by pixel
the spatial domain is covered. Then the next slice covers the subsequent 8
channels (that is, moving from `c=0..7` to `c=8..15`). Once all channel blocks
are covered, the next image in the batch appears.
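
As an illustration (not library code), the formula above can be written as a
small helper; note that `C` here must already be the number of channels rounded
up to a multiple of 8 (see the zero-padding discussion below):
~~~cpp
// Illustrative only: linear memory index of the logical element (n, c, h, w)
// in an nChw8c tensor. C must already be padded to a multiple of 8.
inline int offset_nChw8c(int n, int c, int h, int w, int C, int H, int W) {
    const int b = 8;            // block size over channels
    return n * C * H * W        // jump to image n
         + (c / b) * H * W * b  // jump to the channel block
         + h * W * b            // jump to the row within the block
         + w * b                // jump to the pixel within the row
         + (c % b);             // position within the channel block
}
~~~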

![nChw8c format](mem_fmt_blk.png)

@note We use lower- and uppercase letters in the formats to distinguish
between the blocks (e.g. 8c) and the remaining co-dimension (**C** = channels
/ 8).

The reasoning behind the format choice can be found in
[this paper](https://arxiv.org/pdf/1602.06709v1.pdf).

oneDNN describes this type of memory via the blocking structure as well. The
pseudo-code is:
~~~cpp
    memory_desc_t md;
    // logical description, layout independent
    md.ndims = 4;           // # of dimensions
    md.dims = {N, C, H, W}; // dimensions themselves

    // physical description
    md.format_kind = dnnl_blocked; // generic blocked format

    ptrdiff_t stride_n = C*H*W;
    ptrdiff_t stride_C = H*W*8;
    ptrdiff_t stride_h =   W*8;
    ptrdiff_t stride_w =     8;

    md.format_desc.blocking.strides = { // strides between blocks
        stride_n, stride_C, stride_h, stride_w
    };

    md.format_desc.blocking.inner_nblks = 1; // number of blocked dimensions;
                                             // 1, since only channels are blocked

    md.format_desc.blocking.inner_idxs[0] = 1; // only dimension 1 (c) is blocked
                                               // (n is dimension 0, w is dimension 3)

    md.format_desc.blocking.inner_blks[0] = 8; // the c dimension is blocked by 8
~~~

### What if channels are not a multiple of 8 (or 16)?

The blocked data layout gives a significant performance improvement for
convolutions, but what should be done when the number of channels is not a
multiple of the block size (for example, 17 channels for the **nChw8c**
format)?

One possible way to handle that would be to use the blocked layout for as
many channels as possible by rounding them down to a number that is a multiple
of the block size (in this case `16 = 17 / 8 * 8`) and process the tail somehow.
However, that would lead to the introduction of very special tail-processing
code into many oneDNN kernels.

So we came up with another solution using zero-padding. The idea is to round the
number of channels up to a multiple of the block size and pad the resulting tail
with zeros (in the example above, `24 = div_up(17, 8) * 8`). Then primitives
like convolutions can work with the rounded-up number of channels instead of
the original one and still compute the correct result (adding zeros does not
change the result).

That enables supporting an arbitrary number of channels with almost no changes
to the kernels. The price is some extra computation on those zeros, but this
overhead is either negligible or still smaller than the cost of falling back to
the plain data layout.

The picture below depicts the idea. Note that some extra computations occur
during the computation of `d0`, but they do not affect the result.

![Padded format](mem_fmt_padded_blk.png)
Some pitfalls of the given approach:

- The memory size required to keep the data cannot be computed by the formula
  `sizeof(data_type) * N * C * H * W` anymore. The actual size should always be
  queried via dnnl_memory_desc_get_size() in C and
  dnnl::memory::desc::get_size() in C++.

- The actual zero-padding of oneDNN memory objects happens inside the
  primitive execution functions in order to minimize its performance
  impact. The current convention is that a primitive execution can
  assume its inputs are properly zero-padded, and must guarantee that
  its outputs are properly zero-padded. If a user implements custom
  kernels on oneDNN blocked memory objects, they should respect this
  convention. In particular, element-wise operations that are
  implemented in the user's code and directly operate on the oneDNN
  blocked layout like this:
  ~~~
      for (int e = 0; e < phys_size; ++e)
          x[e] = eltwise_op(x[e])
  ~~~
  are not safe if the data is padded with zeros and `eltwise_op(0) != 0`;
  a safer alternative is sketched right after this list.
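
One way to stay safe, shown here as a purely illustrative sketch, is to loop
over the logical indices only and compute the physical offset with a helper
like the `offset_nChw8c()` sketched earlier; padded elements are never touched
and therefore stay zero:
~~~cpp
    // N, C, H, W are the logical dimensions; C_padded = div_up(C, 8) * 8.
    // Only logical elements are updated, so the zero-padding stays intact.
    for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
    for (int h = 0; h < H; ++h)
    for (int w = 0; w < W; ++w) {
        const int off = offset_nChw8c(n, c, h, w, C_padded, H, W);
        x[off] = eltwise_op(x[off]);
    }
~~~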

The relevant oneDNN code for the padded **nChw8c** example:
~~~cpp
    const int C = 17;
    const int C_padded = div_up(C, 8) * 8; // 24; div_up(a, b) = (a + b - 1) / b

    // logical description, layout independent
    const int ndims = 4;             // # of dimensions
    dnnl_dims_t dims = {N, C, H, W}; // dimensions themselves

    dnnl_memory_desc_t md;
    // initialize memory descriptor
    dnnl_memory_desc_init_by_tag(&md, ndims,
                                 dims,
                                 dnnl_f32,   // single precision data type
                                 dnnl_nChw8c // blocked layout
                                 );

    ptrdiff_t expect_stride_n = C_padded*H*W;   // note C_padded here, not C
    ptrdiff_t expect_stride_C =          H*W*8;
    ptrdiff_t expect_stride_h =            W*8;
    ptrdiff_t expect_stride_w =              8;
    ptrdiff_t expect_stride_8c =             1;

    bool expect_true = true
        && true // logical dims stay as is
        && md.dims[0] == N
        && md.dims[1] == C
        && md.dims[2] == H
        && md.dims[3] == W
        && true // padded dims are rounded accordingly
        && md.padded_dims[0] == N
        && md.padded_dims[1] == C_padded
        && md.padded_dims[2] == H
        && md.padded_dims[3] == W
        && true // strides between blocks correspond to the physical layout
        && md.format_desc.blocking.strides[0] == expect_stride_n
        && md.format_desc.blocking.strides[1] == expect_stride_C
        && md.format_desc.blocking.strides[2] == expect_stride_h
        && md.format_desc.blocking.strides[3] == expect_stride_w
        && true // innermost blocking
        && md.format_desc.blocking.inner_nblks == 1 // only 1 dim is blocked (c)
        && md.format_desc.blocking.inner_idxs[0] == 1  // the blocked dim is c
        && md.format_desc.blocking.inner_blks[0] == 8; // the block size is 8

    assert(expect_true);
~~~

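As noted in the pitfalls above, the size of the corresponding buffer must be
queried rather than computed from the logical dimensions. A short illustrative
check, continuing the snippet above:
~~~cpp
    // The allocation size is based on the padded dimensions
    // (N x 24 x H x W floats here), not on the logical ones.
    size_t sz = dnnl_memory_desc_get_size(&md);
    assert(sz == (size_t)N * C_padded * H * W * sizeof(float));
~~~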