Understanding Memory Formats {#dev_guide_understanding_memory_formats}
======================================================================

## Introduction

Most computations are about data: analyzing data, adjusting data, reading and
storing data, generating data, etc. The DNN domain is no exception. Images,
weights/filters, sound, and text require efficient representation in computer
memory to facilitate performing operations fast and in the most convenient way.

This article is devoted to **data format** -- one form of data representation
that describes how multidimensional arrays (nD) are stored in linear (1D) memory
address space and why this is important for oneDNN.

@note For the purpose of this article, data *format* and *layout* are used
interchangeably.

### Nomenclature used

- Channels are the same as feature maps
- Upper-case letters denote the dimensions (e.g. `N`)
- Lower-case letters denote the index (e.g. `n`, where `0 <= n < N`)
- The notation for the activations:

  *batch*    **N**,
  *channels* **C**,
  *depth*    **D**,
  *height*   **H**,
  *width*    **W**

- The notation for the weights:

  *groups*          **G**,
  *output channels* **O**,
  *input channels*  **I**,
  *depth*           **D**,
  *height*          **H**,
  *width*           **W**

## Data formats

Let's first focus on data formats for activations (images).

Activations consist of channels (also called feature maps) and a spatial domain,
1D, 2D, or 3D. The spatial domain together with the channels forms an image.
During the training phase, images are typically grouped together in batches.
Even if there is only one image, we still assume that there is a batch with
batch size equal to 1. Hence, the overall dimensionality of activations is 4D
(**N**, **C**, **H**, and **W**) or 5D (**N**, **C**, **D**, **H**, and **W**).

For the sake of simplicity, we will use only a 2D spatial domain in this
article.
### Plain data formats

It is simplest to start with an example.

Consider 4D activations with a batch of 2, 16 channels, and a 5 x 4 spatial
domain. The logical representation is given in the picture below.
![Activations](mem_fmt_img1.png)

The value at position (n, c, h, w) is generated with the following formula:
~~~
    value(n, c, h, w) = n * CHW + c * HW + h * W + w
~~~

In order to define how data in this 4D-tensor is laid out in memory, we need to
define how to map it to a 1D tensor via an **offset** function that takes a
logical index (n, c, h, w) as an input and returns an address displacement to
the location of the value:
~~~
    offset : (int, int, int, int) --> int
~~~

#### NCHW

Let's describe the order in which the tensor values are laid out in memory for
one of the very popular formats, **NCHW**. The `[a:?]` marks correspond to the
jumps shown in the picture below, which depicts the 1D representation of an
NCHW tensor in memory.

- `[a:0]` First within a line, from left to right

- `[a:1]` Then line by line, from top to bottom

- `[a:2]` Then go from one plane to another (in depth)

- `[a:3]` And finally switch from one image in the batch (**n** = 0)
  to another (**n** = 1)

Then the offset function is:
~~~
    offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w
~~~

We use `nchw` here to denote that `w` is the inner-most dimension, meaning that
two elements adjacent in memory share the same indices of `n`, `c`, and `h`,
while their `w` indices differ by `1`. Of course, this is true only for
non-border elements. On the contrary, `n` is the outer-most dimension here,
meaning that to take the same pixel `(c, h, w)` in the next image you have to
jump over the whole image size `C*H*W`.

This data format is called **NCHW** and is used by default in BVLC\* Caffe.
TensorFlow\* also supports this data format.

@note It is just a coincidence that `offset_nchw()` is the same as `value()`
in this example.

One can create memory with the **NCHW** data layout using
#dnnl_nchw of the enum type #dnnl_format_tag_t defined in
[dnnl_types.h](https://github.com/oneapi-src/oneDNN/blob/master/include/oneapi/dnnl/dnnl_types.h)
for the C API, and dnnl::memory::format_tag::nchw defined in
[dnnl.hpp](https://github.com/oneapi-src/oneDNN/blob/master/include/oneapi/dnnl/dnnl.hpp)
for the C++ API.

#### NHWC

Another quite popular data format is **NHWC**, which uses the following offset
function:
~~~
    offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c
~~~

In this case, the inner-most dimension is channels (`[b:0]`), which is followed
by width (`[b:1]`), height (`[b:2]`), and finally batch (`[b:3]`).

For a single image (**N** = 1), this format is very similar to how the
[BMP file format](https://en.wikipedia.org/wiki/BMP_file_format) works,
where the image is kept pixel by pixel and every pixel contains all
required information about colors (for instance, three channels for 24-bit
BMP).

The NHWC data format is the default one for
[TensorFlow](https://www.tensorflow.org/performance/performance_guide#data_formats).

This layout corresponds to #dnnl_nhwc or dnnl::memory::format_tag::nhwc.

#### CHWN

The last example here of a plain data layout is **CHWN**, which is used by
[Neon](https://neon.nervanasys.com/index.html/design.html#data-layout).
This layout might be very interesting from a vectorization perspective if
an appropriate batch size is used, but, on the other hand, users cannot always
choose a *good* batch size (for example, for real-time inference the batch is
typically 1).

The dimension order is (from inner-most to outer-most):
batch (`[c:0]`), width (`[c:1]`), height (`[c:2]`), channels (`[c:3]`).
The offset function for the **CHWN** format is defined as:
~~~
    offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n
~~~

This layout corresponds to #dnnl_chwn or dnnl::memory::format_tag::chwn.

![Different plain layouts](mem_fmt_img2.png)

#### Relevant reading

[TensorFlow Doc. Shapes and Layout](https://www.tensorflow.org/performance/xla/shapes)

### Generalization of the plain data layout

#### Strides

In the previous examples the data was kept packed, or in dense form, meaning
pixels follow one another. Sometimes it might be necessary to not keep data
contiguous in memory. For instance, one might need to work with a sub-tensor
within a bigger tensor. Sometimes it might be beneficial to artificially make
the data disjoint, as in the case of GEMM with a non-trivial leading dimension
to get better performance
([see Tip 6](https://software.intel.com/content/www/us/en/develop/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function)).

The following picture shows a simplified case for a 2D matrix of size
`rows x columns` kept in row-major format where rows have some non-trivial
(that is, not equal to the number of columns) stride.

![Strides](strides.png)

In this case, the general offset function looks like:
~~~
    offset(n, c, h, w) = n * stride_n
                       + c * stride_c
                       + h * stride_h
                       + w * stride_w
~~~

Note that the **NCHW**, **NHWC**, and **CHWN** formats are just special cases of
the format with strides.
For example, for **NCHW** we have:
~~~
    stride_n = CHW, stride_c = HW, stride_h = W, stride_w = 1
~~~

A user can initialize a memory descriptor with strides:
~~~cpp
    dnnl_dims_t dims = {N, C, H, W};
    dnnl_dims_t strides = {stride_n, stride_c, stride_h, stride_w};

    dnnl_memory_desc_t md;
    dnnl_memory_desc_init_by_strides(&md, 4, dims, dnnl_f32, strides);
~~~

oneDNN supports strides via the blocking structure. The pseudo-code for
the function above is:
~~~cpp
    memory_desc_t md; // memory descriptor object

    // logical description, layout independent
    md.ndims = 4;           // # of dimensions
    md.dims = {N, C, H, W}; // dimensions themselves

    // physical description
    md.format_kind = dnnl_blocked; // generic blocked format
    md.format_desc.blocking.strides = {
        stride_n, stride_c, stride_h, stride_w
    };
~~~
In particular, whenever a user creates memory with the #dnnl_nchw format,
oneDNN computes the strides and fills the structure on behalf of the user.

## Blocked layout

Plain layouts give great flexibility and are very convenient for use. That is
why most of the frameworks and applications use either the **NCHW** or **NHWC**
layout. However, depending on the operation that is performed on the data,
those layouts may turn out to be sub-optimal from a performance perspective.

In order to achieve better vectorization and cache reuse, oneDNN introduces
a blocked layout that splits one or several dimensions into blocks of a fixed
size. The most popular oneDNN data format is **nChw16c** on AVX512+ systems
and **nChw8c** on SSE4.1+ systems. As one might guess from the name, the only
dimension that is blocked is channels, and the block size is either 16 in the
former case or 8 in the latter.
Precisely, the offset function for **nChw8c** is:
~~~
    offset_nChw8c(n, c, h, w) = n * CHW
                              + (c / 8) * HW*8
                              + h * W*8
                              + w * 8
                              + (c % 8)
~~~

Note that blocks of 8 channels are kept contiguously in memory. Pixel by pixel
the spatial domain is covered. Then the next slice covers the subsequent 8
channels (that is, moving from `c=0..7` to `c=8..15`). Once all channel blocks
are covered, the next image in the batch appears.

![nChw8c format](mem_fmt_blk.png)

@note We use lower- and uppercase letters in the formats to distinguish
between the blocks (e.g. 8c) and the remaining co-dimension (**C** = channels
/ 8).

The reason behind the format choice can be found in
[this paper](https://arxiv.org/pdf/1602.06709v1.pdf).

oneDNN describes this type of memory via the blocking structure as well. The
pseudo-code is:
~~~cpp
    memory_desc_t md;
    // logical description, layout independent
    md.ndims = 4;           // # of dimensions
    md.dims = {N, C, H, W}; // dimensions themselves

    // physical description
    md.format_kind = dnnl_blocked; // blocked layout

    ptrdiff_t stride_n = C*H*W;
    ptrdiff_t stride_C = H*W*8;
    ptrdiff_t stride_h = W*8;
    ptrdiff_t stride_w = 8;

    md.format_desc.blocking.strides = { // strides between blocks
        stride_n, stride_C, stride_h, stride_w
    };

    md.format_desc.blocking.inner_nblks = 1; // number of blocked dimensions;
                                             // 1, since only channels are blocked

    md.format_desc.blocking.inner_idxs[0] = 1; // only the 1st (c) dimension is
                                               // blocked (n is the 0th dim,
                                               // w is the 3rd dim)

    md.format_desc.blocking.inner_blks[0] = 8; // the 1st dimension is blocked by 8
~~~

### What if channels are not a multiple of 8 (or 16)?
The blocked data layout gives a significant performance improvement for
convolutions, but what to do when the number of channels is not a multiple of
the block size (for example, 17 channels for the **nChw8c** format)?

One possible way to handle that would be to use the blocked layout for as many
channels as possible by rounding them down to a multiple of the block size
(in this case `16 = 17 / 8 * 8`) and process the tail somehow. However, that
would lead to the introduction of very special tail-processing code into many
oneDNN kernels.

So we came up with another solution using zero-padding. The idea is to round
the channels up to make them a multiple of the block size and pad the
resulting tail with zeros (in the example above, `24 = div_up(17, 8) * 8`).
Then primitives like convolutions might work with the rounded-up number of
channels instead of the original one and still compute the correct result
(adding zeros does not change the result).

That enables supporting an arbitrary number of channels with almost no changes
to the kernels. The price is some extra computation on those zeros, but either
this is negligible or the performance with the overhead is still higher than
the performance with the plain data layout.

The picture below depicts the idea. Note that some extra computations occur
during the computation of `d0`, but that does not affect the result.

![Padded format](mem_fmt_padded_blk.png)

Some pitfalls of this approach:

- The memory size required to keep the data can no longer be computed by the
  formula `sizeof(data_type) * N * C * H * W`. The actual size should always
  be queried via dnnl_memory_desc_get_size() in C and
  dnnl::memory::desc::get_size() in C++.

- The actual zero-padding of oneDNN memory objects happens inside the
  primitive execution functions in order to minimize its performance
  impact.
  The current convention is that a primitive execution can
  assume its inputs are properly zero-padded, and should guarantee
  that its outputs are properly zero-padded. If a user implements custom
  kernels on oneDNN blocked memory objects, they should respect this
  convention. In particular, element-wise operations that are
  implemented in the user's code and directly operate on the oneDNN
  blocked layout like this:
  ~~~
      for (int e = 0; e < phys_size; ++e)
          x[e] = eltwise_op(x[e])
  ~~~
  are not safe if the data is padded with zeros and `eltwise_op(0) != 0`.

Relevant oneDNN code:
~~~cpp
    const int C = 17;
    const int C_padded = div_up(17, 8) * 8; // 24

    // logical description, layout independent
    const int ndims = 4;             // # of dimensions
    dnnl_dims_t dims = {N, C, H, W}; // dimensions themselves

    memory_desc_t md;
    // initialize memory descriptor
    dnnl_memory_desc_init_by_tag(&md, ndims,
            dims,
            dnnl_f32,   // single precision data type
            dnnl_nChw8c // blocked layout
            );

    ptrdiff_t expect_stride_n = C_padded*H*W; // note C_padded here, not C
    ptrdiff_t expect_stride_C = H*W*8;
    ptrdiff_t expect_stride_h = W*8;
    ptrdiff_t expect_stride_w = 8;
    ptrdiff_t expect_stride_8c = 1;

    bool expect_true = true
        && true // logical dims stay as is
        && md.dims[0] == N
        && md.dims[1] == C
        && md.dims[2] == H
        && md.dims[3] == W
        && true // padded dims are rounded accordingly
        && md.padded_dims[0] == N
        && md.padded_dims[1] == C_padded
        && md.padded_dims[2] == H
        && md.padded_dims[3] == W
        && true // strides between blocks correspond to the physical layout
        && md.format_desc.blocking.strides[0] == expect_stride_n
        && md.format_desc.blocking.strides[1] == expect_stride_C
        && md.format_desc.blocking.strides[2] == expect_stride_h
        && md.format_desc.blocking.strides[3] == expect_stride_w
        && true // inner-most blocking
        &&
           md.format_desc.blocking.inner_nblks == 1 // only 1 dim is blocked (c)
        && md.format_desc.blocking.inner_idxs[0] == 1 // the 1st (c) dim is blocked
        && md.format_desc.blocking.inner_blks[0] == 8; // the block size is 8

    assert(expect_true);
~~~