Tiling
======

The naive view of an image in memory is that the pixels are stored one after
another in memory, usually in X-major order. An image that is arranged in
this way is called "linear". Linear images, while easy to reason about, can
have very bad cache locality. Graphics operations tend to act on pixels that
are close together in 2-D euclidean space. If you move one pixel to the right
or left in a linear image, you only move a few bytes to one side or the other
in memory. However, if you move one pixel up or down you can end up kilobytes
or even megabytes away.

Tiling (sometimes referred to as swizzling) is a method of re-arranging the
pixels of a surface so that pixels which are close in 2-D euclidean space are
likely to be close in memory.

Basics
------

The basic idea of a tiled image is that the image is first divided into
two-dimensional blocks or tiles. Each tile takes up a chunk of contiguous
memory and the tiles are arranged like pixels in a linear surface. This is best
demonstrated with a specific example. Suppose we have an RGBA8888 X-tiled
surface on Intel graphics. Then the surface is divided into 128x8 pixel tiles
each of which is 4KB of memory. Within each tile, the pixels are laid out like
a 128x8 linear image. The tiles themselves are laid out row-major in memory
like giant pixels. This means that, as long as you don't leave your 128x8
tile, you can move in both dimensions without leaving the same 4K page in
memory.

.. image:: tiling-basic.svg
   :alt: Example of an X-tiled image

You can, however, do even better than this. Suppose that same image is,
instead, Y-tiled. Then the surface is divided into 32x32 pixel tiles each of
which is 4KB of memory. Within a tile, each 64B cache line corresponds to a 4x4
pixel region of the image (you can think of it as a tile within a tile). This
means that very small deviations don't even leave the cache line. This added
bit of pixel shuffling is known to have a substantial performance impact in
most real-world applications.

Intel graphics has several different tiling formats that we'll discuss in
detail in later sections. The most commonly used as of the writing of this
chapter is Y-tiling. In all tiling formats the basic principle is the same:
The image is divided into tiles of a particular size and, within those tiles,
the data is re-arranged (or swizzled) based on a particular pattern. A tile
size will always be specified in bytes by rows and the actual X-dimension of
the tile in elements depends on the size of the element in bytes.

Bit-6 Swizzling
^^^^^^^^^^^^^^^

On some older hardware, there is an additional address swizzle that is applied
on top of the tiling format. This has been removed starting with Broadwell
because, as it says in the Broadwell PRM Vol 5 "Tiling Algorithm" (p. 17):

   Address Swizzling for Tiled-Surfaces is no longer used because the main
   memory controller has a more effective address swizzling algorithm.

Whether or not swizzling is enabled depends on the memory configuration of the
system. Generally, systems with dual-channel RAM have swizzling enabled and
single-channel do not. Supposedly, this swizzling allows for better balancing
between the two memory channels and increases performance. Because it depends
on the memory configuration which may change from one boot to the next, it
requires a run-time check.

The best documentation for bit-6 swizzling can be found in the Haswell PRM Vol.
5 "Memory Views" in the section entitled "Address Swizzling for Tiled-Y
Surfaces". It exists on older platforms but the docs get progressively worse
the further you go back.

ISL Representation
------------------

The structure of any given tiling format is represented by ISL using the
:cpp:enum:`isl_tiling` enum and the :cpp:struct:`isl_tile_info` structure:

.. doxygenenum:: isl_tiling

.. doxygenfunction:: isl_tiling_get_info

.. doxygenstruct:: isl_tile_info
   :members:

The `isl_tile_info` structure has two different sizes for a tile: a logical
size in surface elements and a physical size in bytes. In order to determine
the proper logical size, the bits-per-block of the underlying format has to be
passed into `isl_tiling_get_info`. The proper way to compute the size of an
image in bytes given a width and height in elements is as follows:

.. code-block:: c

   /* Surface width and height in tiles, rounded up to whole tiles */
   uint32_t width_tl = DIV_ROUND_UP(width_el * (format_bpb / tile_info.format_bpb),
                                    tile_info.logical_extent_el.w);
   uint32_t height_tl = DIV_ROUND_UP(height_el, tile_info.logical_extent_el.h);
   /* Pitch and size use the physical extent, which is in bytes by rows */
   uint32_t row_pitch = width_tl * tile_info.phys_extent_B.w;
   uint32_t size = height_tl * tile_info.phys_extent_B.h * row_pitch;

It is very important to note that there is no direct conversion between
:cpp:member:`isl_tile_info::logical_extent_el` and
:cpp:member:`isl_tile_info::phys_extent_B`. It is tempting to assume that the
logical and physical heights are the same and simply divide the width of
:cpp:member:`isl_tile_info::phys_extent_B` by the size of the format (which is
what the PRM does) to get :cpp:member:`isl_tile_info::logical_extent_el` but
this is not at all correct. Some tiling formats have logical and physical
heights that differ and so no such calculation will work in general. The
easiest case study for this is W-tiling. From the Sky Lake PRM Vol. 2d,
"RENDER_SURFACE_STATE" (p. 427):

   If the surface is a stencil buffer (and thus has Tile Mode set to
   TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on
   width, as the stencil buffer is stored with two rows interleaved.

What does this mean? Why are we multiplying the pitch by two? What does it
mean that "the stencil buffer is stored with two rows interleaved"? The
explanation for all these questions is that a W-tile (which is only used for
stencil) has a logical size of 64el x 64el but a physical size of 128B
x 32rows. In memory, a W-tile has the same footprint as a Y-tile (128B
x 32rows) but every pair of rows in the stencil buffer is interleaved into
a single row of bytes yielding a two-dimensional area of 64el x 64el. You can
consider this as its own tiling format or as a modification of Y-tiling. The
interpretation in the PRMs varies by hardware generation; on Sandy Bridge they
simply said it was Y-tiled but by Sky Lake there is almost no mention of
Y-tiling in connection with stencil buffers and they are always W-tiled. This
mismatch between logical and physical tile sizes is also relevant for
hierarchical depth buffers as well as single-channel MCS and CCS buffers.

X-tiling
--------

The simplest tiling format available on Intel graphics (which has been
available since gen4) is X-tiling. An X-tile is 512B x 8rows and, within the
tile, the data is arranged in an X-major linear fashion. You can also look at
X-tiling as being an 8x8 cache line grid where the cache lines are arranged
X-major as follows:

===== ===== ===== ===== ===== ===== ===== =====
===== ===== ===== ===== ===== ===== ===== =====
0x000 0x040 0x080 0x0c0 0x100 0x140 0x180 0x1c0
0x200 0x240 0x280 0x2c0 0x300 0x340 0x380 0x3c0
0x400 0x440 0x480 0x4c0 0x500 0x540 0x580 0x5c0
0x600 0x640 0x680 0x6c0 0x700 0x740 0x780 0x7c0
0x800 0x840 0x880 0x8c0 0x900 0x940 0x980 0x9c0
0xa00 0xa40 0xa80 0xac0 0xb00 0xb40 0xb80 0xbc0
0xc00 0xc40 0xc80 0xcc0 0xd00 0xd40 0xd80 0xdc0
0xe00 0xe40 0xe80 0xec0 0xf00 0xf40 0xf80 0xfc0
===== ===== ===== ===== ===== ===== ===== =====

Each cache line represents a piece of a single row of pixels within the image.
The memory locations of two vertically adjacent pixels within the same X-tile
always differ by 512B or 8 cache lines.

As mentioned above, X-tiling is slower than Y-tiling (though still faster than
linear). However, until Sky Lake, the display scan-out hardware could only do
X-tiling so we have historically used X-tiling for all window-system buffers
(because X or a Wayland compositor may want to put it in a plane).

Bit-6 Swizzling
^^^^^^^^^^^^^^^

When bit-6 swizzling is enabled, bits 9 and 10 are XOR'd in with bit 6 of the
tiled address:

.. code-block:: c

   /* addr[6] ^= addr[9] ^ addr[10], where addr[n] is bit n of the address */
   addr ^= (((addr >> 9) ^ (addr >> 10)) & 1) << 6;

Y-tiling
--------

The Y-tiling format, also available since gen4, is substantially different from
X-tiling and performs much better in practice. Each Y-tile is an 8x8 grid of
cache lines arranged Y-major as follows:

===== ===== ===== ===== ===== ===== ===== =====
===== ===== ===== ===== ===== ===== ===== =====
0x000 0x200 0x400 0x600 0x800 0xa00 0xc00 0xe00
0x040 0x240 0x440 0x640 0x840 0xa40 0xc40 0xe40
0x080 0x280 0x480 0x680 0x880 0xa80 0xc80 0xe80
0x0c0 0x2c0 0x4c0 0x6c0 0x8c0 0xac0 0xcc0 0xec0
0x100 0x300 0x500 0x700 0x900 0xb00 0xd00 0xf00
0x140 0x340 0x540 0x740 0x940 0xb40 0xd40 0xf40
0x180 0x380 0x580 0x780 0x980 0xb80 0xd80 0xf80
0x1c0 0x3c0 0x5c0 0x7c0 0x9c0 0xbc0 0xdc0 0xfc0
===== ===== ===== ===== ===== ===== ===== =====

Each 64B cache line within the tile is laid out as 4 rows of 16B each:

==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0a 0x0b 0x0c 0x0d 0x0e 0x0f
0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f
0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f
0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39 0x3a 0x3b 0x3c 0x3d 0x3e 0x3f
==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Y-tiling is widely regarded as being substantially faster than X-tiling so it
is generally preferred. However, prior to Sky Lake, Y-tiling was not available
for scanout so X-tiling was used for any sort of window-system buffers.
Starting with Sky Lake, we can scan out from Y-tiled buffers.

Bit-6 Swizzling
^^^^^^^^^^^^^^^

When bit-6 swizzling is enabled, bit 9 is XOR'd in with bit 6 of the tiled
address:

.. code-block:: c

   /* addr[6] ^= addr[9], where addr[n] is bit n of the address */
   addr ^= ((addr >> 9) & 1) << 6;

W-tiling
--------

W-tiling is a new tiling format added on Sandy Bridge for use in stencil
buffers. W-tiling is similar to Y-tiling in that it's arranged as an 8x8
Y-major grid of cache lines. The bytes within each cache line are arranged as
follows:

==== ==== ==== ==== ==== ==== ==== ====
==== ==== ==== ==== ==== ==== ==== ====
0x00 0x01 0x04 0x05 0x10 0x11 0x14 0x15
0x02 0x03 0x06 0x07 0x12 0x13 0x16 0x17
0x08 0x09 0x0c 0x0d 0x18 0x19 0x1c 0x1d
0x0a 0x0b 0x0e 0x0f 0x1a 0x1b 0x1e 0x1f
0x20 0x21 0x24 0x25 0x30 0x31 0x34 0x35
0x22 0x23 0x26 0x27 0x32 0x33 0x36 0x37
0x28 0x29 0x2c 0x2d 0x38 0x39 0x3c 0x3d
0x2a 0x2b 0x2e 0x2f 0x3a 0x3b 0x3e 0x3f
==== ==== ==== ==== ==== ==== ==== ====

While W-tiling has been required for stencil all the way back to Sandy Bridge,
the docs are somewhat confused as to whether stencil buffers are W or Y-tiled.
This seems to stem from the fact that the hardware seems to implement W-tiling
as a sort of modified Y-tiling. One example of this is the somewhat odd
requirement that W-tiled buffers have their pitch multiplied by 2. From the
Sky Lake PRM Vol. 2d, "RENDER_SURFACE_STATE" (p. 427):

   If the surface is a stencil buffer (and thus has Tile Mode set to
   TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on
   width, as the stencil buffer is stored with two rows interleaved.

The last phrase holds the key here: "the stencil buffer is stored with two rows
interleaved". More accurately, a W-tiled buffer can be viewed as a Y-tiled
buffer with each set of 4 W-tiled lines interleaved to form 2 Y-tiled lines. In
ISL, we represent a W-tile as a tiling with a logical dimension of 64el x 64el
but a physical size of 128B x 32rows. This cleanly takes care of the pitch
issue above and seems to nicely model the hardware.

Tile4
-----

The tile4 format, introduced on Xe-HP, is somewhat similar to Y but with more
internal shuffling. Each tile4 tile is an 8x8 grid of cache lines arranged
as follows:

===== ===== ===== ===== ===== ===== ===== =====
===== ===== ===== ===== ===== ===== ===== =====
0x000 0x040 0x080 0x0c0 0x200 0x240 0x280 0x2c0
0x100 0x140 0x180 0x1c0 0x300 0x340 0x380 0x3c0
0x400 0x440 0x480 0x4c0 0x600 0x640 0x680 0x6c0
0x500 0x540 0x580 0x5c0 0x700 0x740 0x780 0x7c0
0x800 0x840 0x880 0x8c0 0xa00 0xa40 0xa80 0xac0
0x900 0x940 0x980 0x9c0 0xb00 0xb40 0xb80 0xbc0
0xc00 0xc40 0xc80 0xcc0 0xe00 0xe40 0xe80 0xec0
0xd00 0xd40 0xd80 0xdc0 0xf00 0xf40 0xf80 0xfc0
===== ===== ===== ===== ===== ===== ===== =====

Each 64B cache line within the tile is laid out the same way as for a Y-tile,
as 4 rows of 16B each:

==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0a 0x0b 0x0c 0x0d 0x0e 0x0f
0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f
0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f
0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39 0x3a 0x3b 0x3c 0x3d 0x3e 0x3f
==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Tiling as a bit pattern
-----------------------

There is one more important angle on tiling that should be discussed before we
finish. Every tiling can be described by three things:

1. A logical width and height in elements
2. A physical width in bytes and height in rows
3. A mapping from logical elements to physical bytes within the tile

We have spent a good deal of time on the first two because this is what you
really need for doing surface layout calculations. However, there are cases in
which the map from logical to physical elements is critical. One example is
W-tiling where we have code to do W-tiled encoding and decoding in the shader
for doing stencil blits because the hardware does not allow us to render to
W-tiled surfaces.

There are many ways to mathematically describe the mapping from logical
elements to physical bytes. In the PRMs they give a very complicated set of
formulas involving lots of multiplication, modulus, and sums that show you how
to compute the mapping. With a little creativity, you can easily reduce those
to a set of bit shifts and ORs. By far the simplest formulation, however, is
as a mapping from the bits of the texture coordinates to bits in the address.
Suppose that :math:`(u, v)` is the location of a 1-byte element within a tile.
If you represent :math:`u` as :math:`u_n u_{n-1} \cdots u_2 u_1 u_0` where
:math:`u_0` is the LSB and :math:`u_n` is the MSB of :math:`u` and similarly
:math:`v = v_m v_{m-1} \cdots v_2 v_1 v_0`, then the bits of the address within
the tile are given by the table below:

=========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
Tiling                                      11          10          9           8           7           6           5           4           3           2           1           0
=========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
:cpp:enumerator:`isl_tiling::ISL_TILING_X`  :math:`v_2` :math:`v_1` :math:`v_0` :math:`u_8` :math:`u_7` :math:`u_6` :math:`u_5` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
:cpp:enumerator:`isl_tiling::ISL_TILING_Y0` :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
:cpp:enumerator:`isl_tiling::ISL_TILING_W`  :math:`u_5` :math:`u_4` :math:`u_3` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`u_2` :math:`v_1` :math:`u_1` :math:`v_0` :math:`u_0`
:cpp:enumerator:`isl_tiling::ISL_TILING_4`  :math:`v_4` :math:`v_3` :math:`u_6` :math:`v_2` :math:`u_5` :math:`u_4` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
=========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========

Constructing the mapping this way makes a lot of sense when you think about
hardware. It may seem complex on paper but "simple" things such as addition
are relatively expensive in hardware while interleaving bits in a well-defined
pattern is practically free. For a format that has more than one byte per
element, you simply chop bits off the bottom of the pattern, hard-code them to
0, and adjust bit indices as needed. For a 128-bit format, for instance, the
Y-tiled pattern becomes :math:`u_2 u_1 u_0 v_4 v_3 v_2 v_1 v_0`. The Sky Lake PRM
Vol. 5 in the section "2D Surfaces" contains an expanded version of the above
table (which we will not repeat here) that also includes the bit patterns for
the Ys and Yf tiling formats.
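
The bit patterns in the table above can be transcribed almost directly into
shifts and ORs. The following is a minimal sketch of that transcription for a
1-byte element, not code taken from ISL: the helper names (``bit``,
``xtile_offset``, ``ytile_offset``, ``wtile_offset``, ``tile4_offset``, and
``tiling_selfcheck``) are hypothetical and chosen only for illustration. Here
``u`` and ``v`` are the column and row of the element within a single tile, and
the returned value is the byte offset within that 4KB tile. Bit-6 swizzling is
not applied.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Take bit `i` of `x` and move it to bit position `pos` (hypothetical helper). */
   static inline uint32_t
   bit(uint32_t x, unsigned i, unsigned pos)
   {
      return ((x >> i) & 1u) << pos;
   }

   /* X-tile pattern: v2 v1 v0 u8 u7 u6 u5 u4 u3 u2 u1 u0 */
   static inline uint32_t
   xtile_offset(uint32_t u, uint32_t v)
   {
      return ((v & 0x7) << 9) | (u & 0x1ff);
   }

   /* Y-tile pattern: u6 u5 u4 v4 v3 v2 v1 v0 u3 u2 u1 u0 */
   static inline uint32_t
   ytile_offset(uint32_t u, uint32_t v)
   {
      return (((u >> 4) & 0x7) << 9) | ((v & 0x1f) << 4) | (u & 0xf);
   }

   /* W-tile pattern: u5 u4 u3 v5 v4 v3 v2 u2 v1 u1 v0 u0 */
   static inline uint32_t
   wtile_offset(uint32_t u, uint32_t v)
   {
      return bit(u, 5, 11) | bit(u, 4, 10) | bit(u, 3, 9) |
             bit(v, 5, 8)  | bit(v, 4, 7)  | bit(v, 3, 6) |
             bit(v, 2, 5)  | bit(u, 2, 4)  | bit(v, 1, 3) |
             bit(u, 1, 2)  | bit(v, 0, 1)  | bit(u, 0, 0);
   }

   /* Tile4 pattern: v4 v3 u6 v2 u5 u4 v1 v0 u3 u2 u1 u0 */
   static inline uint32_t
   tile4_offset(uint32_t u, uint32_t v)
   {
      return bit(v, 4, 11) | bit(v, 3, 10) | bit(u, 6, 9) |
             bit(v, 2, 8)  | bit(u, 5, 7)  | bit(u, 4, 6) |
             bit(v, 1, 5)  | bit(v, 0, 4)  | (u & 0xf);
   }

   /* Spot-check a few offsets against the cache line tables in this chapter. */
   static inline void
   tiling_selfcheck(void)
   {
      assert(xtile_offset(64, 1) == 0x240); /* X: second row, second cache line */
      assert(ytile_offset(16, 4) == 0x240); /* Y: second row, second column */
      assert(ytile_offset(5, 2) == 0x25);   /* Y: byte within a cache line */
      assert(wtile_offset(3, 1) == 0x07);   /* W: byte within a cache line */
      assert(tile4_offset(64, 4) == 0x300); /* tile4: second row, fifth column */
   }

Calling ``tiling_selfcheck()`` exercises each function once. For formats wider
than one byte, the same functions can be used if ``u`` is first converted to a
byte column by multiplying the element column by the element size, which is
equivalent to the bit-chopping described above.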