Single-sampled Color Compression
================================

Starting with Ivy Bridge, Intel graphics hardware provides a form of color
compression for single-sampled surfaces.  In its initial form, this provided an
acceleration of render target clear operations that, in the common case, allows
you to avoid almost all of the bandwidth of a full-surface clear operation.  On
Sky Lake, single-sampled color compression was extended to allow for the
compression of color values from actual rendering and not just the initial
clear.  From here on, the older Ivy Bridge form of color compression will be
called "fast-clears" and the term "color compression" will be reserved for the
more powerful Sky Lake form.

The documentation for Ivy Bridge through Broadwell overloads the term MCS to
refer both to the *multisample control surface* used for multisample
compression and to the control surface used for fast-clears. In ISL, the
:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_MCS` enum always refers to
multisample color compression while the
``ISL_AUX_USAGE_CCS_*`` enums always refer to
single-sampled color compression. Throughout this chapter and the rest of the
ISL documentation, we will use the term "color control surface", abbreviated
CCS, to denote the control surface used for both fast-clears and color
compression.  While this is still an overloaded term, Ivy Bridge fast-clears
are much closer to Sky Lake color compression than they are to multisample
compression.

CCS data
--------

Fast clears and CCS are possibly the single most poorly documented aspect of
surface layout/setup for Intel graphics hardware (with HiZ coming in a close
second). All the documentation really says is that you can use an MCS buffer on
single-sampled surfaces (we will call it the CCS in this case). It also
provides some documentation on how to program the hardware to perform clear
operations, but that's it.  How big is this buffer?  What does it contain?
Those questions are left as exercises to the reader. Almost everything we know
about the contents of the CCS is gleaned from reverse-engineering of the
hardware.  The best bit of documentation we have ever had comes from the
display section of the Sky Lake PRM Vol 12 section on planes (p. 159):

   The Color Control Surface (CCS) contains the compression status of the
   cache-line pairs. The compression state of the cache-line pair is
   specified by 2 bits in the CCS.  Each CCS cache-line represents an area
   on the main surface of 16x16 sets of 128 byte Y-tiled cache-line-pairs.
   CCS is always Y tiled.

While this is technically for color compression and not fast-clears, it
provides a good bit of insight into how color compression and fast-clears
operate.  Each cache-line pair in the main surface corresponds to 1 or 2 bits
in the CCS.  The primary difference, as far as the current discussion is
concerned, is that fast-clears use only 1 bit per cache-line pair whereas color
compression uses 2 bits.

What is a cache-line pair?  Both the X and Y tiling formats are arranged as an
8x8 grid of cache lines.  (See the :doc:`chapter on tiling <tiling>` for more
details.)  In either case, a cache-line pair is a pair of cache lines whose
starting addresses differ by 512 bytes or 8 cache lines.  This results in the
two cache lines being vertically adjacent when the main surface is X-tiled and
horizontally adjacent when the main surface is Y-tiled.  For an X-tiled surface
this forms an area of 64B x 2 rows and for a Y-tiled surface this forms an area
of 32B x 4 rows.  In either case, it is guaranteed that, regardless of surface
format, each 2x2 subspan coming out of a shader will land entirely within one
cache-line pair.
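
As a rough illustration of the geometry above, the following C sketch computes
the footprint of a cache-line pair in main-surface pixels.  The struct and
function names are hypothetical and not part of ISL, and the sketch assumes an
uncompressed format whose size is a power-of-two number of bits between 8 and
128.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Footprint of one cache-line pair, in main-surface pixels. */
   struct pair_extent {
      uint32_t width_px;
      uint32_t height_px;
   };

   /* Hypothetical helper: a pair is 64B x 2 rows when the main surface is
    * X-tiled and 32B x 4 rows when it is Y-tiled (128 bytes either way).
    */
   struct pair_extent
   cache_line_pair_extent(bool y_tiled, uint32_t bpp)
   {
      /* Assumption: bpp is a power of two in [8, 128]. */
      assert(bpp >= 8 && bpp <= 128 && (bpp & (bpp - 1)) == 0);

      const uint32_t bytes_per_px = bpp / 8;
      const uint32_t width_B = y_tiled ? 32 : 64;
      const uint32_t rows = y_tiled ? 4 : 2;

      return (struct pair_extent) {
         .width_px = width_B / bytes_per_px,
         .height_px = rows,
      };
   }

For a 32bpp format this gives 16x2 pixels when X-tiled and 8x4 pixels when
Y-tiled.  Even at 128bpp the footprint is at least 2 pixels in each direction,
which is consistent with a 2x2 subspan fitting inside one pair.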

What is the correspondence between bits and cache-line pairs?  The best model I
(Jason) know of is to consider the CCS as having a 1-bit color format for
fast-clears and a 2-bit format for color compression, together with a special
tiling format.  The CCS tiling formats operate on a 1 or 2-bit granularity
rather than the byte granularity of most tiling formats.

The following table represents the bit-layouts that yield the CCS tiling format
on different hardware generations.  Bits 0-11 correspond to the regular swizzle
of bytes within a 4KB page whereas the negative bits represent the address of
the particular 1 or 2-bit portion of a byte. (Note: The Haswell data was
gathered on a dual-channel system so bit-6 swizzling was enabled.  It's unclear
how this affects the CCS layout.)
77
78============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
79 Generation   Tiling       11          10               9                 8           7           6           5           4           3           2           1           0          -1          -2          -3
80============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
81 Ivy Bridge   X or Y  :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
82 Haswell        X     :math:`u_6` :math:`u_5` :math:`v_3 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0`
83 Haswell        Y     :math:`u_6` :math:`u_5` :math:`v_2 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0`
84 Broadwell      X     :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`u_3` :math:`v_3` :math:`u_2` :math:`u_1` :math:`u_0` :math:`v_2` :math:`v_1` :math:`v_0`
85 Broadwell      Y     :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_1` :math:`v_0` :math:`u_0`
86 Sky Lake       Y     :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_0` :math:`u_0`
87============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
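
One way to read the table is as a recipe for assembling a bit address from two
coordinates.  The C sketch below does this for the Ivy Bridge row only.  It
assumes that :math:`u` and :math:`v` are the horizontal and vertical
coordinates of a CCS element within one 4KB tile, and that address bit -3 is
the least significant bit of the resulting element index; neither assumption
is spelled out above.  The names are hypothetical and not taken from ISL.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* One source bit: which coordinate ('u' or 'v') and which bit of it. */
   struct ccs_bit_src {
      char coord;
      uint8_t bit;
   };

   /* Ivy Bridge row of the table, from address bit 11 down to bit -3. */
   static const struct ccs_bit_src ivb_ccs_layout[15] = {
      {'u', 6}, {'u', 5}, {'u', 4}, {'v', 7}, {'v', 6},
      {'v', 5}, {'v', 4}, {'v', 2}, {'v', 3}, {'v', 1},
      {'v', 0}, {'u', 3}, {'u', 2}, {'u', 1}, {'u', 0},
   };

   /* Returns the index of the 1-bit element within the 4KB tile.
    * Assumption: u has 7 significant bits and v has 8, as in the table.
    */
   uint32_t
   ivb_ccs_bit_index(uint32_t u, uint32_t v)
   {
      assert(u < 128 && v < 256);

      uint32_t index = 0;
      for (unsigned i = 0; i < 15; i++) {
         const struct ccs_bit_src src = ivb_ccs_layout[i];
         const uint32_t coord = src.coord == 'u' ? u : v;
         /* Shift in one bit at a time, most significant bit first. */
         index = (index << 1) | ((coord >> src.bit) & 1);
      }
      return index;
   }

The returned index counts 1-bit elements, so under the same assumptions
``index >> 3`` is the byte offset within the tile and ``index & 7`` selects the
bit within that byte.  The other rows could be handled by swapping in a
different table, with the Haswell XOR entries needing a small special case.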

CCS surface layout
------------------

Starting with Broadwell, fast-clears and color compression can be used on
mipmapped and array surfaces.  When considered from a higher level, the CCS is
laid out like any other surface.  The Broadwell and Sky Lake PRMs describe
this as follows:

Broadwell PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 676):

   Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
   these alignments in the RT space: Horizontal Alignment = 256 and Vertical
   Alignment = 128.

Broadwell PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 279):

   For non-multisampled render target's auxiliary surface, MCS, QPitch must be
   computed with Horizontal Alignment = 256 and Surface Vertical Alignment =
   128. These alignments are only for MCS buffer and not for associated render
   target.

Sky Lake PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 632):

   Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
   these alignments in the RT space: Horizontal Alignment = 128 and Vertical
   Alignment = 64.

Sky Lake PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 435):

   For non-multisampled render target's CCS auxiliary surface, QPitch must be
   computed with Horizontal Alignment = 128 and Surface Vertical Alignment
   = 256. These alignments are only for CCS buffer and not for associated
   render target.

Empirical evidence seems to confirm this.  On Sky Lake, the vertical alignment
is always one cache line.  The horizontal alignment, however, varies by main
surface format: 1 cache line for 32bpp, 2 for 64bpp and 4 cache lines for
128bpp formats.  This nicely corresponds to the alignment of 128x64 pixels in
the primary color surface.  The second PRM citation about Sky Lake CCS above
gives a vertical alignment of 256 rather than 64.  A little experimentation
suggests that this additional alignment applies only to QPitch and not to the
miplevels within a slice.
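
The horizontal alignment figures can be cross-checked against the PRM's
description of a CCS cache line as covering 16x16 cache-line pairs.  The
following C sketch does that arithmetic for Y-tiled surfaces, using the 32B x
4 row pair footprint described earlier.  The function names are hypothetical,
and this is only a consistency check, not how ISL computes alignments.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Main-surface width, in pixels, covered by one CCS cache line:
    * 16 Y-tiled cache-line pairs side by side, each 32 bytes wide.
    */
   uint32_t
   skl_ccs_cache_line_width_px(uint32_t bpp)
   {
      assert(bpp == 32 || bpp == 64 || bpp == 128);
      return (16 * 32) / (bpp / 8);
   }

   /* Main-surface height, in rows, covered by one CCS cache line:
    * 16 Y-tiled cache-line pairs stacked, each 4 rows tall.
    */
   uint32_t
   skl_ccs_cache_line_height_px(void)
   {
      return 16 * 4;
   }

   /* Number of CCS cache lines needed to span 128 main-surface pixels. */
   uint32_t
   skl_ccs_halign_cache_lines(uint32_t bpp)
   {
      return 128 / skl_ccs_cache_line_width_px(bpp);
   }

Under these assumptions one CCS cache line covers 128x64, 64x64 and 32x64
pixels for 32bpp, 64bpp and 128bpp formats respectively, so spanning 128
pixels takes 1, 2 and 4 cache lines, matching the observed alignments.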

On Broadwell, each miplevel in the CCS is aligned to a cache-line pair
boundary: horizontal when the primary surface is X-tiled and vertical when
Y-tiled. For a 32bpp format, this works out to an alignment of 256x128 main
surface pixels regardless of X or Y tiling.  On Sky Lake, the alignment is
a single cache line, which works out to an alignment of 128x64 main surface
pixels.

TODO: More than just 32bpp formats on Broadwell!

Once armed with the above alignment information, we can lay out the CCS surface
itself.  The way ISL does CCS layout calculations is by a very careful and
subtle application of its normal surface layout code.

Above, we described the CCS data layout as a mapping of address bits. In
ISL, this is represented by :cpp:enumerator:`isl_tiling::ISL_TILING_CCS`.  The
logical and physical tile dimensions correspond to the above mapping.

We also have special :cpp:enum:`isl_format` enums for CCS.  These formats are 1
bit-per-pixel on Ivy Bridge through Broadwell and 2 bits-per-pixel on Sky Lake
and above to correspond to the 1 and 2-bit values represented in the CCS data.
They have a block size (similar to a block compressed format such as BC or
ASTC) which says what area (in surface elements) in the main surface is covered
by a single CCS element (1 or 2-bit).  Because this depends on the main surface
tiling and format, we have several different CCS formats.
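
To make the block-size idea concrete, here is a minimal C sketch that converts
main-surface dimensions into CCS elements and a lower bound on the CCS data
size.  The names are hypothetical, the block dimensions are passed in by the
caller rather than looked up from an ISL format, and the result ignores
miplevel alignment and tile padding, so it is not the size ISL would actually
allocate.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Size of a CCS in elements (1 or 2 bits each). */
   struct ccs_extent {
      uint32_t width_el;
      uint32_t height_el;
   };

   /* Number of CCS elements needed to cover the main surface, rounding
    * partially covered blocks up.
    */
   struct ccs_extent
   ccs_extent_in_elements(uint32_t width_px, uint32_t height_px,
                          uint32_t block_w, uint32_t block_h)
   {
      assert(block_w > 0 && block_h > 0);
      return (struct ccs_extent) {
         .width_el = (width_px + block_w - 1) / block_w,
         .height_el = (height_px + block_h - 1) / block_h,
      };
   }

   /* Lower bound on the CCS data size, before any alignment or padding. */
   uint64_t
   ccs_min_size_bytes(struct ccs_extent extent, uint32_t bits_per_el)
   {
      assert(bits_per_el == 1 || bits_per_el == 2);
      const uint64_t bits =
         (uint64_t)extent.width_el * extent.height_el * bits_per_el;
      return (bits + 7) / 8;
   }

For example, a 1920x1080 surface with an 8x4 block (the Y-tiled 32bpp
cache-line pair described earlier) needs 240x270 elements, or at least 16200
bytes at 2 bits per element.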

Once the appropriate :cpp:enum:`isl_format` has been selected, computing the
size and layout of a CCS surface is as simple as passing the same surface
creation parameters to :cpp:func:`isl_surf_init_s` as were used to create the
primary surface, but with :cpp:enumerator:`isl_tiling::ISL_TILING_CCS` and the
correct CCS format.  This not only results in a correctly sized surface but
also means that most other ISL helpers for things such as computing offsets
into surfaces work correctly as well.

CCS on Tigerlake and above
--------------------------

Starting with Tigerlake, CCS is no longer done via a surface and, instead, the
term CCS gets overloaded once again (gotta love it!) to now refer to a form of
universal compression which can be applied to almost any surface.  Nothing in
this chapter applies to any hardware with a graphics IP version 12 or above.