1r600-sb
2=======
3
4* * * * *
5
6Debugging
7---------
8
9### Environment variables
10
11-   **R600\_DEBUG**
12
13    There are new flags:
14
15    -   **nosb** - Disable sb backend for graphics shaders
16    -   **sbcl** - Enable optimization of compute shaders (experimental)
17    -   **sbdry** - Dry run, optimize but use source bytecode -
18        useful if you only want to check shader dumps
19        without the risk of lockups and other problems
20    -   **sbstat** - Print optimization statistics (only time so far)
21    -   **sbdump** - Print IR after some passes.
22    -   **sbnofallback** - Abort on errors instead of fallback
23    -   **sbdisasm** - Use sb disassembler for shader dumps
24    -   **sbsafemath** - Disable unsafe math optimizations
25
26### Regression debugging
27
28If there are any regressions as compared to the default backend
29(R600\_SB=0), it's possible to use the following environment variables
30to find the incorrectly optimized shader that causes the regression.
31
32-   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
33    shaders
34    -   0 - disabled (default)
35    -   1 - skip optimization for the shaders in the range
36        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
37        optimize only the shaders that are not in this range
38    -   2 - optimize only the shaders in the range
39        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
40
41-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
42
43-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
44
45Example - optimize only the shaders 5, 6, and 7:
46
47    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
48
49All shaders compiled by the application are numbered starting from 1,
50the number of shaders used by the application may be obtained by running
51it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
52for each compiled shader.
53
54After figuring out the total number of shaders used by the application,
55the variables above allow to use bisection to find the shader that is
56the cause of regression. E.g. if the application uses 100 shaders, we
57can divide the range [1; 100] and run the application with the
58optimization enabled only for the first half of the shaders:
59
60    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
61
62If the regression is reproduced with these parameters, then the failing
63shader is in the range [1; 50], if it's not reproduced - then it's in
64the range [51; 100]. Then we can divide the new range again and repeat
65the testing, until we'll reduce the range to a single failing shader.
66
67*NOTE: This method relies on the assumption that the application
68produces the same sequence of the shaders on each run. It's not always
69true - some applications may produce different sequences of the shaders,
70in such cases the tools like apitrace may be used to record the trace
71with the application, then this method may be applied when replaying the
72trace - also this may be faster and/or more convenient than testing the
73application itself.*
74
75* * * * *
76
77Intermediate Representation
78---------------------------
79
80### Values
81
82All kinds of the operands (literal constants, references to kcache
83constants, references to GPRs, etc) are currently represented by the
84**value** class (possibly it makes sense to switch to hierarchy of
85classes derived from **value** instead, to save some memory).
86
87All values (except some pseudo values like the exec\_mask or predicate
88register) represent 32bit scalar values - there are no vector values,
89CF/FETCH instructions use groups of 4 values for src and dst operands.
90
91### Nodes
92
93Shader programs are represented using the tree data structure, some
94nodes contain a list of subnodes.
95
96#### Control flow nodes
97
98Control flow information is represented using four special node types
99(based on the ideas from [[1]](#references) )
100
101-   **region\_node** - single-entry, single-exit region.
102
103    All loops and if's in the program are enclosed in region nodes.
104    Region nodes have two containers for phi nodes -
105    region\_node::loop\_phi contains the phi expressions to be executed
106    at the region entry, region\_node::phi contains the phi expressions
107    to be executed at the region exit. It's the only type of the node
108    that contains associated phi expressions.
109
110-   **depart\_node** - "depart region \$id after { ... }"
111
112    Depart target region (jump to exit point) after executing contained
113    code.
114
115-   **repeat\_node** - "repeat region \$id after { ... }"
116
117    Repeat target region (jump to entry point) after executing contained
118    code.
119
120-   **if\_node** - "if (cond) { ... }"
121
122    Execute contained code if condition is true. The difference from
123    [[1]](#references) is that we don't have associated phi expressions
124    for the **if\_node**, we enclose **if\_node** in the
125    **region\_node** and store corresponding phi's in the
126    **region\_node**, this allows more uniform handling.
127
128The target region of depart and repeat nodes is always the region where
129they are located (possibly in the nested region), there are no arbitrary
130jumps/goto's - control flow in the program is always structured.
131
132Typical control flow constructs can be represented as in the following
133examples:
134
135GLSL:
136
137    if (cond) {
138        < 1 >
139    } else {
140        < 2 >
141    }
142
143IR:
144
145    region #0 {
146        depart region #0 after {
147            if (cond) {
148                depart region #0 after {
149                    < 1 >
150                }
151            }
152            < 2 >
153        }
154        <region #0 phi nodes >
155    }
156
157GLSL:
158
159    while (cond) {
160        < 1 >
161    }
162
163IR:
164
165    region #0 {
166        <region #0 loop_phi nodes>
167        repeat region #0 after {
168            region #1 {
169                depart region #1 after {
170                    if (!cond) {
171                        depart region #0
172                    }
173                }
174            }
175            < 1 >
176        }
177        <region #0 phi nodes>
178    }
179
180'Break' and 'continue' inside the loops are directly translated to the
181depart and repeat nodes for the corresponding loop region.
182
183This may look a bit too complicated, but in fact this allows more simple
184and uniform handling of the control flow.
185
186All loop\_phi and phi nodes for some region always have the same number
187of source operands. The number of source operands for
188region\_node::loop\_phi nodes is 1 + number of repeat nodes that
189reference this region as a target. The number of source operands for
190region\_node::phi nodes is equal to the number of depart nodes that
191reference this region as a target. All depart/repeat nodes for the
192region have unique indices equal to the index of source operand for
193phi/loop\_phi nodes.
194
195First source operand for region\_node::loop\_phi nodes (src[0]) is an
196incoming value that enters the region from the outside. Each remaining
197source operand comes from the corresponding repeat node.
198
199More complex example:
200
201GLSL:
202
203    a = 1;
204    while (a < 5) {
205        a = a * 2;
206        if (b == 3) {
207            continue;
208        } else {
209            a = 6;
210        }
211        if (c == 4)
212            break;
213        a = a + 1;
214    }
215
216IR with SSA form:
217
218    a.1 = 1;
219    region #0 {
220        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
221        region#0 loop_phi: a.2 = phi a.1, a.6, a.3
222
223        repeat_1 region #0 after {
224            a.3 = a.2 * 2;
225            cond1 = (b == 3);
226            region #1 {
227                depart_0 region #1 after {
228                    if (cond1) {
229                        repeat_2 region #0;
230                    }
231                }
232                a.4 = 6;
233
234                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
235            }
236            cond2 = (c == 4);
237            region #2 {
238                depart_0 region #2 after {
239                    if (cond2) {
240                        depart_0 region #0;
241                    }
242                }
243            }
244            a.6 = a.5 + 1;
245        }
246
247        region #0 phi: a.7 = phi a.5 // src[0] from depart_0
248    }
249
250Phi nodes with single source operand are just copies, they are not
251really necessary, but this allows to handle all **depart\_node**s in the
252uniform way.
253
254#### Instruction nodes
255
256Instruction nodes represent different kinds of instructions -
257**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
258the "bc" structure where all fields of the bytecode are stored (the type
259is **bc\_alu** for **alu\_node**, etc). The operands are represented
260using the vectors of pointers to **value** class (node::src, node::dst)
261
262#### SSA-specific nodes
263
264Phi nodes currently don't have special node class, they are stored as
265**node**. Destination vector contains a single destination value, source
266vector contains 1 or more source values.
267
268Psi nodes [[5], [6]](#references) also don't have a special node class
269and stored as **node**. Source vector contains 3 values for each source
270operand - the **value** of predicate, **value** of corresponding
271PRED\_SEL field, and the source **value** itself.
272
273### Indirect addressing
274
275Special kind of values (VLK\_RELREG) is used to represent indirect
276operands. These values don't have SSA versions. The representation is
277mostly based on the [[2]](#references). Indirect operand contains the
278"offset/address" value (value::rel), (e.g. some SSA version of the AR
279register value, though after some passes it may be any value - constant,
280register, etc), also it contains the maydef and mayuse vectors of
281pointers to **value**s (similar to dst/src vectors in the **node**) to
282represent the effects of aliasing in the SSA form.
283
284E.g. if we have the array R5.x ... R8.x and the following instruction :
285
286    MOV R0.x, R[5 + AR].x
287
288then source indirect operand is represented with the VLK\_RELREG value,
289value::rel is AR, value::maydef is empty (in fact it always contain the
290same number of elements as mayuse to simplify the handling, but they are
291NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
292corresponding SSA versions after ssa\_rename).
293
294Additional "virtual variables" as in [HSSA [2]](#references) are not
295used, also there is no special handling for "zero versions". Typical
296programs in our case are small, indirect addressing is rare, array sizes
297are limited by max gpr number, so we don't really need to use special
298tricks to avoid the explosion of value versions. Also this allows more
299precise liveness computation for array elements without modifications to
300the algorithms.
301
302With the following instruction:
303
304    MOV R[5+AR].x, R0.x
305
306we'll have both maydef and mayuse vectors for dst operand filled with
307array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
308pass mayuse will contain previous versions, maydef will contain new
309potentially-defined versions.
310
311* * * * *
312
313Passes
314------
315
316-   **bc\_parser** - creates the IR from the source bytecode,
317    initializes src and dst value vectors for instruction nodes. Most
318    ALU nodes have one dst operand and the number of source operands is
319    equal to the number of source operands for the ISA instruction.
320    Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
321    gpr as in the original instruction, other two are pseudo-operands
322    that represent possibly updated predicate and exec\_mask. Predicate
323    values are used in the predicated alu instructions (node::pred),
324    exec\_mask values are used in the if\_nodes (if\_node::cond). Each
325    vector operand in the CF/TEX/VTX instructions is represented with 4
326    values - components of the vector.
327
328-   **ssa\_prepare** - creates phi expressions.
329
330-   **ssa\_rename** - renames the values (assigns versions).
331
332-   **liveness** - liveness computation, sets 'dead' flag for unused
333    nodes and values, optionally computes interference information for
334    the values.
335
336-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
337    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
338    instructions in the source, containers for ALU groups (they were
339    only needed for the ssa\_rename pass)
340
341-   **if\_conversion** - converts control flow with if\_nodes to the
342    data flow in cases where it can improve performance (small alu-only
343    branches). Both branches are executed speculatively and the phi
344    expressions are replaced with conditional moves (CNDxx) to select
345    the final value using the same condition predicate as was used by
346    the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
347    instruction, CNDxx now uses dst[0] from the same PREDSETxx
348    instruction.
349
350-   **peephole** - peephole optimizations
351
352-   **gvn** - Global Value Numbering [[2]](#references),
353    [[3]](#references)
354
355-   **gcm** - Global Code Motion [[3]](#references). Also performs
356    grouping of the instructions of the same kind (CF/FETCH/ALU).
357
358-   register allocation passes, some ideas are used from
359    [[4]](#references), but implementation is simplified to make it more
360    efficient in terms of the compilation speed (e.g. no recursive
361    recoloring) while achieving good enough results.
362
363    -   **ra\_split** - prepares the program to register allocation.
364        Splits live ranges for constrained values by inserting the
365        copies to/from temporary values, so that the live range of the
366        constrained values becomes minimal.
367
368    -   **ra\_coalesce** - performs global allocation on registers used
369        in CF/FETCH instructions. It's performed first to make sure they
370        end up in the same GPR. Also tries to allocate all values
371        involved in copies (inserted by the ra\_split pass) to the same
372        register, so that the copies may be eliminated.
373
374    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
375        used), and remaining values.
376
377-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
378    performs the final register allocation for local values inside ALU
379    clauses. Eliminates all coalesced copies (if src and dst of the copy
380    are allocated to the same register).
381
382-   **ra\_checker** - optional debugging pass that tries to catch basic
383    errors of the scheduler or regalloc,
384
385-   **bc\_finalize** - propagates the regalloc information from values
386    in node::src and node::dst vectors to the bytecode fields, converts
387    control flow structure (region/depart/repeat) to the target
388    instructions (JUMP/ELSE/POP,
389    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
390
391-   **bc\_builder** - builds final bytecode,
392
393* * * * *
394
395References
396----------
397
398[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
399McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
400
401[2] ["Effective Representation of Aliases and Indirect Memory Operations
402in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
403Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
404
405[3] ["Global Code Motion. Global Value Numbering.", Cliff
406Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
407
408[4] ["Register Allocation for Programs in SSA Form", Sebastian
409Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
410
411[5] ["An extension to the SSA representation for predicated code",
412Francois de
413Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
414
415[6] ["Improvements to the Psi-SSA Representation", F. de
416Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
417