1r600-sb
2=======
3
4* * * * *
5
6Debugging
7---------
8
9### Environment variables
10
11-   **R600\_DEBUG**
12
13    There are new flags:
14
15    -   **nosb** - Disable sb backend for graphics shaders
16    -   **sbdry** - Dry run, optimize but use source bytecode -
17        useful if you only want to check shader dumps
18        without the risk of lockups and other problems
19    -   **sbstat** - Print optimization statistics (only time so far)
20    -   **sbdump** - Print IR after some passes.
21    -   **sbnofallback** - Abort on errors instead of fallback
22    -   **sbdisasm** - Use sb disassembler for shader dumps
23    -   **sbsafemath** - Disable unsafe math optimizations
24
25### Regression debugging
26
27If there are any regressions as compared to the default backend
28(R600\_SB=0), it's possible to use the following environment variables
29to find the incorrectly optimized shader that causes the regression.
30
31-   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
32    shaders
33    -   0 - disabled (default)
34    -   1 - skip optimization for the shaders in the range
35        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
36        optimize only the shaders that are not in this range
37    -   2 - optimize only the shaders in the range
38        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
39
40-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
41
42-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
43
44Example - optimize only the shaders 5, 6, and 7:
45
46    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
47
48All shaders compiled by the application are numbered starting from 1,
49the number of shaders used by the application may be obtained by running
50it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
51for each compiled shader.
52
53After figuring out the total number of shaders used by the application,
54the variables above allow to use bisection to find the shader that is
55the cause of regression. E.g. if the application uses 100 shaders, we
56can divide the range [1; 100] and run the application with the
57optimization enabled only for the first half of the shaders:
58
59    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
60
61If the regression is reproduced with these parameters, then the failing
62shader is in the range [1; 50], if it's not reproduced - then it's in
63the range [51; 100]. Then we can divide the new range again and repeat
64the testing, until we'll reduce the range to a single failing shader.
65
66*NOTE: This method relies on the assumption that the application
67produces the same sequence of the shaders on each run. It's not always
68true - some applications may produce different sequences of the shaders,
69in such cases the tools like apitrace may be used to record the trace
70with the application, then this method may be applied when replaying the
71trace - also this may be faster and/or more convenient than testing the
72application itself.*
73
74* * * * *
75
76Intermediate Representation
77---------------------------
78
79### Values
80
81All kinds of the operands (literal constants, references to kcache
82constants, references to GPRs, etc) are currently represented by the
83**value** class (possibly it makes sense to switch to hierarchy of
84classes derived from **value** instead, to save some memory).
85
86All values (except some pseudo values like the exec\_mask or predicate
87register) represent 32bit scalar values - there are no vector values,
88CF/FETCH instructions use groups of 4 values for src and dst operands.
89
90### Nodes
91
92Shader programs are represented using the tree data structure, some
93nodes contain a list of subnodes.
94
95#### Control flow nodes
96
97Control flow information is represented using four special node types
98(based on the ideas from [[1]](#references) )
99
100-   **region\_node** - single-entry, single-exit region.
101
102    All loops and if's in the program are enclosed in region nodes.
103    Region nodes have two containers for phi nodes -
104    region\_node::loop\_phi contains the phi expressions to be executed
105    at the region entry, region\_node::phi contains the phi expressions
106    to be executed at the region exit. It's the only type of the node
107    that contains associated phi expressions.
108
109-   **depart\_node** - "depart region \$id after { ... }"
110
111    Depart target region (jump to exit point) after executing contained
112    code.
113
114-   **repeat\_node** - "repeat region \$id after { ... }"
115
116    Repeat target region (jump to entry point) after executing contained
117    code.
118
119-   **if\_node** - "if (cond) { ... }"
120
121    Execute contained code if condition is true. The difference from
122    [[1]](#references) is that we don't have associated phi expressions
123    for the **if\_node**, we enclose **if\_node** in the
124    **region\_node** and store corresponding phi's in the
125    **region\_node**, this allows more uniform handling.
126
127The target region of depart and repeat nodes is always the region where
128they are located (possibly in the nested region), there are no arbitrary
129jumps/goto's - control flow in the program is always structured.
130
131Typical control flow constructs can be represented as in the following
132examples:
133
134GLSL:
135
136    if (cond) {
137        < 1 >
138    } else {
139        < 2 >
140    }
141
142IR:
143
144    region #0 {
145        depart region #0 after {
146            if (cond) {
147                depart region #0 after {
148                    < 1 >
149                }
150            }
151            < 2 >
152        }
153        <region #0 phi nodes >
154    }
155
156GLSL:
157
158    while (cond) {
159        < 1 >
160    }
161
162IR:
163
164    region #0 {
165        <region #0 loop_phi nodes>
166        repeat region #0 after {
167            region #1 {
168                depart region #1 after {
169                    if (!cond) {
170                        depart region #0
171                    }
172                }
173            }
174            < 1 >
175        }
176        <region #0 phi nodes>
177    }
178
179'Break' and 'continue' inside the loops are directly translated to the
180depart and repeat nodes for the corresponding loop region.
181
182This may look a bit too complicated, but in fact this allows more simple
183and uniform handling of the control flow.
184
185All loop\_phi and phi nodes for some region always have the same number
186of source operands. The number of source operands for
187region\_node::loop\_phi nodes is 1 + number of repeat nodes that
188reference this region as a target. The number of source operands for
189region\_node::phi nodes is equal to the number of depart nodes that
190reference this region as a target. All depart/repeat nodes for the
191region have unique indices equal to the index of source operand for
192phi/loop\_phi nodes.
193
194First source operand for region\_node::loop\_phi nodes (src[0]) is an
195incoming value that enters the region from the outside. Each remaining
196source operand comes from the corresponding repeat node.
197
198More complex example:
199
200GLSL:
201
202    a = 1;
203    while (a < 5) {
204        a = a * 2;
205        if (b == 3) {
206            continue;
207        } else {
208            a = 6;
209        }
210        if (c == 4)
211            break;
212        a = a + 1;
213    }
214
215IR with SSA form:
216
217    a.1 = 1;
218    region #0 {
219        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
220        region#0 loop_phi: a.2 = phi a.1, a.6, a.3
221
222        repeat_1 region #0 after {
223            a.3 = a.2 * 2;
224            cond1 = (b == 3);
225            region #1 {
226                depart_0 region #1 after {
227                    if (cond1) {
228                        repeat_2 region #0;
229                    }
230                }
231                a.4 = 6;
232
233                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
234            }
235            cond2 = (c == 4);
236            region #2 {
237                depart_0 region #2 after {
238                    if (cond2) {
239                        depart_0 region #0;
240                    }
241                }
242            }
243            a.6 = a.5 + 1;
244        }
245
246        region #0 phi: a.7 = phi a.5 // src[0] from depart_0
247    }
248
249Phi nodes with single source operand are just copies, they are not
250really necessary, but this allows to handle all **depart\_node**s in the
251uniform way.
252
253#### Instruction nodes
254
255Instruction nodes represent different kinds of instructions -
256**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
257the "bc" structure where all fields of the bytecode are stored (the type
258is **bc\_alu** for **alu\_node**, etc). The operands are represented
259using the vectors of pointers to **value** class (node::src, node::dst)
260
261#### SSA-specific nodes
262
263Phi nodes currently don't have special node class, they are stored as
264**node**. Destination vector contains a single destination value, source
265vector contains 1 or more source values.
266
267Psi nodes [[5], [6]](#references) also don't have a special node class
268and stored as **node**. Source vector contains 3 values for each source
269operand - the **value** of predicate, **value** of corresponding
270PRED\_SEL field, and the source **value** itself.
271
272### Indirect addressing
273
274Special kind of values (VLK\_RELREG) is used to represent indirect
275operands. These values don't have SSA versions. The representation is
276mostly based on the [[2]](#references). Indirect operand contains the
277"offset/address" value (value::rel), (e.g. some SSA version of the AR
278register value, though after some passes it may be any value - constant,
279register, etc), also it contains the maydef and mayuse vectors of
280pointers to **value**s (similar to dst/src vectors in the **node**) to
281represent the effects of aliasing in the SSA form.
282
283E.g. if we have the array R5.x ... R8.x and the following instruction :
284
285    MOV R0.x, R[5 + AR].x
286
287then source indirect operand is represented with the VLK\_RELREG value,
288value::rel is AR, value::maydef is empty (in fact it always contain the
289same number of elements as mayuse to simplify the handling, but they are
290NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
291corresponding SSA versions after ssa\_rename).
292
293Additional "virtual variables" as in [HSSA [2]](#references) are not
294used, also there is no special handling for "zero versions". Typical
295programs in our case are small, indirect addressing is rare, array sizes
296are limited by max gpr number, so we don't really need to use special
297tricks to avoid the explosion of value versions. Also this allows more
298precise liveness computation for array elements without modifications to
299the algorithms.
300
301With the following instruction:
302
303    MOV R[5+AR].x, R0.x
304
305we'll have both maydef and mayuse vectors for dst operand filled with
306array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
307pass mayuse will contain previous versions, maydef will contain new
308potentially-defined versions.
309
310* * * * *
311
312Passes
313------
314
315-   **bc\_parser** - creates the IR from the source bytecode,
316    initializes src and dst value vectors for instruction nodes. Most
317    ALU nodes have one dst operand and the number of source operands is
318    equal to the number of source operands for the ISA instruction.
319    Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
320    gpr as in the original instruction, other two are pseudo-operands
321    that represent possibly updated predicate and exec\_mask. Predicate
322    values are used in the predicated alu instructions (node::pred),
323    exec\_mask values are used in the if\_nodes (if\_node::cond). Each
324    vector operand in the CF/TEX/VTX instructions is represented with 4
325    values - components of the vector.
326
327-   **ssa\_prepare** - creates phi expressions.
328
329-   **ssa\_rename** - renames the values (assigns versions).
330
331-   **liveness** - liveness computation, sets 'dead' flag for unused
332    nodes and values, optionally computes interference information for
333    the values.
334
335-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
336    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
337    instructions in the source, containers for ALU groups (they were
338    only needed for the ssa\_rename pass)
339
340-   **if\_conversion** - converts control flow with if\_nodes to the
341    data flow in cases where it can improve performance (small alu-only
342    branches). Both branches are executed speculatively and the phi
343    expressions are replaced with conditional moves (CNDxx) to select
344    the final value using the same condition predicate as was used by
345    the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
346    instruction, CNDxx now uses dst[0] from the same PREDSETxx
347    instruction.
348
349-   **peephole** - peephole optimizations
350
351-   **gvn** - Global Value Numbering [[2]](#references),
352    [[3]](#references)
353
354-   **gcm** - Global Code Motion [[3]](#references). Also performs
355    grouping of the instructions of the same kind (CF/FETCH/ALU).
356
357-   register allocation passes, some ideas are used from
358    [[4]](#references), but implementation is simplified to make it more
359    efficient in terms of the compilation speed (e.g. no recursive
360    recoloring) while achieving good enough results.
361
362    -   **ra\_split** - prepares the program to register allocation.
363        Splits live ranges for constrained values by inserting the
364        copies to/from temporary values, so that the live range of the
365        constrained values becomes minimal.
366
367    -   **ra\_coalesce** - performs global allocation on registers used
368        in CF/FETCH instructions. It's performed first to make sure they
369        end up in the same GPR. Also tries to allocate all values
370        involved in copies (inserted by the ra\_split pass) to the same
371        register, so that the copies may be eliminated.
372
373    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
374        used), and remaining values.
375
376-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
377    performs the final register allocation for local values inside ALU
378    clauses. Eliminates all coalesced copies (if src and dst of the copy
379    are allocated to the same register).
380
381-   **ra\_checker** - optional debugging pass that tries to catch basic
382    errors of the scheduler or regalloc,
383
384-   **bc\_finalize** - propagates the regalloc information from values
385    in node::src and node::dst vectors to the bytecode fields, converts
386    control flow structure (region/depart/repeat) to the target
387    instructions (JUMP/ELSE/POP,
388    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
389
390-   **bc\_builder** - builds final bytecode,
391
392* * * * *
393
394References
395----------
396
397[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
398McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
399
400[2] ["Effective Representation of Aliases and Indirect Memory Operations
401in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
402Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
403
404[3] ["Global Code Motion. Global Value Numbering.", Cliff
405Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
406
407[4] ["Register Allocation for Programs in SSA Form", Sebastian
408Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
409
410[5] ["An extension to the SSA representation for predicated code",
411Francois de
412Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
413
414[6] ["Improvements to the Psi-SSA Representation", F. de
415Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
416