1r600-sb 2======= 3 4* * * * * 5 6Debugging 7--------- 8 9### Environment variables 10 11- **R600\_DEBUG** 12 13 There are new flags: 14 15 - **nosb** - Disable sb backend for graphics shaders 16 - **sbdry** - Dry run, optimize but use source bytecode - 17 useful if you only want to check shader dumps 18 without the risk of lockups and other problems 19 - **sbstat** - Print optimization statistics (only time so far) 20 - **sbdump** - Print IR after some passes. 21 - **sbnofallback** - Abort on errors instead of fallback 22 - **sbdisasm** - Use sb disassembler for shader dumps 23 - **sbsafemath** - Disable unsafe math optimizations 24 25### Regression debugging 26 27If there are any regressions as compared to the default backend 28(R600\_SB=0), it's possible to use the following environment variables 29to find the incorrectly optimized shader that causes the regression. 30 31- **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some 32 shaders 33 - 0 - disabled (default) 34 - 1 - skip optimization for the shaders in the range 35 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is, 36 optimize only the shaders that are not in this range 37 - 2 - optimize only the shaders in the range 38 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END] 39 40- **R600\_SB\_DSKIP\_START** - start of the range (1-based) 41 42- **R600\_SB\_DSKIP\_END** - end of the range (1-based) 43 44Example - optimize only the shaders 5, 6, and 7: 45 46 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2 47 48All shaders compiled by the application are numbered starting from 1, 49the number of shaders used by the application may be obtained by running 50it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#" 51for each compiled shader. 52 53After figuring out the total number of shaders used by the application, 54the variables above allow to use bisection to find the shader that is 55the cause of regression. E.g. if the application uses 100 shaders, we 56can divide the range [1; 100] and run the application with the 57optimization enabled only for the first half of the shaders: 58 59 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app> 60 61If the regression is reproduced with these parameters, then the failing 62shader is in the range [1; 50], if it's not reproduced - then it's in 63the range [51; 100]. Then we can divide the new range again and repeat 64the testing, until we'll reduce the range to a single failing shader. 65 66*NOTE: This method relies on the assumption that the application 67produces the same sequence of the shaders on each run. It's not always 68true - some applications may produce different sequences of the shaders, 69in such cases the tools like apitrace may be used to record the trace 70with the application, then this method may be applied when replaying the 71trace - also this may be faster and/or more convenient than testing the 72application itself.* 73 74* * * * * 75 76Intermediate Representation 77--------------------------- 78 79### Values 80 81All kinds of the operands (literal constants, references to kcache 82constants, references to GPRs, etc) are currently represented by the 83**value** class (possibly it makes sense to switch to hierarchy of 84classes derived from **value** instead, to save some memory). 85 86All values (except some pseudo values like the exec\_mask or predicate 87register) represent 32bit scalar values - there are no vector values, 88CF/FETCH instructions use groups of 4 values for src and dst operands. 89 90### Nodes 91 92Shader programs are represented using the tree data structure, some 93nodes contain a list of subnodes. 94 95#### Control flow nodes 96 97Control flow information is represented using four special node types 98(based on the ideas from [[1]](#references) ) 99 100- **region\_node** - single-entry, single-exit region. 101 102 All loops and if's in the program are enclosed in region nodes. 103 Region nodes have two containers for phi nodes - 104 region\_node::loop\_phi contains the phi expressions to be executed 105 at the region entry, region\_node::phi contains the phi expressions 106 to be executed at the region exit. It's the only type of the node 107 that contains associated phi expressions. 108 109- **depart\_node** - "depart region \$id after { ... }" 110 111 Depart target region (jump to exit point) after executing contained 112 code. 113 114- **repeat\_node** - "repeat region \$id after { ... }" 115 116 Repeat target region (jump to entry point) after executing contained 117 code. 118 119- **if\_node** - "if (cond) { ... }" 120 121 Execute contained code if condition is true. The difference from 122 [[1]](#references) is that we don't have associated phi expressions 123 for the **if\_node**, we enclose **if\_node** in the 124 **region\_node** and store corresponding phi's in the 125 **region\_node**, this allows more uniform handling. 126 127The target region of depart and repeat nodes is always the region where 128they are located (possibly in the nested region), there are no arbitrary 129jumps/goto's - control flow in the program is always structured. 130 131Typical control flow constructs can be represented as in the following 132examples: 133 134GLSL: 135 136 if (cond) { 137 < 1 > 138 } else { 139 < 2 > 140 } 141 142IR: 143 144 region #0 { 145 depart region #0 after { 146 if (cond) { 147 depart region #0 after { 148 < 1 > 149 } 150 } 151 < 2 > 152 } 153 <region #0 phi nodes > 154 } 155 156GLSL: 157 158 while (cond) { 159 < 1 > 160 } 161 162IR: 163 164 region #0 { 165 <region #0 loop_phi nodes> 166 repeat region #0 after { 167 region #1 { 168 depart region #1 after { 169 if (!cond) { 170 depart region #0 171 } 172 } 173 } 174 < 1 > 175 } 176 <region #0 phi nodes> 177 } 178 179'Break' and 'continue' inside the loops are directly translated to the 180depart and repeat nodes for the corresponding loop region. 181 182This may look a bit too complicated, but in fact this allows more simple 183and uniform handling of the control flow. 184 185All loop\_phi and phi nodes for some region always have the same number 186of source operands. The number of source operands for 187region\_node::loop\_phi nodes is 1 + number of repeat nodes that 188reference this region as a target. The number of source operands for 189region\_node::phi nodes is equal to the number of depart nodes that 190reference this region as a target. All depart/repeat nodes for the 191region have unique indices equal to the index of source operand for 192phi/loop\_phi nodes. 193 194First source operand for region\_node::loop\_phi nodes (src[0]) is an 195incoming value that enters the region from the outside. Each remaining 196source operand comes from the corresponding repeat node. 197 198More complex example: 199 200GLSL: 201 202 a = 1; 203 while (a < 5) { 204 a = a * 2; 205 if (b == 3) { 206 continue; 207 } else { 208 a = 6; 209 } 210 if (c == 4) 211 break; 212 a = a + 1; 213 } 214 215IR with SSA form: 216 217 a.1 = 1; 218 region #0 { 219 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2 220 region#0 loop_phi: a.2 = phi a.1, a.6, a.3 221 222 repeat_1 region #0 after { 223 a.3 = a.2 * 2; 224 cond1 = (b == 3); 225 region #1 { 226 depart_0 region #1 after { 227 if (cond1) { 228 repeat_2 region #0; 229 } 230 } 231 a.4 = 6; 232 233 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0 234 } 235 cond2 = (c == 4); 236 region #2 { 237 depart_0 region #2 after { 238 if (cond2) { 239 depart_0 region #0; 240 } 241 } 242 } 243 a.6 = a.5 + 1; 244 } 245 246 region #0 phi: a.7 = phi a.5 // src[0] from depart_0 247 } 248 249Phi nodes with single source operand are just copies, they are not 250really necessary, but this allows to handle all **depart\_node**s in the 251uniform way. 252 253#### Instruction nodes 254 255Instruction nodes represent different kinds of instructions - 256**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains 257the "bc" structure where all fields of the bytecode are stored (the type 258is **bc\_alu** for **alu\_node**, etc). The operands are represented 259using the vectors of pointers to **value** class (node::src, node::dst) 260 261#### SSA-specific nodes 262 263Phi nodes currently don't have special node class, they are stored as 264**node**. Destination vector contains a single destination value, source 265vector contains 1 or more source values. 266 267Psi nodes [[5], [6]](#references) also don't have a special node class 268and stored as **node**. Source vector contains 3 values for each source 269operand - the **value** of predicate, **value** of corresponding 270PRED\_SEL field, and the source **value** itself. 271 272### Indirect addressing 273 274Special kind of values (VLK\_RELREG) is used to represent indirect 275operands. These values don't have SSA versions. The representation is 276mostly based on the [[2]](#references). Indirect operand contains the 277"offset/address" value (value::rel), (e.g. some SSA version of the AR 278register value, though after some passes it may be any value - constant, 279register, etc), also it contains the maydef and mayuse vectors of 280pointers to **value**s (similar to dst/src vectors in the **node**) to 281represent the effects of aliasing in the SSA form. 282 283E.g. if we have the array R5.x ... R8.x and the following instruction : 284 285 MOV R0.x, R[5 + AR].x 286 287then source indirect operand is represented with the VLK\_RELREG value, 288value::rel is AR, value::maydef is empty (in fact it always contain the 289same number of elements as mayuse to simplify the handling, but they are 290NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the 291corresponding SSA versions after ssa\_rename). 292 293Additional "virtual variables" as in [HSSA [2]](#references) are not 294used, also there is no special handling for "zero versions". Typical 295programs in our case are small, indirect addressing is rare, array sizes 296are limited by max gpr number, so we don't really need to use special 297tricks to avoid the explosion of value versions. Also this allows more 298precise liveness computation for array elements without modifications to 299the algorithms. 300 301With the following instruction: 302 303 MOV R[5+AR].x, R0.x 304 305we'll have both maydef and mayuse vectors for dst operand filled with 306array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename 307pass mayuse will contain previous versions, maydef will contain new 308potentially-defined versions. 309 310* * * * * 311 312Passes 313------ 314 315- **bc\_parser** - creates the IR from the source bytecode, 316 initializes src and dst value vectors for instruction nodes. Most 317 ALU nodes have one dst operand and the number of source operands is 318 equal to the number of source operands for the ISA instruction. 319 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst 320 gpr as in the original instruction, other two are pseudo-operands 321 that represent possibly updated predicate and exec\_mask. Predicate 322 values are used in the predicated alu instructions (node::pred), 323 exec\_mask values are used in the if\_nodes (if\_node::cond). Each 324 vector operand in the CF/TEX/VTX instructions is represented with 4 325 values - components of the vector. 326 327- **ssa\_prepare** - creates phi expressions. 328 329- **ssa\_rename** - renames the values (assigns versions). 330 331- **liveness** - liveness computation, sets 'dead' flag for unused 332 nodes and values, optionally computes interference information for 333 the values. 334 335- **dce\_cleanup** - eliminates 'dead' nodes, also removes some 336 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP 337 instructions in the source, containers for ALU groups (they were 338 only needed for the ssa\_rename pass) 339 340- **if\_conversion** - converts control flow with if\_nodes to the 341 data flow in cases where it can improve performance (small alu-only 342 branches). Both branches are executed speculatively and the phi 343 expressions are replaced with conditional moves (CNDxx) to select 344 the final value using the same condition predicate as was used by 345 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx 346 instruction, CNDxx now uses dst[0] from the same PREDSETxx 347 instruction. 348 349- **peephole** - peephole optimizations 350 351- **gvn** - Global Value Numbering [[2]](#references), 352 [[3]](#references) 353 354- **gcm** - Global Code Motion [[3]](#references). Also performs 355 grouping of the instructions of the same kind (CF/FETCH/ALU). 356 357- register allocation passes, some ideas are used from 358 [[4]](#references), but implementation is simplified to make it more 359 efficient in terms of the compilation speed (e.g. no recursive 360 recoloring) while achieving good enough results. 361 362 - **ra\_split** - prepares the program to register allocation. 363 Splits live ranges for constrained values by inserting the 364 copies to/from temporary values, so that the live range of the 365 constrained values becomes minimal. 366 367 - **ra\_coalesce** - performs global allocation on registers used 368 in CF/FETCH instructions. It's performed first to make sure they 369 end up in the same GPR. Also tries to allocate all values 370 involved in copies (inserted by the ra\_split pass) to the same 371 register, so that the copies may be eliminated. 372 373 - **ra\_init** - allocates gpr arrays (if indirect addressing is 374 used), and remaining values. 375 376- **post\_scheduler** - ALU scheduler, handles VLIW packing and 377 performs the final register allocation for local values inside ALU 378 clauses. Eliminates all coalesced copies (if src and dst of the copy 379 are allocated to the same register). 380 381- **ra\_checker** - optional debugging pass that tries to catch basic 382 errors of the scheduler or regalloc, 383 384- **bc\_finalize** - propagates the regalloc information from values 385 in node::src and node::dst vectors to the bytecode fields, converts 386 control flow structure (region/depart/repeat) to the target 387 instructions (JUMP/ELSE/POP, 388 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK). 389 390- **bc\_builder** - builds final bytecode, 391 392* * * * * 393 394References 395---------- 396 397[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl 398McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf) 399 400[2] ["Effective Representation of Aliases and Indirect Memory Operations 401in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark 402Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf) 403 404[3] ["Global Code Motion. Global Value Numbering.", Cliff 405Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf) 406 407[4] ["Register Allocation for Programs in SSA Form", Sebastian 408Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532) 409 410[5] ["An extension to the SSA representation for predicated code", 411Francois de 412Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf) 413 414[6] ["Improvements to the Psi-SSA Representation", F. de 415Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf) 416