1r600-sb 2======= 3 4* * * * * 5 6Debugging 7--------- 8 9### Environment variables 10 11- **R600\_DEBUG** 12 13 There are new flags: 14 15 - **nosb** - Disable sb backend for graphics shaders 16 - **sbcl** - Enable optimization of compute shaders (experimental) 17 - **sbdry** - Dry run, optimize but use source bytecode - 18 useful if you only want to check shader dumps 19 without the risk of lockups and other problems 20 - **sbstat** - Print optimization statistics (only time so far) 21 - **sbdump** - Print IR after some passes. 22 - **sbnofallback** - Abort on errors instead of fallback 23 - **sbdisasm** - Use sb disassembler for shader dumps 24 - **sbsafemath** - Disable unsafe math optimizations 25 26### Regression debugging 27 28If there are any regressions as compared to the default backend 29(R600\_SB=0), it's possible to use the following environment variables 30to find the incorrectly optimized shader that causes the regression. 31 32- **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some 33 shaders 34 - 0 - disabled (default) 35 - 1 - skip optimization for the shaders in the range 36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is, 37 optimize only the shaders that are not in this range 38 - 2 - optimize only the shaders in the range 39 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END] 40 41- **R600\_SB\_DSKIP\_START** - start of the range (1-based) 42 43- **R600\_SB\_DSKIP\_END** - end of the range (1-based) 44 45Example - optimize only the shaders 5, 6, and 7: 46 47 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2 48 49All shaders compiled by the application are numbered starting from 1, 50the number of shaders used by the application may be obtained by running 51it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#" 52for each compiled shader. 53 54After figuring out the total number of shaders used by the application, 55the variables above allow to use bisection to find the shader that is 56the cause of regression. E.g. if the application uses 100 shaders, we 57can divide the range [1; 100] and run the application with the 58optimization enabled only for the first half of the shaders: 59 60 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app> 61 62If the regression is reproduced with these parameters, then the failing 63shader is in the range [1; 50], if it's not reproduced - then it's in 64the range [51; 100]. Then we can divide the new range again and repeat 65the testing, until we'll reduce the range to a single failing shader. 66 67*NOTE: This method relies on the assumption that the application 68produces the same sequence of the shaders on each run. It's not always 69true - some applications may produce different sequences of the shaders, 70in such cases the tools like apitrace may be used to record the trace 71with the application, then this method may be applied when replaying the 72trace - also this may be faster and/or more convenient than testing the 73application itself.* 74 75* * * * * 76 77Intermediate Representation 78--------------------------- 79 80### Values 81 82All kinds of the operands (literal constants, references to kcache 83constants, references to GPRs, etc) are currently represented by the 84**value** class (possibly it makes sense to switch to hierarchy of 85classes derived from **value** instead, to save some memory). 86 87All values (except some pseudo values like the exec\_mask or predicate 88register) represent 32bit scalar values - there are no vector values, 89CF/FETCH instructions use groups of 4 values for src and dst operands. 90 91### Nodes 92 93Shader programs are represented using the tree data structure, some 94nodes contain a list of subnodes. 95 96#### Control flow nodes 97 98Control flow information is represented using four special node types 99(based on the ideas from [[1]](#references) ) 100 101- **region\_node** - single-entry, single-exit region. 102 103 All loops and if's in the program are enclosed in region nodes. 104 Region nodes have two containers for phi nodes - 105 region\_node::loop\_phi contains the phi expressions to be executed 106 at the region entry, region\_node::phi contains the phi expressions 107 to be executed at the region exit. It's the only type of the node 108 that contains associated phi expressions. 109 110- **depart\_node** - "depart region \$id after { ... }" 111 112 Depart target region (jump to exit point) after executing contained 113 code. 114 115- **repeat\_node** - "repeat region \$id after { ... }" 116 117 Repeat target region (jump to entry point) after executing contained 118 code. 119 120- **if\_node** - "if (cond) { ... }" 121 122 Execute contained code if condition is true. The difference from 123 [[1]](#references) is that we don't have associated phi expressions 124 for the **if\_node**, we enclose **if\_node** in the 125 **region\_node** and store corresponding phi's in the 126 **region\_node**, this allows more uniform handling. 127 128The target region of depart and repeat nodes is always the region where 129they are located (possibly in the nested region), there are no arbitrary 130jumps/goto's - control flow in the program is always structured. 131 132Typical control flow constructs can be represented as in the following 133examples: 134 135GLSL: 136 137 if (cond) { 138 < 1 > 139 } else { 140 < 2 > 141 } 142 143IR: 144 145 region #0 { 146 depart region #0 after { 147 if (cond) { 148 depart region #0 after { 149 < 1 > 150 } 151 } 152 < 2 > 153 } 154 <region #0 phi nodes > 155 } 156 157GLSL: 158 159 while (cond) { 160 < 1 > 161 } 162 163IR: 164 165 region #0 { 166 <region #0 loop_phi nodes> 167 repeat region #0 after { 168 region #1 { 169 depart region #1 after { 170 if (!cond) { 171 depart region #0 172 } 173 } 174 } 175 < 1 > 176 } 177 <region #0 phi nodes> 178 } 179 180'Break' and 'continue' inside the loops are directly translated to the 181depart and repeat nodes for the corresponding loop region. 182 183This may look a bit too complicated, but in fact this allows more simple 184and uniform handling of the control flow. 185 186All loop\_phi and phi nodes for some region always have the same number 187of source operands. The number of source operands for 188region\_node::loop\_phi nodes is 1 + number of repeat nodes that 189reference this region as a target. The number of source operands for 190region\_node::phi nodes is equal to the number of depart nodes that 191reference this region as a target. All depart/repeat nodes for the 192region have unique indices equal to the index of source operand for 193phi/loop\_phi nodes. 194 195First source operand for region\_node::loop\_phi nodes (src[0]) is an 196incoming value that enters the region from the outside. Each remaining 197source operand comes from the corresponding repeat node. 198 199More complex example: 200 201GLSL: 202 203 a = 1; 204 while (a < 5) { 205 a = a * 2; 206 if (b == 3) { 207 continue; 208 } else { 209 a = 6; 210 } 211 if (c == 4) 212 break; 213 a = a + 1; 214 } 215 216IR with SSA form: 217 218 a.1 = 1; 219 region #0 { 220 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2 221 region#0 loop_phi: a.2 = phi a.1, a.6, a.3 222 223 repeat_1 region #0 after { 224 a.3 = a.2 * 2; 225 cond1 = (b == 3); 226 region #1 { 227 depart_0 region #1 after { 228 if (cond1) { 229 repeat_2 region #0; 230 } 231 } 232 a.4 = 6; 233 234 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0 235 } 236 cond2 = (c == 4); 237 region #2 { 238 depart_0 region #2 after { 239 if (cond2) { 240 depart_0 region #0; 241 } 242 } 243 } 244 a.6 = a.5 + 1; 245 } 246 247 region #0 phi: a.7 = phi a.5 // src[0] from depart_0 248 } 249 250Phi nodes with single source operand are just copies, they are not 251really necessary, but this allows to handle all **depart\_node**s in the 252uniform way. 253 254#### Instruction nodes 255 256Instruction nodes represent different kinds of instructions - 257**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains 258the "bc" structure where all fields of the bytecode are stored (the type 259is **bc\_alu** for **alu\_node**, etc). The operands are represented 260using the vectors of pointers to **value** class (node::src, node::dst) 261 262#### SSA-specific nodes 263 264Phi nodes currently don't have special node class, they are stored as 265**node**. Destination vector contains a single destination value, source 266vector contains 1 or more source values. 267 268Psi nodes [[5], [6]](#references) also don't have a special node class 269and stored as **node**. Source vector contains 3 values for each source 270operand - the **value** of predicate, **value** of corresponding 271PRED\_SEL field, and the source **value** itself. 272 273### Indirect addressing 274 275Special kind of values (VLK\_RELREG) is used to represent indirect 276operands. These values don't have SSA versions. The representation is 277mostly based on the [[2]](#references). Indirect operand contains the 278"offset/address" value (value::rel), (e.g. some SSA version of the AR 279register value, though after some passes it may be any value - constant, 280register, etc), also it contains the maydef and mayuse vectors of 281pointers to **value**s (similar to dst/src vectors in the **node**) to 282represent the effects of aliasing in the SSA form. 283 284E.g. if we have the array R5.x ... R8.x and the following instruction : 285 286 MOV R0.x, R[5 + AR].x 287 288then source indirect operand is represented with the VLK\_RELREG value, 289value::rel is AR, value::maydef is empty (in fact it always contain the 290same number of elements as mayuse to simplify the handling, but they are 291NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the 292corresponding SSA versions after ssa\_rename). 293 294Additional "virtual variables" as in [HSSA [2]](#references) are not 295used, also there is no special handling for "zero versions". Typical 296programs in our case are small, indirect addressing is rare, array sizes 297are limited by max gpr number, so we don't really need to use special 298tricks to avoid the explosion of value versions. Also this allows more 299precise liveness computation for array elements without modifications to 300the algorithms. 301 302With the following instruction: 303 304 MOV R[5+AR].x, R0.x 305 306we'll have both maydef and mayuse vectors for dst operand filled with 307array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename 308pass mayuse will contain previous versions, maydef will contain new 309potentially-defined versions. 310 311* * * * * 312 313Passes 314------ 315 316- **bc\_parser** - creates the IR from the source bytecode, 317 initializes src and dst value vectors for instruction nodes. Most 318 ALU nodes have one dst operand and the number of source operands is 319 equal to the number of source operands for the ISA instruction. 320 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst 321 gpr as in the original instruction, other two are pseudo-operands 322 that represent possibly updated predicate and exec\_mask. Predicate 323 values are used in the predicated alu instructions (node::pred), 324 exec\_mask values are used in the if\_nodes (if\_node::cond). Each 325 vector operand in the CF/TEX/VTX instructions is represented with 4 326 values - components of the vector. 327 328- **ssa\_prepare** - creates phi expressions. 329 330- **ssa\_rename** - renames the values (assigns versions). 331 332- **liveness** - liveness computation, sets 'dead' flag for unused 333 nodes and values, optionally computes interference information for 334 the values. 335 336- **dce\_cleanup** - eliminates 'dead' nodes, also removes some 337 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP 338 instructions in the source, containers for ALU groups (they were 339 only needed for the ssa\_rename pass) 340 341- **if\_conversion** - converts control flow with if\_nodes to the 342 data flow in cases where it can improve performance (small alu-only 343 branches). Both branches are executed speculatively and the phi 344 expressions are replaced with conditional moves (CNDxx) to select 345 the final value using the same condition predicate as was used by 346 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx 347 instruction, CNDxx now uses dst[0] from the same PREDSETxx 348 instruction. 349 350- **peephole** - peephole optimizations 351 352- **gvn** - Global Value Numbering [[2]](#references), 353 [[3]](#references) 354 355- **gcm** - Global Code Motion [[3]](#references). Also performs 356 grouping of the instructions of the same kind (CF/FETCH/ALU). 357 358- register allocation passes, some ideas are used from 359 [[4]](#references), but implementation is simplified to make it more 360 efficient in terms of the compilation speed (e.g. no recursive 361 recoloring) while achieving good enough results. 362 363 - **ra\_split** - prepares the program to register allocation. 364 Splits live ranges for constrained values by inserting the 365 copies to/from temporary values, so that the live range of the 366 constrained values becomes minimal. 367 368 - **ra\_coalesce** - performs global allocation on registers used 369 in CF/FETCH instructions. It's performed first to make sure they 370 end up in the same GPR. Also tries to allocate all values 371 involved in copies (inserted by the ra\_split pass) to the same 372 register, so that the copies may be eliminated. 373 374 - **ra\_init** - allocates gpr arrays (if indirect addressing is 375 used), and remaining values. 376 377- **post\_scheduler** - ALU scheduler, handles VLIW packing and 378 performs the final register allocation for local values inside ALU 379 clauses. Eliminates all coalesced copies (if src and dst of the copy 380 are allocated to the same register). 381 382- **ra\_checker** - optional debugging pass that tries to catch basic 383 errors of the scheduler or regalloc, 384 385- **bc\_finalize** - propagates the regalloc information from values 386 in node::src and node::dst vectors to the bytecode fields, converts 387 control flow structure (region/depart/repeat) to the target 388 instructions (JUMP/ELSE/POP, 389 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK). 390 391- **bc\_builder** - builds final bytecode, 392 393* * * * * 394 395References 396---------- 397 398[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl 399McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf) 400 401[2] ["Effective Representation of Aliases and Indirect Memory Operations 402in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark 403Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf) 404 405[3] ["Global Code Motion. Global Value Numbering.", Cliff 406Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf) 407 408[4] ["Register Allocation for Programs in SSA Form", Sebastian 409Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532) 410 411[5] ["An extension to the SSA representation for predicated code", 412Francois de 413Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf) 414 415[6] ["Improvements to the Psi-SSA Representation", F. de 416Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf) 417