1IR3 NOTES
2=========
3
4Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx.  The same shader ISA is present, with some small differences, in adreno a4xx.
5
Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set.  However, the compiler is responsible, in most cases, for scheduling the instructions.  The hardware does not try to hide the shader core pipeline stages.  For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs).  When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots.  But that leaves a lot of edge cases where things fall over, like:
7
8::
9
10  ADD TEMP[0], TEMP[1], TEMP[2]
11  MUL TEMP[0], TEMP[1], TEMP[0].wzyx
12
Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler, which translated nearly literally from TGSI to ir3, had a strong tendency to fall over.
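
The delay-slot arithmetic behind this example can be sketched in a few lines of C.  This is only an illustration, not code from the compiler: ``count_nops()`` and ``CAT2_DELAY`` are hypothetical names, and the model assumes each vec4 operation expands to four scalar instructions issued in x, y, z, w order with nothing else available to fill the gaps.

.. code-block:: c

  /* From the text above: a cat2 result needs three intervening
   * instructions (or nops) before it can be consumed. */
  #define CAT2_DELAY 3

  /* Hypothetical helper.  The first vec4 op writes components 0..3
   * (x..w) in issue slots 0..3.  Component i of the second vec4 op
   * reads component swz[i] of that result.  Returns the number of nops
   * needed so that every read has CAT2_DELAY slots between it and its
   * producer. */
  static unsigned count_nops(const unsigned swz[4])
  {
    unsigned slot = 4;   /* next free issue slot after the first op */
    unsigned nops = 0;

    for (unsigned i = 0; i < 4; i++) {
      /* the producer of component swz[i] was issued in slot swz[i] */
      unsigned earliest = swz[i] + CAT2_DELAY + 1;

      if (slot < earliest) {
        nops += earliest - slot;
        slot = earliest;
      }
      slot++;   /* issue the consumer itself */
    }
    return nops;
  }

With the identity swizzle ``{0, 1, 2, 3}`` no ``nop``\s are needed, while the ``.wzyx`` swizzle from the example, ``{3, 2, 1, 0}``, needs three of them before the first ``mul``.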
14
So the current compiler instead generates, in the frontend, a directed acyclic graph of instructions and basic blocks, which then goes through various additional passes to eventually schedule the instructions and do register assignment.
16
17For additional documentation about the hardware, see wiki: `a3xx ISA
18<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.
19
20External Structure
21------------------
22
23``ir3_shader``
24    A single vertex/fragment/etc shader from gallium perspective (i.e.
25    maps to a single TGSI shader), and manages a set of shader variants
26    which are generated on demand based on the shader key.
27
28``ir3_shader_key``
29    The configuration key that identifies a shader variant.  I.e. based
30    on other GL state (two-sided-color, render-to-alpha, etc) or render
31    stages (binning-pass vertex shader) different shader variants are
32    generated.
33
34``ir3_shader_variant``
35    The actual hw shader generated based on input TGSI and shader key.
36
37``ir3_compiler``
38    Compiler frontend which generates ir3 and runs the various backend
39    stages to schedule and do register assignment.
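
The relationship between shader, key and variant can be sketched as a lookup that creates variants on demand.  The types and the ``get_variant()`` function below are hypothetical, much-reduced stand-ins (the real key has many more fields, and creating a variant means running the compiler rather than just allocating a struct).

.. code-block:: c

  #include <stdlib.h>

  /* Hypothetical stand-in for ir3_shader_key, with two example fields. */
  struct sketch_key {
    unsigned binning_pass;
    unsigned color_two_side;
  };

  /* Hypothetical stand-in for ir3_shader_variant. */
  struct sketch_variant {
    struct sketch_key key;
    struct sketch_variant *next;
  };

  /* Hypothetical stand-in for ir3_shader: owns a list of variants. */
  struct sketch_shader {
    struct sketch_variant *variants;
    unsigned num_variants;
  };

  /* Return the variant matching 'key', creating it on first use.
   * Returns NULL on allocation failure. */
  static struct sketch_variant *
  get_variant(struct sketch_shader *shader, const struct sketch_key *key)
  {
    struct sketch_variant *v;

    for (v = shader->variants; v; v = v->next)
      if (v->key.binning_pass == key->binning_pass &&
          v->key.color_two_side == key->color_two_side)
        return v;

    v = calloc(1, sizeof(*v));
    if (!v)
      return NULL;

    v->key = *key;   /* the real thing would compile the shader here */
    v->next = shader->variants;
    shader->variants = v;
    shader->num_variants++;
    return v;
  }

Asking twice for the same key returns the same variant, so the cost of generating a variant is only paid the first time a given state combination is seen.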
40
41The IR
42------
43
44The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s).  But there are a few extensions, in the form of meta_ instructions.  And additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value.  So, for example, the following TGSI shader:
45
46::
47
48  VERT
49  DCL IN[0]
50  DCL IN[1]
51  DCL OUT[0], POSITION
52  DCL TEMP[0], LOCAL
53    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
54    2: MOV OUT[0], TEMP[0].xxxx
55    3: END
56
57eventually generates:
58
59.. graphviz::
60
61  digraph G {
62  rankdir=RL;
63  nodesep=0.25;
64  ranksep=1.5;
65  subgraph clusterdce198 {
66  label="vert";
67  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
68  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
69  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
70  inputdce198:<in2>:w -> instrdcedd0:<src0>
71  inputdce198:<in6>:w -> instrdcedd0:<src1>
72  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
73  inputdce198:<in1>:w -> instrdcec30:<src0>
74  inputdce198:<in5>:w -> instrdcec30:<src1>
75  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
76  inputdce198:<in0>:w -> instrdceb60:<src0>
77  inputdce198:<in4>:w -> instrdceb60:<src1>
78  instrdceb60:<dst0> -> instrdcec30:<src2>
79  instrdcec30:<dst0> -> instrdcedd0:<src2>
80  instrdcedd0:<dst0> -> instrdcf348:<src0>
81  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
82  instrdcedd0:<dst0> -> instrdcf400:<src0>
83  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
84  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
85  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
86  instrdcf348:<dst0> -> outputdce198:<out0>:e
87  instrdcf400:<dst0> -> outputdce198:<out1>:e
88  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
89  instrdcedd0:<dst0> -> outputdce198:<out3>:e
90  }
91  }
92
93(after scheduling, etc, but before register assignment).
94
95Internal Structure
96~~~~~~~~~~~~~~~~~~
97
98``ir3_block``
99    Represents a basic block.
100
    TODO: currently blocks are nested, but I think I need to change that
    to a more conventional arrangement before implementing proper flow
    control.  Currently the only flow control handled is if/else, which
    gets flattened out with results chosen by ``sel`` instructions.
105
106``ir3_instruction``
107    Represents a machine instruction or meta_ instruction.  Has pointers
108    to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
109    as needed.
110
111``ir3_register``
112    Represents a src or dst register, flags indicate const/relative/etc.
113    If ``IR3_REG_SSA`` is set on a src register, the actual register
114    number (name) has not been assigned yet, and instead the ``instr``
115    field points to src instruction.
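
To make the relationship between these structures concrete, here is a reduced sketch.  The ``ssa_*`` types, the ``SKETCH_REG_SSA`` flag and both helper functions are hypothetical simplifications for illustration; the real structures carry many more fields.

.. code-block:: c

  #include <stddef.h>

  #define SKETCH_REG_SSA 0x1   /* stands in for IR3_REG_SSA */

  struct ssa_instr;

  /* Reduced stand-in for ir3_register. */
  struct ssa_reg {
    unsigned flags;
    unsigned num;             /* register number, once assigned */
    struct ssa_instr *instr;  /* producer, valid if SSA flag is set */
  };

  /* Reduced stand-in for ir3_instruction. */
  struct ssa_instr {
    const char *name;
    unsigned regs_count;
    struct ssa_reg *regs[4];  /* regs[0] is dst, regs[1..n] are srcs */
  };

  /* Count the sources of 'instr' that are still in SSA form. */
  static unsigned count_ssa_srcs(const struct ssa_instr *instr)
  {
    unsigned n = 0;

    for (unsigned i = 1; i < instr->regs_count; i++)
      if (instr->regs[i]->flags & SKETCH_REG_SSA)
        n++;
    return n;
  }

  /* Return the instruction producing source 'n' (counted from 0), or
   * NULL if that source does not exist or is not an SSA value. */
  static struct ssa_instr *ssa_src(const struct ssa_instr *instr, unsigned n)
  {
    if (n + 1 >= instr->regs_count)
      return NULL;
    if (!(instr->regs[n + 1]->flags & SKETCH_REG_SSA))
      return NULL;
    return instr->regs[n + 1]->instr;
  }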
116
117In addition there are various util macros/functions to simplify manipulation/traversal of the graph:
118
119``foreach_src(srcreg, instr)``
120    Iterate each instruction's source ``ir3_register``\s
121
122``foreach_src_n(srcreg, n, instr)``
123    Like ``foreach_src``, also setting ``n`` to the source number (starting
124    with ``0``).
125
126``foreach_ssa_src(srcinstr, instr)``
127    Iterate each instruction's SSA source ``ir3_instruction``\s.  This skips
128    non-SSA sources (consts, etc), but includes virtual sources (such as the
129    address register if `relative addressing`_ is used).
130
131``foreach_ssa_src_n(srcinstr, n, instr)``
132    Like ``foreach_ssa_src``, also setting ``n`` to the source number.
133
134For example:
135
136.. code-block:: c
137
138  foreach_ssa_src_n(src, i, instr) {
139    unsigned d = delay_calc_srcn(ctx, src, instr, i);
140    delay = MAX2(delay, d);
141  }
142
143
144TODO probably other helper/util stuff worth mentioning here
145
146.. _meta:
147
148Meta Instructions
149~~~~~~~~~~~~~~~~~
150
151**input**
    Used for shader inputs (registers configured in the command-stream
    to hold particular input values, written by the shader core before
    start of execution).  Also used for connecting up values within a
    basic block to an output of a previous block.
156
157**output**
158    Used to hold outputs of a basic block.
159
160**flow**
161    TODO
162
163**phi**
164    TODO
165
166**collect**
    Groups registers which need to be assigned to consecutive scalar
    registers, for example ``sam`` (texture fetch) src instructions (see
    `register groups`_) or array element dereference
    (see `relative addressing`_).
171
172**split**
    The counterpart to **collect**.  When an instruction such as ``sam``
    writes multiple components, this splits the result into individual
    scalar components to be consumed by other instructions.
176
177
178.. _`flow control`:
179
180Flow Control
181~~~~~~~~~~~~
182
183TODO
184
185
186.. _`register groups`:
187
188Register Groups
189~~~~~~~~~~~~~~~
190
191Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers.  In the simplest example:
192
193::
194
195  sam (f32)(xyz)r2.x, r0.z, s#0, t#0
196
197for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.
198
199Before register assignment, to group the two components of the texture src together:
200
201.. graphviz::
202
203  digraph G {
204    { rank=same;
205      collect;
206    };
207    { rank=same;
208      coord_x;
209      coord_y;
210    };
211    sam -> collect [label="regs[1]"];
212    collect -> coord_x [label="regs[1]"];
213    collect -> coord_y [label="regs[2]"];
214    coord_x -> coord_y [label="right",style=dotted];
215    coord_y -> coord_x [label="left",style=dotted];
216    coord_x [label="coord.x"];
217    coord_y [label="coord.y"];
218  }
219
220The frontend sets up the SSA ptrs from ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values.  And the grouping_ pass sets up the ``left`` and ``right`` neighbor pointers to the ``collect``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.
221
222And likewise, for the consecutive scalar registers for the destination:
223
224.. graphviz::
225
226  digraph {
227    { rank=same;
228      A;
229      B;
230      C;
231    };
232    { rank=same;
233      split_0;
234      split_1;
235      split_2;
236    };
237    A -> split_0;
238    B -> split_1;
239    C -> split_2;
240    split_0 [label="split\noff=0"];
241    split_0 -> sam;
242    split_1 [label="split\noff=1"];
243    split_1 -> sam;
244    split_2 [label="split\noff=2"];
245    split_2 -> sam;
246    split_0 -> split_1 [label="right",style=dotted];
247    split_1 -> split_0 [label="left",style=dotted];
248    split_1 -> split_2 [label="right",style=dotted];
249    split_2 -> split_1 [label="left",style=dotted];
250    sam;
251  }
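
How a later pass might use the neighbor pointers can be sketched as follows.  The ``nb_instr`` type and ``assign_group()`` are hypothetical and only show the idea of walking ``left`` to the start of a group and then handing out consecutive scalar registers along ``right``; they are not taken from the register assignment pass.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical reduced instruction: only the neighbor pointers and
   * the assigned scalar register. */
  struct nb_instr {
    struct nb_instr *left, *right;
    int reg;   /* assigned scalar register, -1 if none yet */
  };

  /* Assign consecutive scalar registers, starting at 'base', to the
   * whole group that 'instr' belongs to.  Returns the group size. */
  static unsigned assign_group(struct nb_instr *instr, int base)
  {
    unsigned n = 0;

    /* find the leftmost member of the group */
    while (instr->left)
      instr = instr->left;

    /* walk right, handing out consecutive registers */
    for (; instr; instr = instr->right)
      instr->reg = base + (int)n++;

    return n;
  }

Starting from any of the three ``split`` instructions in the graph above yields the same result, since the walk first rewinds to the leftmost neighbor.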
252
253.. _`relative addressing`:
254
255Relative Addressing
256~~~~~~~~~~~~~~~~~~~
257
258Most instructions support addressing indirectly (relative to address register) into const or gpr register file in some or all of their src/dst registers.  In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).
259
260    Note that cat5 (texture sample) instructions are the notable exception, not
261    supporting relative addressing of src or dst.
262
263Relative addressing of the const file (for example, a uniform array) is relatively simple.  We don't do register assignment of the const file, so all that is required is to schedule things properly.  I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.
264
265But relative addressing of gpr file (which can be as src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers).  And in the case of relative dst, subsequent instructions now depend on both the relative write, as well as the previous instruction which wrote that register, since we do not know at compile time which actual register was written.
266
267Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s).  This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).
268
    Note that ``nop``\s for timing constraints, type specifiers (e.g.
    ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.
271
272::
273
274  mova a0.x, hr1.y
275  sub r1.y, r2.x, r3.x
276  add r0.x, r1.y, c<a0.x + 2>
277
278results in:
279
280.. graphviz::
281
282  digraph {
283    rankdir=LR;
284    sub;
285    const [label="const file"];
286    add;
287    mova;
288    add -> mova;
289    add -> sub;
290    add -> const [label="off=2"];
291  }
292
293The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.
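
The constraint the scheduler has to respect can be expressed as a small check over a finished instruction order.  This is a sketch under simplifying assumptions: ``addr_instr`` and ``addr_schedule_ok()`` are hypothetical names, and the ``address`` field here only mirrors the idea of the optional ``address`` pointer described above.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical reduced instruction. */
  struct addr_instr {
    int writes_addr;             /* nonzero for a mova-like instruction */
    struct addr_instr *address;  /* the a0.x write this depends on, or NULL */
  };

  /* Returns 1 if, in program order, every user of a0.x sees the value
   * written by the instruction its 'address' pointer names, i.e. no
   * other a0.x write was scheduled in between.  Returns 0 otherwise. */
  static int addr_schedule_ok(struct addr_instr **list, unsigned count)
  {
    struct addr_instr *live = NULL;   /* most recent a0.x writer */

    for (unsigned i = 0; i < count; i++) {
      struct addr_instr *instr = list[i];

      if (instr->address && instr->address != live)
        return 0;
      if (instr->writes_addr)
        live = instr;
    }
    return 1;
  }

An order such as ``mova, use, mova, use`` passes the check, while ``mova, mova, use, use`` fails because the first ``a0.x`` value is overwritten before its user runs.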
294
295To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
296which will be register allocated to consecutive hardware registers.  The array
297access uses the id field in the ``ir3_register`` to map to the array being
298accessed, and the offset field for the fixed offset within the array.  A NIR
299indirect register read such as:
300
301::
302
303  decl_reg vec2 32 r0[2]
304  ...
305  vec2 32 ssa_19 = mov r0[0 + ssa_9]
306
307
308results in:
309
310::
311
312  0000:0000:001:  shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
313  0000:0000:002:  mov.s16s16 hr61.x, hssa_19
314  0000:0000:002:  mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]
315  0000:0000:002:  mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]
316
317
318Array writes write to the array in ``instr->regs[0]->array.id``.  A NIR indirect
319register write such as:
320
321::
322
323  decl_reg vec2 32 r0[2]
324  ...
325  r0[0 + ssa_12] = mov ssa_13
326
327results in:
328
329::
330
331  0000:0000:001:  shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
332  0000:0000:002:  mov.s16s16 hr61.x, hssa_29
333  0000:0000:001:  mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002:  mov.s16s16]
334  0000:0000:004:  mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002:  mov.s16s16]
335
336Note that only cat1 (mov) can do indirect write, and thus NIR register stores
337may need to introduce an extra mov.
338
ir3 array accesses in the DAG get serialized by ``instr->barrier_class``,
which contains ``IR3_BARRIER_ARRAY_W`` or ``IR3_BARRIER_ARRAY_R``.
341
342Shader Passes
343-------------
344
After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_.  Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.
346
347    Note that we essentially have ~256 scalar registers in the
348    architecture (although larger register usage will at some thresholds
349    limit the number of threads which can run in parallel).  And at some
350    point we will have to deal with spilling.
351
352.. _flatten:
353
354Flatten
355~~~~~~~
356
357In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.
358
359
360.. _`copy propagation`:
361
362Copy Propagation
363~~~~~~~~~~~~~~~~
364
365Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources.  And the CP pass simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).
366
The eventual plan is to invert that, with the front-end inserting no ``mov``\s and the CP pass legalizing things.
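
The "simple ``mov``" test and the effect of removing such instructions can be sketched as below.  The ``cp_instr`` type, the flag names and ``cp_resolve()`` are hypothetical; the sketch only follows the criteria named above (same src and dst type, no abs/neg flags) and ignores the other cases hidden behind the "etc".

.. code-block:: c

  #include <stddef.h>

  enum cp_type { CP_F32, CP_F16 };

  #define CP_ABS 0x1   /* hypothetical source modifier flags */
  #define CP_NEG 0x2

  /* Hypothetical reduced instruction with a single source. */
  struct cp_instr {
    int is_mov;
    enum cp_type src_type, dst_type;
    unsigned src_flags;
    struct cp_instr *src;   /* producer of the source value */
  };

  /* A mov is "simple" if it neither converts nor modifies its source. */
  static int is_simple_mov(const struct cp_instr *instr)
  {
    return instr->is_mov &&
           instr->src_type == instr->dst_type &&
           !(instr->src_flags & (CP_ABS | CP_NEG));
  }

  /* Skip over any chain of simple movs and return the instruction
   * that really produces the value. */
  static struct cp_instr *cp_resolve(struct cp_instr *instr)
  {
    while (is_simple_mov(instr) && instr->src)
      instr = instr->src;
    return instr;
  }

A consumer whose source is rewritten to ``cp_resolve(src)`` no longer references the simple ``mov``\s, which then become dead.  Converting or negating ``mov``\s are left in place.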
368
369
370.. _grouping:
371
372Grouping
373~~~~~~~~
374
In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up.  In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted.  This ensures that there is some possible valid `register assignment`_ at the later stages.
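
A deliberately conservative sketch of this idea follows.  The ``grp_instr`` type and ``group_srcs()`` are hypothetical, and unlike the description above the sketch treats *any* existing neighbor as a conflict, which may insert more ``mov``\s than strictly necessary but keeps the logic short.

.. code-block:: c

  #include <stdlib.h>

  /* Hypothetical reduced instruction. */
  struct grp_instr {
    struct grp_instr *left, *right;  /* neighbor pointers */
    struct grp_instr *mov_src;       /* non-NULL if this is an inserted mov */
  };

  /* Group srcs[0..n-1] as consecutive neighbors.  A source that already
   * belongs to another group, or that repeats the previous source, is
   * replaced in 'srcs' by a newly allocated mov of it.  Returns the
   * number of movs inserted, or -1 on allocation failure. */
  static int group_srcs(struct grp_instr **srcs, unsigned n)
  {
    int movs = 0;

    for (unsigned i = 0; i < n; i++) {
      struct grp_instr *instr = srcs[i];
      int dup = (i > 0) && (instr == srcs[i - 1]);

      if (instr->left || instr->right || dup) {
        struct grp_instr *mov = calloc(1, sizeof(*mov));
        if (!mov)
          return -1;
        mov->mov_src = instr;
        srcs[i] = instr = mov;
        movs++;
      }

      /* link with the (already final) previous source */
      if (i > 0) {
        srcs[i - 1]->right = instr;
        instr->left = srcs[i - 1];
      }
    }
    return movs;
  }

Because a fresh ``mov`` has no neighbors of its own, it can always be placed in the new group, and the original instruction keeps its place in the group it already belonged to.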
376
377
378.. _depth:
379
380Depth
381~~~~~
382
In the depth pass, a depth is calculated for each instruction node within its basic block.  An instruction's depth is its own required cycles (the delay slots needed between two instructions, plus one) plus the max depth of any of its source instructions.  (meta_ instructions don't add to the depth).  As an instruction's depth is calculated, it is inserted into a per-block list sorted by deepest instruction.  Unreachable instructions and inputs are marked.
384
385    TODO: we should probably calculate both hard and soft depths (?) to
386    try to coax additional instructions to fit in places where we need
387    to use sync bits, such as after a texture fetch or SFU.
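
The depth calculation described above can be sketched as a memoized recursion over the source pointers.  The ``depth_instr`` type and ``calc_depth()`` are hypothetical, the delay is modeled as a fixed per-instruction value, and the sorted per-block list is left out.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical reduced instruction. */
  struct depth_instr {
    unsigned delay;     /* delay slots needed before a consumer */
    int is_meta;        /* meta instructions do not add to the depth */
    unsigned num_srcs;
    struct depth_instr *srcs[3];
    unsigned depth;     /* result, valid once 'visited' is set */
    int visited;
  };

  /* depth = own cycles (delay + 1, or 0 for meta) + max source depth */
  static unsigned calc_depth(struct depth_instr *instr)
  {
    unsigned max_src = 0;

    if (instr->visited)
      return instr->depth;

    for (unsigned i = 0; i < instr->num_srcs; i++) {
      unsigned d = calc_depth(instr->srcs[i]);
      if (d > max_src)
        max_src = d;
    }

    instr->depth = max_src + (instr->is_meta ? 0 : instr->delay + 1);
    instr->visited = 1;
    return instr->depth;
  }

For a chain of three dependent cat2 instructions with three delay slots each, fed by a meta input, this gives depths of 4, 8 and 12.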
388
389.. _scheduling:
390
391Scheduling
392~~~~~~~~~~
393
394After the grouping_ pass, there are no more instructions to insert or remove.  Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots.  Insert ``nop``\s as required.
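
The recursive part of this can be sketched as follows.  The ``sched_instr`` and ``sched_ctx`` types and the ``schedule()`` function are hypothetical; the sketch always pads with ``nop``\s (it only counts them) and does not try to fill delay slots with other ready instructions.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical reduced instruction. */
  struct sched_instr {
    unsigned delay;     /* slots required before a consumer may issue */
    unsigned num_srcs;
    struct sched_instr *srcs[3];
    int slot;           /* issue slot, -1 until scheduled */
  };

  struct sched_ctx {
    int next_slot;      /* next free issue slot in the block */
    unsigned nops;      /* number of nops inserted so far */
  };

  /* Schedule 'instr' after all of its sources plus their delay slots. */
  static void schedule(struct sched_ctx *ctx, struct sched_instr *instr)
  {
    int earliest = 0;

    if (instr->slot >= 0)
      return;   /* already scheduled */

    for (unsigned i = 0; i < instr->num_srcs; i++) {
      struct sched_instr *src = instr->srcs[i];
      int ready;

      schedule(ctx, src);
      ready = src->slot + (int)src->delay + 1;
      if (ready > earliest)
        earliest = ready;
    }

    /* pad with nops until the sources' results are available */
    while (ctx->next_slot < earliest) {
      ctx->nops++;
      ctx->next_slot++;
    }

    instr->slot = ctx->next_slot++;
  }

Scheduling a single instruction that depends on one cat2-style producer with three delay slots places the producer in slot 0, inserts three ``nop``\s, and places the consumer in slot 4.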
395
396.. _`register assignment`:
397
398Register Assignment
399~~~~~~~~~~~~~~~~~~~
400
401TODO
402
403
404