
GPU Convolution Kernel Generator
===========================================

# Introduction

The GPU convolution kernel generator is used to generate kernels for different
kinds of convolution for Intel GPUs. The main goals for the generator are:
- Assembly-level performance
- Broad coverage including data types, propagation kinds, hardware, etc.
- Generalization
    - Rely on common building blocks and optimizations
    - Avoid specialization unless it's absolutely necessary

# Convolution

## Generalized Convolution Algorithm

See the [oneDNN documentation](https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html)
for the naming conventions used below.

Convolution has more variations than GEMM, but for simplicity we will rely on
the GEMM naming conventions to come up with a generalized convolution
algorithm. GEMM performs the following operation: `C += A * B`, where:

- `A` is an `(M x K)` matrix
- `B` is a `(K x N)` matrix
- `C` is an `(M x N)` matrix

Algorithmically, any convolution can be expressed in a form which is very
similar to GEMM:

```python
for i0 in range(0, I0):
    ...
    for j0 in range(0, J0):
        ...
        c_val = 0
        for k0 in range(0, K0):
            ...
            a_val = load_A(i0, ..., k0)
            b_val = load_B(k0, ..., j0)
            c_val += a_val * b_val
        store_C(i0, ..., j0, ..., c_val)
```

`i0`, `j0`, and `k0` are `M`, `N` and `K` dimensions respectively:
- `M` dimensions are shared between `A` and `C`
- `N` dimensions are shared between `B` and `C`
- `K` dimensions are shared between `A` and `B`

Convolution may have many `I`/`J`/`K` dimensions.

Let's consider 1D forward convolution:

```python
for mb in range(0, MB):                                          # M
    for ow in range(0, OW):                                      # M
        for oc in range(0, OC):                                  # N
            dst_val = 0
            for ic in range(0, IC):                              # K
                for kw in range(0, KW):                          # K
                    src_val = load_src(mb, ic, ow, kw)
                    wei_val = load_wei(oc, ic, kw)
                    dst_val += src_val * wei_val
            store_dst(mb, oc, ow, dst_val)
```

Here `load_wei()` and `store_dst()` translate ND indices into a 1D offset to
load from/store to. However, `load_src()` must additionally perform a
convolution-specific 4D-to-3D index translation first: `ow` and `kw` are
translated to `iw` using the expression `iw = ow * SW + kw * (DW + 1) - PW`.

Another convolution-specific feature is logical padding support:
`load_src()` should return `0` whenever the `iw` index is out of bounds.

Taking all of the above into account, load/store functions for `ncw`/`oiw`/`ncw`
layouts can be implemented as follows:

```python
def load_src(mb, ic, ow, kw):
    iw = ow * SW + kw * (DW + 1) - PW
    if iw < 0 or iw >= IW:
        return 0
    off = 0
    off += mb * IC * IW
    off += ic * IW
    off += iw
    return src[off]

def load_wei(oc, ic, kw):
    off = 0
    off += oc * IC * KW
    off += ic * KW
    off += kw
    return wei[off]

def store_dst(mb, oc, ow, val):
    off = 0
    off += mb * OC * OW
    off += oc * OW
    off += ow
    dst[off] = val
```
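
As a sanity check, the GEMM-like loop nest and the load/store functions above
can be combined into a tiny runnable reference (a sketch only: the problem
sizes below are arbitrary, and `OW` is derived from the usual output-size
formula, not taken from any real configuration):

```python
import random

# Hypothetical small 1D convolution problem (arbitrary sample sizes).
MB, IC, OC = 2, 3, 4
IW, KW = 8, 3
SW, DW, PW = 1, 0, 1
OW = (IW + 2 * PW - ((KW - 1) * (DW + 1) + 1)) // SW + 1

src = [random.random() for _ in range(MB * IC * IW)]
wei = [random.random() for _ in range(OC * IC * KW)]
dst = [0.0] * (MB * OC * OW)

def load_src(mb, ic, ow, kw):
    iw = ow * SW + kw * (DW + 1) - PW
    if iw < 0 or iw >= IW:
        return 0  # Logical padding: out-of-bound loads return zero.
    return src[mb * IC * IW + ic * IW + iw]

def load_wei(oc, ic, kw):
    return wei[oc * IC * KW + ic * KW + kw]

def store_dst(mb, oc, ow, val):
    dst[mb * OC * OW + oc * OW + ow] = val

# The GEMM-like loop nest: M = (mb, ow), N = (oc), K = (ic, kw).
for mb in range(MB):
    for ow in range(OW):
        for oc in range(OC):
            dst_val = 0.0
            for ic in range(IC):
                for kw in range(KW):
                    dst_val += load_src(mb, ic, ow, kw) * load_wei(oc, ic, kw)
            store_dst(mb, oc, ow, dst_val)
```

Here `(mb, ow)` play the role of M dimensions, `(oc)` of N, and `(ic, kw)` of
K, matching the loop comments above.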

Backward by data and backward by weights convolutions can be expressed in the
same GEMM-like form.

Here are the steps needed to transform any convolution to the GEMM-like form:

- Identify M/N/K loops and map convolution tensors to A/B/C
    - For forward: source -> A, weights -> B, destination -> C
    - For backward by data: destination -> A, weights -> B, source -> C
    - For backward by weights: source -> A, destination -> B, weights -> C
- Describe load/store functions for the tensors. There are two parts:
    - Underlying layout (e.g. blocked layout `NChw32n32c`)
    - Access condition or mask (e.g. `iw >= 0 and iw < IW`). Load/store the
      element if the mask is true. Otherwise, for loads: return zero; for
      stores: drop the store.

Both steps depend on whether the convolution is forward/backward and should be
specialized accordingly. To properly implement loads/stores, the generator
introduces the "view" abstraction, which contains information about the
tensor: its underlying layout and corresponding masks (see the detailed
description below).
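
To make the two parts concrete, here is a minimal Python sketch (the names are
hypothetical, not the generator's API) of a "view" that pairs an underlying
layout with an access mask:

```python
# Hypothetical miniature of the "view" idea: an underlying layout (offset
# computation via strides) plus a per-access mask. Not the generator's API.
class ViewSketch:
    def __init__(self, data, strides, mask):
        self.data = data        # flat storage
        self.strides = strides  # strides of the underlying layout
        self.mask = mask        # access condition, e.g. a bounds check

    def load(self, *idx):
        if not self.mask(*idx):
            return 0  # Masked-out loads return zero (logical padding).
        off = sum(i * s for i, s in zip(idx, self.strides))
        return self.data[off]

# A 1D source tensor of size IW = 4; loads outside [0, IW) are masked.
IW = 4
src = ViewSketch([1.0, 2.0, 3.0, 4.0], strides=(1,),
                 mask=lambda iw: 0 <= iw < IW)
```

With this sketch, `src.load(2)` returns the stored element while `src.load(-1)`
is masked out and returns `0`.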

## GPU Convolution Optimizations

GPU convolution requires a number of optimizations to reach close-to-roofline
performance:

- High-level optimizations
    - Loop order and blocking
        - Including assigning loops to the kernel grid and thread group grid
    - Single SLM buffering
        - Including selecting the SLM layout
    - Load/store decomposition into instructions
        - Block vs scattered messages
        - This also includes mask handling (for out-of-bound and stride
          conditions)
    - Multiplication decomposition into instructions (mad, dp4a, dpas(w))
        - This also includes optional GRF blocking (load -> compute -> load to
          the same GRF buffer -> compute)
        - May require auxiliary reorders to match the instruction layout
- Middle-level optimizations
    - Double/triple SLM buffering with split barriers
    - GRF buffering for SLM loads
- Low-level optimizations
    - Loop unrolling
    - Assigning GRF banks for multiplication buffers to avoid bank conflicts
    - SLM layout padding to avoid SLM write conflicts
    - Transforming dpas to dpasw to reduce SLM traffic and SLM consumption
    - Offset/address arithmetic optimizations

## Kernel Generation High-Level Flow

The generation flow consists of three main stages:

- Creating a kernel skeleton using the intermediate representation (IR). The
  resulting kernel includes the high-level optimizations: loop nest,
  loads/stores, multiplications, SLM buffering.
    - After this stage the kernel is functionally correct, but it needs
      further passes to apply more low-level optimizations and to convert it
      to a form that can be lowered to assembly
- Fine-grained optimizations. This is mostly about applying low-level/local
  optimizations:
    - Transforming single SLM buffering to double/triple buffering
    - Expression simplification
    - Loop hoisting
    - Common subexpression elimination
    - Strength reduction
- Binary code generation. At this stage the kernel is fully optimized and
  needs to be translated to nGEN, which is responsible for binary code
  generation.

# Generator Design

The main modules of the generator are:
- Intermediate representation (IR)
    - Used to describe the convolution kernel
    - The initial IR form contains all high-level optimizations (blocking,
      multiplication decomposition into instructions, etc)
    - Middle-level and low-level optimizations are implemented as IR passes
        - An IR pass takes a kernel (or an IR statement) and returns a
          transformed kernel with some optimizations applied (e.g. with double
          SLM buffering)
- Expression simplification
    - Various algorithms and rules to simplify IR expressions
    - Used for offset simplification
- Binary code generator
    - Performs lowering from IR to nGEN
    - nGEN is used to generate the binary code
- Tensor, layout and view abstractions
    - Layout describes how logical tensor indices are stored in memory
        - Semantically it's the same as the oneDNN blocked memory descriptor
    - View describes a "virtual" tensor
        - A virtual tensor in general doesn't exist in memory; instead, a view
          contains information about how to access elements of such a tensor
        - Views help to express out-of-bound/stride conditions and to
          generalize forward/backward convolution algorithms
        - See the detailed description below

## IR

The generator's IR adopts many ideas from the IR used by the
[Halide](https://halide-lang.org/) project.

All IR objects are immutable by design and use reference counting. The base
class is `object_t`, which implements intrusive reference counting for
`object_impl_t` objects. `object_t` is a wrapper over the real implementation
in `object_impl_t`. All IR objects must have `object_impl_t` in their
inheritance hierarchy as the top-most class.

IR objects support equality comparison: `a.is_equal(b)`. `operator==()` is
reserved and overloaded for boolean comparisons. Additionally, IR objects
provide a `get_hash()` method to allow using them as keys in
`std::unordered_set` and `std::unordered_map` containers; see the
corresponding aliases:

- `object_map_t` - an unordered map with `object_t` as the key
- `object_set_t` - an unordered set with `object_t` as the key

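
The value-based hashing/equality idea can be illustrated with a small Python
analogy (not the C++ API; note that Python reuses `==` for structural
equality, whereas the C++ IR reserves `operator==()` and uses `is_equal()`
instead):

```python
# Python analogy of object_map_t: immutable expression nodes with value-based
# hashing and equality, so structurally equal nodes can serve as map keys.
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

a, b = Var("a"), Var("b")
# A freshly built, structurally equal expression finds the same entry.
expr_map = {Add(a, b): "binding0"}
```

Here `Add(Var("a"), Var("b"))` hashes and compares equal to the stored key, so
`expr_map[Add(Var("a"), Var("b"))]` yields `"binding0"`.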
Main IR objects are:

- Expressions: class `expr_t` (inherited from `object_t`). Examples:
    - Variables: `var_t`
    - Immediate values (constants): `bool_imm_t`, `int_imm_t` and `float_imm_t`
    - Unary/binary/ternary operations: `unary_op_t`, `binary_op_t`,
      `ternary_op_t`
- Statements: class `stmt_t` (inherited from `object_t`). Examples:
    - Let statement: `let_t`. Binds a value to a variable for the scope defined
      by the let statement.
    - Loop statement: `for_t`
    - If statement: `if_t`
    - Function call: `func_call_t`
- Functions: class `func_t` (inherited from `object_t`). A function and its
  argument expressions are used to construct a function call: a statement with
  some side effects. Many GPU assembly constructs are represented with
  functions, for example:
    - Synchronization instructions: barrier-wait, barrier-signal, memory fences
    - FMA instructions: fma, dp4a, dpas(w)
    - Send instruction

IR expressions support operator overloading for convenience of use:

```c++
expr_t a = var_t::make(type_t::s32(), "a");
expr_t b = var_t::make(type_t::s32(), "b");
expr_t c = 2 * (a + b) - a;
expr_t c_simplified = simplify(c);
// (gdb) call c.dump()
// ((2 * (a + b)) - a)
// (gdb) call c_simplified.dump()
// (a + (b * 2))
```

### IR Printing and Debugging

All IR objects provide:

- An overloaded `operator<<` for use with `std::ostream`
- A `str()` method returning a textual representation of the object
- A `dump()` method to print a textual representation of the object under gdb:
    - `call obj.dump()` (under gdb)

All the main IR passes trace the after-pass IR statement when tracing is
enabled (controlled by `LOG_LEVEL`).

The `ir_printer_t` class is mainly responsible for the IR printing-related
logic.

### Functionality to Traverse and Modify IR

A convolution kernel is an IR statement. Most IR objects contain other IR
objects, so in general an IR object can be considered a tree. Some rules
apply:

- Statements can include other statements, expressions and functions
- Expressions can include other expressions but can't contain statements or
  functions

`ir_visitor_t` implements generic functionality to traverse an arbitrary IR
object. Example:

```c++
// Custom visitor to count the total number of loops in the given IR object.
class loop_visitor_t : public ir_visitor_t {
public:
    void _visit(const for_t *obj) override {
        refs++;
        // Visit the loop body to count nested loops.
        ir_visitor_t::_visit(obj);
    }
    int refs = 0;
};

// root_stmt is an IR statement.
loop_visitor_t visitor;
visitor.visit(root_stmt);
```

`ir_mutator_t` is similar to the IR visitor but is used to update IR trees.
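
Because IR objects are immutable, a mutator rebuilds the tree rather than
modifying nodes in place. The following plain-Python miniature (not the
generator's classes) sketches that pattern with a toy `x + 0 -> x`
simplification pass:

```python
# Miniature "mutator" over an immutable expression tree: it returns a rebuilt
# tree, leaving the original untouched, mirroring how ir_mutator_t works.
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

def drop_add_zero(node):
    """Rewrite (x + 0) and (0 + x) to x, rebuilding the tree bottom-up."""
    if isinstance(node, Add):
        lhs = drop_add_zero(node.lhs)
        rhs = drop_add_zero(node.rhs)
        if lhs == 0:
            return rhs
        if rhs == 0:
            return lhs
        return Add(lhs, rhs)
    return node

expr = Add(Add(Var("a"), 0), Var("b"))
assert drop_add_zero(expr) == Add(Var("a"), Var("b"))
```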

## Expression Simplification

To be added.

## Binary Code Generator

The main logic for code generation is implemented as an IR visitor, in the
`ir_to_ngen_t` class. The final IR is very close to assembly, so the
generation process is straightforward. Some examples of how different IR
objects are handled:

- Let statement (to introduce variables and bind them to a value)
    - The register allocator is used to allocate a subregister for the variable
    - The variable is initialized either with a `mov` instruction or the value
      is evaluated in the subregister directly
    - The expression binding (`expr_binding_t`) is updated to bind the IR
      variable object to the subregister (to be able to access it in nested
      statements/expressions later)
    - The nested statement of the let statement is visited
    - The subregister is released after traversing the nested statement
- Expressions
    - Expression evaluation is handled by the `expr_evaluator_t` class
    - An expression is evaluated recursively. For each expression:
        - Its operands are evaluated (and bound to subregisters if needed)
        - The expression itself is evaluated
    - Sometimes we want to compute an expression in a pre-allocated subregister
      (for example when lowering a let statement). This case is also supported
      by `expr_evaluator_t`.

Additionally, the generator implements extra logic for functionality such as:

- Instruction emulation. Some platforms lack support for some instructions.
  Examples:
    - 64-bit arithmetic emulation. This is not handled by the generator itself
      and is implemented in `gpu/jit/gemm/emulation.hpp`.
    - The `add3` instruction. Emulated as two `add` instructions on older
      architectures.
- GRF region restrictions. Example:
    - `d` <-> `q` or `d` <-> `b` conversions require aligning the smaller data
      type's GRF region to match the other data type
- Direct implementation of IR functions. Examples:
    - Reorders between GRF layouts. For simplicity, reorders are emitted in
      place. In the kernel IR they are represented as function calls.
    - Reduction of a GRF buffer. Similar to the GRF reorder.

## Tensor, Layout and View

Tensor, layout and view are the core abstractions of the generator.

**Tensor** - describes a tensor with offsets (stored as IR expressions).
Example: a `32x32x1x1` tensor with `[mb_idx, ic_idx, 0, 0]` offsets
(activations for 2D convolution: `N x C x H x W`).

**Layout** - describes a memory layout; contains a physical representation of
a tensor. Layout properties:

- Data type
- Number of dimensions
- Offset to the start of the tensor (in elements of the data type)
    - Same as `offset0` in `dnnl_memory_desc_t`
- Layout blocks
    - Blocks are stored with their dimension index, block size and stride
        - Outer blocks and non-blocked dimensions are also fully specified
          with dedicated blocks
    - Example: `4n2c7h7w32n32c` (6 blocks) (`NChw32n32c` in oneDNN convention)

**View** - describes a "virtual" tensor (view) with its underlying
tensor/layout:

- The view tensor `V` has `m` dimensions: `V(v0, v1, ..., v(m-1))`
- The underlying tensor `T` has `n` dimensions: `T(t0, t1, ..., t(n-1))`
- The mapping from view dimensions to tensor dimensions is defined by special
  functions:
    - `t_j = F_j(v0, v1, ..., v(m-1))`
- M/N/K dimension kinds (GEMM behavior) for `V` dimensions
- Each `t_j` dimension may have an associated access mask
    - When the mask evaluates to false, the element is assumed to be `0`

View example: 2D convolution, 3x3 filter:

- View `V` has 6 dimensions: `mb`, `ic`, `oh`, `ow`, `kh` and `kw`
- Tensor `T` has 4 dimensions: `mb`, `ic`, `ih`, `iw`
- Mapping from view to tensor:
    - `mb` is directly mapped (`t_0 = v_0`)
    - `ic` is directly mapped (`t_1 = v_1`)
    - `ih = oh * SH + kh * (DH + 1) - PH`
    - `iw = ow * SW + kw * (DW + 1) - PW`
- M/N/K dimension kinds:
    - M dimensions: `mb`, `oh`, `ow`
    - K dimensions: `ic`, `kh`, `kw`
- Access masks:
    - `mb` mask: empty
    - `ic` mask: empty
    - `ih` mask: `ih >= 0 and ih < IH`
    - `iw` mask: `iw >= 0 and iw < IW`
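
The `ih` mapping and mask above can be evaluated directly (a toy illustration;
the `IH`/`SH`/`DH`/`PH` values below are arbitrary sample values, not taken
from any real configuration):

```python
# Toy evaluation of the view's index function and mask for the 2D example.
IH, SH, DH, PH = 5, 1, 0, 1

def ih_from_view(oh, kh):
    # t_j = F_j(v): map view dims (oh, kh) to the tensor dim ih.
    return oh * SH + kh * (DH + 1) - PH

def ih_mask(oh, kh):
    ih = ih_from_view(oh, kh)
    return 0 <= ih < IH
```

For example, `ih_mask(0, 0)` is false (the first filter row touches the top
padding), while `ih_mask(1, 0)` and `ih_mask(0, 1)` are true.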

The view abstraction encapsulates the computation semantics, including
convolution-specific strides, out-of-bound conditions and M/N/K dimension
kinds. An outer loop nest plus defined A/B/C views for the inner blocked
multiplication are enough to fully describe the convolution computation in
terms of the algorithm.

## Kernel Generation Flow

### Configuring Kernel Parameters

This is performed during primitive descriptor initialization.

The kernel configuration is defined by a `config_t` object. The configuration
includes:

- Convolution problem description (`conv_problem_t`)
    - Propagation kind, batch size, input/output channels, etc
- Implementation-specific kernel parameters. Examples:
    - Block sizes
    - SLM buffering parameters: enabled/disabled, single/double/triple
    - FMA instruction to use: mad, dp4a, dpas(w)

Kernel parameters are set depending on the architecture, propagation kind,
problem shape, etc.

The configuration object is then passed to the kernel builder, which
generates the kernel according to the configured parameters.

### Initializing Kernel ABI (kernel_arg_info_t)

This is performed during primitive descriptor initialization.

The kernel ABI defines the order of the kernel arguments, their types and
their bindings to IR expressions.

During kernel creation/generation, `kernel_arg_info_t` is used to:

1) Set up the kernel interface via nGEN
2) Access the IR expressions corresponding to the kernel arguments

During execution, `kernel_arg_info_t` is used to set the arguments according
to the kernel ABI.

### IR Generation

This and further steps are performed during primitive initialization.

The `kernel_builder_t` class is responsible for the whole kernel IR
generation. There are other builder classes which are responsible for more
specialized functionality, for example:

- A builder to construct load-multiply statements
- A builder to decompose view loads/stores into messages

#### Forming Loop Nest and A/B/C Views

The generation starts with creating the outer loop nest by forming the
corresponding IR statements. The order of loops and the blocking schemes are
hard-coded for different propagation kinds, though some parameters are
configurable, such as block sizes and the thread group size.

For simplicity there are some conventions and limitations:

- The outer loop nest is mapped to the kernel grid (a grid of thread groups,
  supporting up to 3 dimensions)
    - In many cases this requires packing several problem-specific dimensions
      into one. In this case the kernel must implement unpacking using
      division and modulus operations.
- The outer loop nest typically includes M and N dimensions. It may also
  include K dimensions - in this case we have a partial reduction, and the
  kernel must use atomic stores for C updates.
- Convention: a thread group doesn't have outer loops across M or N
  dimensions, only across K dimensions.
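
The packing convention mentioned above can be sketched as follows (the
dimension sizes are hypothetical; the real generator emits IR let statements
for the division/modulus):

```python
# Sketch of packing two problem dimensions (mb, ow) into one kernel-grid
# dimension and unpacking them with division/modulus. Sizes are arbitrary.
MB_BLOCKS, OW_BLOCKS = 4, 6

def pack(mb_idx, ow_idx):
    return mb_idx * OW_BLOCKS + ow_idx

def unpack(grid_idx):
    return grid_idx // OW_BLOCKS, grid_idx % OW_BLOCKS

# Round-trip: every packed grid index unpacks to the original pair.
assert all(unpack(pack(mb, ow)) == (mb, ow)
           for mb in range(MB_BLOCKS) for ow in range(OW_BLOCKS))
```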

According to these rules, the generic looping scheme looks as follows
(assuming one dimension per M/N/K, for simplicity):

```python
for m_tg_idx in range(0, m, m_tg_blk):         # Mapped to the kernel grid
    for n_tg_idx in range(0, n, n_tg_blk):     # Mapped to the kernel grid
        for k_tg_idx in range(0, k, k_tg_blk): # Mapped to the kernel grid
            ...
            for k_idx in range(k_tg_idx, k_tg_idx + k_tg_blk, k_blk): # Loop inside a thread
                ...
                # Perform C += A * B multiplication cooperatively by a thread group
                # A is (m_tg_blk x    k_blk)
                # B is (   k_blk x n_tg_blk)
                # C is (m_tg_blk x n_tg_blk)
```

After this step we have the following blocks:
- Let statements to unpack M/N/K indices from the kernel grid
- IR statements for the explicit reduction loops (inside a thread)
- A/B/C views describing the thread-group-level blocked matrix multiplication
    - These views contain sizes, M/N/K semantics, access masks and the
      underlying layout

With these blocks the representation is generic enough that all further steps
in the flow are common between different propagation kinds.

#### SLM Buffering, Loads, Blocked Multiplication, Epilogue and Final Store

`compute_builder_t` is responsible for the generation of the innermost blocked
computation and the final store of the C tensor. According to the `config_t`
object, the builder generates the following kernel parts:

- SLM loads and stores (when SLM buffering is enabled):
    - Define the SLM layout. Use an FMA-friendly layout; if reorders are
      necessary, perform them earlier, between the loads from global memory
      and the stores to SLM
    - Split a view between all threads in the thread group for cooperative
      loads/stores
        - Sometimes only part of the thread group should participate in
          loads/stores - use a thread group sub-grid and conditions to guard
          the loads/stores
    - Add barriers before and after the stores to SLM
- Loads from SLM/global memory and multiplication
    - Split A/B thread group views across the thread group grid (`X x Y x 1`)
        - Split the M dimension across the Y dimension of the grid
        - Split the N dimension across the X dimension of the grid
        - Each thread computes an `(m_thr x n_thr)` tile of the C tensor
            - `m_thr = m_tg / Y`
            - `n_thr = n_tg / X`
    - Split the per-thread blocked multiplication into sub-tiles according to
      the configuration (to reduce GRF consumption and reuse GRF buffers)
        - Typically `b_sub_tiles > 1` is used (the B tile is split into
          sub-tiles)
    - Generate loads (from SLM or global memory) for the A/B tensors
    - Generate GRF-to-GRF reorders (when needed) to match the FMA layout. This
      is mainly needed for dpas.
    - Generate IR function calls matching FMA instructions
    - Apply the dpas to dpasw transformation (if possible). Save the GRF
      permutation `grf_permutator_t` to restore registers back after applying
      dpasw. This permutation will be applied to the final C result.
    - Restore the per-thread C tensor in terms of problem layout/dimensions
        - The per-thread C tensor is `(m_thr x n_thr)`, but the way M/N
          dimensions are mapped back to the problem dimensions completely
          depends on how A/B were split across the grid. `mnk_mapper_t` is
          used to track that.
- Epilogue (`epilogue_builder_t`), used when the convolution includes bias,
  post-ops or output scales
    - Bias/post-ops/output scales are handled similarly by `post_op_builder_t`
    - C is split into blocks to apply post-ops
    - The flow for applying post-ops is:
        - Pre-load the corresponding blocks of right-hand-side tensors to GRF
          (for binary post-ops or for bias/output scales)
        - Convert C to `f32`
        - Generate IR statements to handle all post-ops step by step
        - Convert the updated C to the final data type
- Final stores to global memory
    - Generate stores (maybe atomic stores, for partial reduction)

#### More Optimizations and IR Passes

At this step the kernel is functionally correct. Now more transformations and
optimizations need to be applied:

- Injecting double/triple SLM buffering
    - `simple_slm_buffering_injector_t` or `unrolled_slm_buffering_injector_t`
      is used to convert single buffering to double/triple SLM buffering
      according to some rules
- Expression simplification
- Let optimization
    - Remove unused or redundant let statements
- Loop hoisting
- Common subexpression elimination
- Strength reduction (only applied with unrolled SLM buffering)
- Generating proper send headers
    - Initial `send_t` function calls contain byte offsets. Message headers
      are generated according to the message specification.
- Peephole optimizations
    - `add` + `add` -> `add3`
    - `mul` + `add` -> `mad`
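
To illustrate the peephole idea, here is a toy pattern matcher over a
hypothetical three-address instruction list (not the generator's IR; a real
pass would also verify that the intermediate result has no other uses before
fusing):

```python
# Toy peephole pass: fuse a dependent add+add pair into add3.
# Instruction format: (op, dst, srcs). This sketch does not check whether
# the intermediate value is used elsewhere.
def fuse_add3(instrs):
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None and cur[0] == "add" and nxt[0] == "add"
                and cur[1] in nxt[2]):
            # Replace t = a + b; d = t + c with d = add3(a, b, c).
            other = [s for s in nxt[2] if s != cur[1]]
            out.append(("add3", nxt[1], cur[2] + other))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

prog = [("add", "t", ["a", "b"]), ("add", "d", ["t", "c"])]
```

Running `fuse_add3(prog)` collapses the two dependent adds into a single
`add3` instruction.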

### Binary Code Generation

At this step the kernel in IR form includes all optimizations. The binary
code generator is implemented similarly to other nGEN-based kernels; the main
differences are related to IR usage. The binary generator flow steps are
described below:

- Configuring the kernel interface:
    - Use the `require*()` API to specify/enable kernel features such as the
      SIMD size, SLM size, dpas, barrier, etc
    - List the kernel parameters according to `kernel_arg_info_t` (the kernel
      ABI)
- Expression binding initialization for thread grid IDs and kernel arguments
    - This binds the IR variables for external parameters to the
      corresponding registers
    - Further, any reference to such an external variable is resolved based on
      the expression binding
- Visiting the kernel IR statement. The IR tree structure is recursively
  traversed and the corresponding instructions are emitted using nGEN calls.

#### Lowering IR to nGEN

`ir_to_ngen_t` implements the IR visitor interface and walks through the
whole kernel. Handling of `for_t`, `if_t`, `let_t` and similar statements
follows the same pattern. First, we evaluate the related conditions, values
and loop bounds; during evaluation they are bound to registers. After that we
can form the statement using the proper instructions, e.g. `cmp`/`jmpi` for a
loop or `if`/`else`/`endif` for an if statement. The body statement is
visited and lowered to nGEN recursively.
588