# MLIR: The case for a simplified polyhedral form

MLIR embraces polyhedral compiler techniques for their many advantages in
representing and transforming dense numerical kernels, but it uses a form that
differs significantly from other polyhedral frameworks.

**Disclaimer / Warning**

This document is a very early design proposal (which has since been accepted)
that explored the tradeoffs of using this simplified form vs the traditional
polyhedral schedule list form. At some point, this document could be dusted off
and written as a proper academic paper, but until then, it is better to include
it in this rough form than not at all. Beware that this document uses archaic
syntax and should not be considered a canonical reference to modern MLIR.

## Introduction

This document discusses general goals of the project, introduces context and
the two alternatives, then talks about the tradeoffs of these designs. Written
by Chris Lattner.

## General goals of an IR, and goals of mlfunc's specifically

Our currently planned representation for MLIR consists of two kinds of
functions: an LLVM-like "CFG Function" and an "ML Function": a function
represented in multidimensional loop form. The idea is that a CFG function is
capable of full generality for expressing arbitrary computation, but is awkward
for loop transformations. In contrast, mlfunc's are limited (e.g. to control
flow involving loop nests over affine spaces), but these limitations make them
much easier to transform and analyze, particularly for the set of computations
in a machine learning kernel.

The design of an intermediate representation is an optimization problem, which
makes intentional tradeoffs that aim to make certain kinds of compiler
transformations simple. After all, it is "possible" to do almost any
transformation on any IR: we could theoretically do loop transformations on
assembly language. OTOH, such transformations would take too long to write,
would be fragile due to irrelevant changes, would be difficult to maintain, and
difficult to make target independent. Performing transformations on the "right
level" of IR makes it much easier to do analysis and transformation of code,
and can make them faster by reducing the size of the IR, and eliminating
possibilities that would otherwise have to be considered.

This is the reason we're interested in adding polyhedral techniques to an IR in
the first place: though our base "CFG function" representation is fully capable
of expressing any computation, it is "too" expressive. The limitations imposed
by polyhedral techniques (e.g. on affine loop bounds and array subscripts)
define a closed algebra that can represent an interesting range of
transformations and their compositions, and because of their simplicity, we can
perform (e.g.) dependence analysis more efficiently and more reliably.

This raises an important question that this document examines: given that we
are introducing a redundant and limited way to express code and
transformations, exactly what form is best to perform the analyses and
transformations we want?

We explore two different design points that are capable of expressing the same
class of affine loop computations, but which use different representational
forms. These forms trade off verbosity, ease of transformation, and ease of
analysis in interesting ways.

## Context: Traditional Polyhedral Form

We started by discussing a representation that uses the traditional polyhedral
schedule set + domain representation, e.g. consider C-like code such as:

```c
  void simple_example(...) {
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        float tmp = X[i][j];  // S1
        A[i][j] = tmp + 1;    // S2
        B[i][j] = tmp * 42;   // S3
      }
    }
  }
```

The polyhedral representation doesn't care about the actual computations, so we
will abstract them into S1/S2/S3 in the discussion below. Originally, we
planned to represent this with a classical form like (syntax details are not
important and probably slightly incorrect below):

```mlir
  mlfunc @simple_example(... %N) {
    %tmp = call @S1(%X, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 0)

    call @S2(%tmp, %A, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 1)

    call @S3(%tmp, %B, %i, %j)
      domain: (0 <= %i < %N), (0 <= %j < %N)
      schedule: (i, j, 2)
  }
```

In this design, an mlfunc is an unordered bag of instructions whose execution
order is fully controlled by their schedule.

However, we recently agreed that a more explicit schedule tree representation
is a better fit for our needs, because it exposes important structure that will
make analyses and optimizations more efficient, and also makes the scoping of
SSA values more explicit. This leads us to a representation along the lines of:

```mlir
  mlfunc @simple_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0), S3(d0) {
      for S1(d1), S2(d1), S3(d1) {

        %tmp = call @S1(%X, d0, d1)      ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S3(%tmp, %B, d0, d1)      ;; S3
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }
```

This change makes the nesting structure of the loops an explicit part of the
representation, and makes lexical ordering within a loop significant
(eliminating the constant 0/1/2 suffixes of the schedules above).

It isn't obvious in the example above, but the representation allows for some
interesting features, including the ability for instructions within a loop nest
to have non-equal domains, like this example, where the second instruction
ignores the outermost 10 points of its loops:

```mlir
  mlfunc @reduced_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0) {
      for S1(d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)    ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
      }
    }
  }
```

It also allows schedule remapping within the instruction, like this example
that introduces a diagonal skew through a simple change to the schedules of the
two instructions:

```mlir
  mlfunc @skewed_domain_example(... %N) {
    d0/d1 = mlspace
    for S1(d0), S2(d0+d1) {
      for S1(d0+d1), S2(d1) {
        %tmp = call @S1(%X, d0, d1)    ;; S1
          domain: (0 <= d0 < %N), (0 <= d1 < %N)

        call @S2(%tmp, %A, d0, d1)      ;; S2
          domain: (0 <= d0 < %N), (0 <= d1 < %N)
      }
    }
  }
```

This form has great power, and the polyhedral code generator (which lowers from
an mlfunc to a cfgfunc representation) handles all of this complexity, so
passes that perform loop transformations don't have to explicitly manipulate
the looping structure.

## Proposal: Simplified Polyhedral Form

This document proposes and explores the idea of going one step further, moving
all of the domain and schedule information into the "schedule tree". In this
form, we would have a representation where all instructions inside of a given
for-loop are known to have the same domain, which is maintained by the loop. In
the simplified form, we also have an "if" instruction that takes an affine
condition.

Our simple example above would be represented as:

```mlir
  mlfunc @simple_example(... %N) {
    affine.for %i = 0 ... %N step 1 {
      affine.for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affine.apply #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        call @S2(%tmp, %A, %0, %1)

        call @S3(%tmp, %B, %0, %1)
      }
    }
  }
```

The example with the reduced domain would be represented with an if
instruction:

```mlir
  mlfunc @reduced_domain_example(... %N) {
    affine.for %i = 0 ... %N step 1 {
      affine.for %j = 0 ... %N step 1 {
        // identity noop in this case, but can exist in general.
        %0,%1 = affine.apply #57(%i, %j)

        %tmp = call @S1(%X, %0, %1)

        if (10 <= %i < %N-10), (10 <= %j < %N-10) {

          %2,%3 = affine.apply(%i, %j)    // identity noop in this case

          call @S2(%tmp, %A, %2, %3)
        }
      }
    }
  }
```
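
For reference, here is a hedged C rendering of the reduced-domain computation
that both forms above encode, reusing the statement bodies from
`simple_example`:

```c
// S1 runs on the full domain; S2 runs only on the interior points.
void reduced_domain_example(int N, float X[N][N], float A[N][N]) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      float tmp = X[i][j];                                // S1
      if (10 <= i && i < N - 10 && 10 <= j && j < N - 10)
        A[i][j] = tmp + 1;                                // S2
    }
  }
}
```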

These IRs represent exactly the same information, and use a similar information
density. The 'traditional' form introduces an extra level of abstraction
(schedules and domains) that makes it easy to transform instructions at the
expense of making it difficult to reason about how those instructions will come
out after code generation. With the simplified form, transformations have to do
parts of code generation inline with their transformation: instead of simply
changing a schedule to **(i+j, j)** to get skewing, you'd have to generate this
code explicitly (potentially implemented by making polyhedral codegen a library
that transformations call into):

```mlir
mlfunc @skewed_domain_example(... %N) {
  affine.for %t1 = 0 ... 2*%N-2 step 1 {
    affine.for %t2 = max(0, %t1-%N+1) ... min(%N, %t1) step 1 {
      (%i, %j) = (%t1-%t2, %t2)
      ...
    }
  }
}
```
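
For intuition, here is a hedged C sketch of the same skewed iteration space,
applied uniformly to the body of `simple_example` with the max/min bounds
spelled out as ternaries; `%t1` corresponds to `i + j` and `%t2` to `j`:

```c
// Diagonal skew of the N x N nest: t1 = i + j, t2 = j, so i = t1 - t2.
void skewed_example(int N, float X[N][N], float A[N][N], float B[N][N]) {
  for (int t1 = 0; t1 <= 2 * N - 2; ++t1) {
    int lo = t1 - N + 1 > 0 ? t1 - N + 1 : 0;  // max(0, t1-N+1)
    int hi = t1 < N - 1 ? t1 : N - 1;          // min(N-1, t1)
    for (int t2 = lo; t2 <= hi; ++t2) {
      int i = t1 - t2, j = t2;                 // recover original indices
      float tmp = X[i][j];                     // S1
      A[i][j] = tmp + 1;                       // S2
      B[i][j] = tmp * 42;                      // S3
    }
  }
}
```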

## Evaluation

Both of these forms are capable of expressing the same class of computation:
multidimensional loop nests with affine loop bounds and affine memory
references. That said, they pose very different tradeoffs in other ways.

### Commonality: can express same computation

Both of these can express the same sorts of computation, e.g. kernels written
in one form are representable in the other form in all cases.

### Commonality: dependence analysis

These representations both use affine functions for data layout mapping and
access subscripts, and dependence analysis works the same way.
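
To make this concrete, here is a hedged, hypothetical stencil (not one of the
examples above) showing the affine subscripts that a dependence test consumes;
either representation exposes exactly the same functions:

```c
// Write subscript f(i,j) = (i, j); read subscript g(i,j) = (i, j-1).
// Solving f(i,j) = g(i',j') over the iteration domain gives i' = i and
// j' = j + 1: a loop-carried dependence with distance vector (0, 1),
// carried by the j loop. Because the subscripts are affine, this is an
// integer linear problem in both forms.
void stencil(int N, float A[N][N]) {
  for (int i = 0; i < N; ++i)
    for (int j = 1; j < N; ++j)
      A[i][j] = A[i][j - 1] + 1.0f;
}
```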

### Commonality: difficulty of determining optimal transformation series

One major challenge in optimizing this sort of code is choosing the ordering
and behavior of the various loop transformations that get applied. There are
non-local effects of every decision, and neither representation helps solve
this inherently hard problem.


### Commonality: compactness of IR

In the cases that are most relevant to us (hyper-rectangular spaces) these
forms are directly equivalent: a traditional instruction with a limited domain
(e.g. the "reduced_domain_example" above) ends up having one level of ML 'if'
inside its loops. The simplified form pays for the explicit 'if' by eliminating
schedules and domains from the IR. Both forms allow code duplication to reduce
dynamic branches in the IR: the traditional approach allows instruction
splitting, the simplified form supports instruction duplication.

It is important to point out that the traditional form wins on compactness in
the extreme cases: e.g. the loop skewing case. These cases will be rare in
practice for our workloads, and are exactly the cases where downstream
transformations want the IR to be explicit about what is happening.

### Simplicity of code generation

A key final stage of an mlfunc is its conversion to a CFG function, which is
required as part of lowering to the target machine. The simplified form has a
clear advantage here: the IR has a direct correspondence to the structure of
the generated code.

In contrast, the traditional form has significant complexity in the lowering
process to a CFG function, because the structure that is left implicit in the
IR needs to be materialized during code generation. Code generation from ISL
shows that this is possible, but it is a non-trivial transformation.
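
As a hedged illustration of that direct correspondence, here is roughly what a
branch-based CFG lowering of one `affine.for` from `simple_example` could look
like, written as C with explicit labels (a sketch, not the actual lowering):

```c
// Sketch of CFG-form lowering for: affine.for %j = 0 ... %N step 1 { ... }
// The loop becomes a header block with a bound check plus a back-edge; the
// structure of the simplified IR maps one-to-one onto these blocks.
void lowered_inner_loop(int N, int i, float X[N][N], float A[N][N],
                        float B[N][N]) {
  int j = 0;
loop_header:
  if (j >= N) goto loop_exit;  // affine upper bound becomes a branch
  {
    float tmp = X[i][j];       // S1
    A[i][j] = tmp + 1;         // S2
    B[i][j] = tmp * 42;        // S3
  }
  j += 1;                      // step 1
  goto loop_header;
loop_exit:
  return;
}
```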

### Ease of transformation

An advantage for the traditional form is that it is easier to perform certain
transformations on it: skewing and tiling are just transformations on the
schedules of the instructions in question, and don't require changing the loop
structure.

In practice, the simplified form requires moving the complexity of code
generation into the transformations themselves - this is sometimes trivial,
sometimes involved. The author believes that this should be possible by making
the code generation algorithms themselves be library functions that
transformations call into, instead of an opaque block that happens at the end
of the mlfunc processing.

Also, the sorts of transformations performed today by XLA (including tiling,
padding, unrolling, and other rectangular transformations) should be easy
enough to implement on either representation. The only cases that are a
challenge are more advanced cases like skewing, e.g. for DMA data movement
generation.

### Ease of analysis: Cost models

The simplified form is much easier for analyses and transformations to build
cost models for (e.g. answering the question of "how much code bloat will be
caused by unrolling a loop at this level?"), because it is easier to predict
what target code will be generated. With the traditional form, these analyses
will have to anticipate what polyhedral codegen will do to a set of
instructions under consideration: something that is non-trivial in the
interesting cases in question (see "Cost of code generation").
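
As a minimal sketch of the unrolling question, assuming `N % 4 == 0` for
brevity: unrolling the inner loop of `simple_example` by 4 multiplies the
static size of the body by the unroll factor, and that factor is directly
readable off the simplified IR:

```c
// Inner loop of simple_example unrolled by 4: the body is duplicated four
// times, so the code-bloat estimate is simply (unroll factor) x (body size).
void unrolled_example(int N, float X[N][N], float A[N][N], float B[N][N]) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; j += 4) {
      float t0 = X[i][j];     A[i][j]     = t0 + 1; B[i][j]     = t0 * 42;
      float t1 = X[i][j + 1]; A[i][j + 1] = t1 + 1; B[i][j + 1] = t1 * 42;
      float t2 = X[i][j + 2]; A[i][j + 2] = t2 + 1; B[i][j + 2] = t2 * 42;
      float t3 = X[i][j + 3]; A[i][j + 3] = t3 + 1; B[i][j + 3] = t3 * 42;
    }
  }
}
```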

### Cost of code generation

State of the art polyhedral code generation is
[expensive and complicated](https://lirias.kuleuven.be/bitstream/123456789/497238/1/toplas-astgen.pdf),
sometimes with exponential time complexity. We expect that most machine
learning workloads will be hyper-rectangular, and thus it should be easy to
specialize for the important cases. That said, the traditional polyhedral
representation makes it very easy to introduce complicated and expensive
schedules, and provides no way to understand and project a cost model for using
them. All downstream clients of the IR need to be prepared to handle the full
generality of IR that may come to them.

The simplified form defines this away: the concepts in the IR remain simple,
and the code much more directly reflects the cost model for lowering to CFG
functions and machine code. This is expected to be very important in the late
stages of a code generator for an accelerator.

### SSA in ML Functions

We already agree that values defined in an mlfunc can include scalar values and
that they are defined based on traditional dominance. In the simplified form,
this is very simple: arguments and induction variables defined in for-loops are
live inside their lexical body, and linear series of instructions have the same
"top down" dominance relation that a basic block does.

In the traditional form though, this is not the case: it seems that a lot of
knowledge about how codegen will emit the code is necessary to determine if SSA
form is correct or not. For example, this is invalid code:

```mlir
  %tmp = call @S1(%X, %0, %1)
    domain: (10 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)

  call @S2(%tmp, %A, %0, %1)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j)
```

This is invalid because `%tmp` isn't defined on some iterations of the `%i`
loop: `@S2` uses it on iterations where `@S1` (whose domain starts at 10) never
executes.
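
In C terms, a hedged rendering of the same mistake, reusing the statement
bodies from `simple_example`:

```c
// The traditional-form SSA problem rendered in C: tmp is assigned only under
// S1's narrower domain (10 <= i < N) but read under S2's full domain.
void invalid_ssa_example(int N, float X[N][N], float A[N][N]) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      float tmp;            // %tmp
      if (i >= 10)          // S1's domain: 10 <= i < N
        tmp = X[i][j];      // S1
      A[i][j] = tmp + 1;    // S2: reads tmp uninitialized when i < 10
    }
  }
}
```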

This matters because it makes the verifier more complicated, but more
significantly, it means that load promotion and other optimizations that will
produce SSA form will need to be aware of this and be able to model what
codegen does.

An emergent property of this that we discussed recently is that PHI nodes in
mlfunc's (if we support them) will also have to have domains.

### Lack of redundancy in IR

The traditional form has multiple encodings for the same sorts of behavior: you
end up having bits on `affine.for` loops to specify whether codegen should use
"atomic/separate" policies, unroll loops, etc. Instructions can be split or can
generate multiple copies of themselves because of overlapping domains, etc.

This is a problem for analyses and cost models, because they each have to
reason about these additional forms in the IR.

### Suitability to purpose: lowering to machine code

One of the main drivers for this work is lowering to low-level accelerator
code, including two-dimensional vectorization, insertion of DMAs, and other
utilization of the matrix accelerator units. In the author's opinion, the extra
compactness of the traditional form is a negative for this purpose: reasoning
about the generated machine code requires understanding the mapping from mlfunc
to lowered code, which means that every such analysis must understand what code
generation will do.

In the simplified form, the effect of "code generation" is always obvious from
the IR itself, which should make it easier to perform vectorization to target
instructions and other analyses we need to perform.

## Third Alternative: two different levels of mlfunc

One hybrid alternative is to support both the traditional and simplified forms
of mlfunc in our IR.

The stages could look like this, for example:

1.  Early performance transformations could be done on the traditional form.
1.  Partial code generation lowers to the simplified form.
1.  Target specific lowering phases for tiling, and vectorization and other 2D
    transforms that don't benefit much from the traditional form could be run.
1.  Final codegen to a cfg func can be done when all of the instructions are
    replaced with ones valid on the target.

While this is possible, it isn't clear what would justify the complexity of
this approach. Unless there is a super compelling reason for this, it would be
nice to not do this. **Update:** we discussed this as a design team and agreed
that this wouldn't be a good way to go.