# MLIR: The case for a simplified polyhedral form

MLIR embraces polyhedral compiler techniques for their many advantages in
representing and transforming dense numerical kernels, but it uses a form that
differs significantly from other polyhedral frameworks.

**Disclaimer / Warning**

This document is a very early design proposal (which has since been accepted)
that explored the tradeoffs of using this simplified form vs the traditional
polyhedral schedule list form. At some point, this document could be dusted off
and written as a proper academic paper, but for now, it is better to include it
in this rough form than not at all. Beware that this document uses archaic
syntax and should not be considered a canonical reference to modern MLIR.

## Introduction

This document discusses the general goals of the project, introduces context and
the two alternatives, then talks about the tradeoffs of these designs. Written by
Chris Lattner.

## General goals of an IR, and goals of mlfunc's specifically

Our currently planned representation for MLIR consists of two kinds of
functions: an LLVM-like "CFG Function" and an "ML Function": a function
represented in multidimensional loop form. The idea is that a CFG function is
capable of full generality for expressing arbitrary computation, but is awkward
for loop transformations. In contrast, mlfunc's are limited (e.g. to control
flow involving loop nests over affine spaces), but these limitations make them
much easier to transform and analyze, particularly for the set of computations
in a machine learning kernel.

The design of an intermediate representation is an optimization problem: it
makes intentional tradeoffs that aim to make certain kinds of compiler
transformations simple. After all, it is "possible" to do almost any
transformation on any IR: we could theoretically do loop transformations on
assembly language. OTOH, such transformations would take too long to write,
would be fragile due to irrelevant changes, would be difficult to maintain, and
difficult to make target independent. Performing transformations on the "right
level" of IR makes analysis and transformation of code much easier, and can
make them faster by reducing the size of the IR and eliminating possibilities
that would otherwise have to be considered.

This is the reason we're interested in adding polyhedral techniques to an IR in
the first place: though our base "CFG function" representation is fully capable
of expressing any computation, it is "too" expressive. The limitations imposed
by polyhedral techniques (e.g. on affine loop bounds and array subscripts)
define a closed algebra that can represent an interesting range of
transformations and their compositions, and because of their simplicity, we can
perform (e.g.) dependence analysis more efficiently and more reliably.

This raises an important question that this document examines: given that we are
introducing a redundant and limited way to express code and transformations,
exactly what form is best for performing the analyses and transformations we
want?

We explore two different design points that are capable of expressing the same
class of affine loop computations, but which use different representational
forms. These forms trade off verbosity, ease of transformation, and ease of
analysis in interesting ways.
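To make the affine restriction concrete before diving in: loop bounds and array
subscripts must be integer-linear expressions in the enclosing induction
variables and symbolic parameters. The following is a minimal illustrative
sketch in C++ (not MLIR, and not from the original proposal) contrasting a nest
that fits this closed algebra with one that does not:

```c++
#include <vector>

// Affine: every bound and subscript is an integer-linear expression in
// {i, j, N} plus a constant, so the nest stays inside the closed algebra
// and could be represented as an mlfunc. Callers pass A with N non-empty
// rows and B sized to cover the largest subscript.
void affineKernel(int N, std::vector<std::vector<float>> &A,
                  const std::vector<float> &B) {
  for (int i = 0; i < N; ++i)
    for (int j = i; j < N; ++j)    // affine bound: i <= j < N
      A[i][j] = B[i + 2 * j];      // affine subscript: i + 2*j
}

// Not affine: the inner bound depends on loaded data and the subscript
// multiplies two induction variables, so this nest would have to stay in
// CFG function form.
void nonAffineKernel(int N, std::vector<std::vector<float>> &A,
                     const std::vector<int> &bounds) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < bounds[i]; ++j)  // data-dependent bound
      A[(i * j) % N][0] += 1.0f;         // non-affine subscript: i * j
}
```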
## Context: Traditional Polyhedral Form

We started by discussing a representation that uses the traditional polyhedral
schedule set + domain representation, e.g. consider C-like code like:

```c
void simple_example(...) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      float tmp = X[i,j]    // S1
      A[i,j] = tmp + 1      // S2
      B[i,j] = tmp * 42     // S3
    }
  }
}
```

The polyhedral representation doesn't care about the actual computation, so we
abstract the three statements into S1/S2/S3 in the discussion below. Originally,
we planned to represent this with a classical form like (syntax details are not
important and probably slightly incorrect below):

```mlir
mlfunc @simple_example(... %N) {
  %tmp = call @S1(%X, %i, %j)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j, 0)

  call @S2(%tmp, %A, %i, %j)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j, 1)

  call @S3(%tmp, %B, %i, %j)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j, 2)
}
```

In this design, an mlfunc is an unordered bag of instructions whose execution
order is fully controlled by their schedules.

However, we recently agreed that a more explicit schedule tree representation is
a better fit for our needs, because it exposes important structure that will
make analyses and optimizations more efficient, and also makes the scoping of
SSA values more explicit. This leads us to a representation along the lines of:

```mlir
mlfunc @simple_example(... %N) {
  d0/d1 = mlspace
  for S1(d0), S2(d0), S3(d0) {
    for S1(d1), S2(d1), S3(d1) {

      %tmp = call @S1(%X, d0, d1)      ;; S1
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S2(%tmp, %A, d0, d1)       ;; S2
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S3(%tmp, %B, d0, d1)       ;; S3
        domain: (0 <= d0 < %N), (0 <= d1 < %N)
    }
  }
}
```

This change makes the nesting structure of the loops an explicit part of the
representation, and makes lexical ordering within a loop significant
(eliminating the constant 0/1/2 trailing components of the schedules).

It isn't obvious in the example above, but the representation allows for some
interesting features, including the ability for instructions within a loop nest
to have non-equal domains, like this example where the second instruction
ignores the outermost 10 points in each dimension of the loop nest:

```mlir
mlfunc @reduced_domain_example(... %N) {
  d0/d1 = mlspace
  for S1(d0), S2(d0) {
    for S1(d1), S2(d1) {
      %tmp = call @S1(%X, d0, d1)      ;; S1
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S2(%tmp, %A, d0, d1)       ;; S2
        domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
    }
  }
}
```

It also allows schedule remapping within the instruction, like this example that
introduces a diagonal skew through a simple change to the schedules of the two
instructions:

```mlir
mlfunc @skewed_domain_example(... %N) {
  d0/d1 = mlspace
  for S1(d0), S2(d0+d1) {
    for S1(d0+d1), S2(d1) {
      %tmp = call @S1(%X, d0, d1)      ;; S1
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S2(%tmp, %A, d0, d1)       ;; S2
        domain: (0 <= d0 < %N), (0 <= d1 < %N)
    }
  }
}
```

This form has great power, and the polyhedral code generator (which lowers from
an mlfunc to a cfgfunc representation) handles this power, so passes that
introduce loop transformations don't have to explicitly manipulate the looping
structure.
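The phrase "unordered bag of instructions" above has a direct operational
reading: code generation materializes the execution order by sorting every
statement instance by its schedule vector, lexicographically. Below is a small
runnable C++ sketch of that idea (illustrative only; the statement names and
(i, j, k) schedules mirror `@simple_example` above):

```c++
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

int main() {
  const int N = 2;

  // One entry per statement instance: its schedule vector and identity.
  struct Instance {
    std::array<int, 3> sched;  // schedule vector, e.g. (i, j, 0) for S1
    int stmt;                  // 1/2/3 for S1/S2/S3
    int i, j;                  // the domain point it executes on
  };
  std::vector<Instance> bag;

  // Build the bag in a deliberately scrambled (statement-major) order to
  // emphasize that textual order carries no meaning in this form.
  for (int k = 2; k >= 0; --k)
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        bag.push_back({{i, j, k}, k + 1, i, j});

  // The essence of codegen for this form: execution order is exactly the
  // lexicographic order of the schedule vectors, nothing else.
  std::sort(bag.begin(), bag.end(), [](const Instance &a, const Instance &b) {
    return a.sched < b.sched;
  });

  for (const Instance &inst : bag)
    std::printf("S%d(i=%d, j=%d)\n", inst.stmt, inst.i, inst.j);
  // Prints S1, S2, S3 at (0,0), then (0,1), (1,0), (1,1): the original
  // loop nest order. A skew would change only the schedule vectors.
}
```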
## Proposal: Simplified Polyhedral Form

This document proposes and explores the idea of going one step further, moving
all of the domain and schedule information into the "schedule tree". In this
form, we would have a representation where all instructions inside of a given
for-loop are known to have the same domain, which is maintained by the loop. In
the simplified form, we also have an "if" instruction that takes an affine
condition.

Our simple example above would be represented as:

```mlir
mlfunc @simple_example(... %N) {
  affine.for %i = 0 ... %N step 1 {
    affine.for %j = 0 ... %N step 1 {
      // identity noop in this case, but can exist in general.
      %0,%1 = affine.apply #57(%i, %j)

      %tmp = call @S1(%X, %0, %1)

      call @S2(%tmp, %A, %0, %1)

      call @S3(%tmp, %B, %0, %1)
    }
  }
}
```

The example with the reduced domain would be represented with an if instruction:

```mlir
mlfunc @reduced_domain_example(... %N) {
  affine.for %i = 0 ... %N step 1 {
    affine.for %j = 0 ... %N step 1 {
      // identity noop in this case, but can exist in general.
      %0,%1 = affine.apply #57(%i, %j)

      %tmp = call @S1(%X, %0, %1)

      if (10 <= %i < %N-10), (10 <= %j < %N-10) {

        %2,%3 = affine.apply #57(%i, %j)  // identity noop in this case

        call @S2(%tmp, %A, %2, %3)
      }
    }
  }
}
```

These IRs represent exactly the same information, and use a similar information
density. The 'traditional' form introduces an extra level of abstraction
(schedules and domains) that makes it easy to transform instructions, at the
expense of making it difficult to reason about how those instructions will come
out after code generation. With the simplified form, transformations have to do
parts of code generation inline with their transformation: instead of simply
changing a schedule to **(i+j, j)** to get skewing, you'd have to generate this
code explicitly (potentially implemented by making polyhedral codegen a library
that transformations call into):

```mlir
mlfunc @skewed_domain_example(... %N) {
  affine.for %t1 = 0 ... 2*N-1 step 1 {
    affine.for %t2 = max(0, t1-N+1) ... min(N, t1+1) step 1 {
      (%i, %j) = (%t1-%t2, %t2)
      ...
    }
  }
}
```

(These bounds are double-checked numerically in a short sketch below.)

## Evaluation

Both of these forms are capable of expressing the same class of computation:
multidimensional loop nests with affine loop bounds and affine memory
references. That said, they pose very different tradeoffs in other ways.

### Commonality: can express same computation

Both of these forms can express the same sorts of computation, i.e. kernels
written in one form are representable in the other form in all cases.

### Commonality: dependence analysis

These representations both use affine functions for data layout mapping and
access subscripts, and dependence analysis works the same way on both.

### Commonality: difficulty of determining optimal transformation series

One major challenge in optimizing this sort of code is choosing the ordering
and behavior of the various loop transformations that get applied. Every
decision has non-local effects, and neither representation helps solve this
inherently hard problem.
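As promised above, here is a small sanity check of the skewed bounds in
`@skewed_domain_example`, reading `...` as an exclusive upper bound: the skewed
nest must visit exactly the points `0 <= i, j < N`, each exactly once. This is
an illustrative C++ sketch that checks one concrete N numerically rather than
proving the bounds symbolically:

```c++
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>

int main() {
  const int N = 7;  // any N >= 1 exercises the bounds the same way

  // The original rectangular iteration space: 0 <= i, j < N.
  std::set<std::pair<int, int>> original;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      original.insert({i, j});

  // The skewed nest, with the bounds a transformation would have to
  // compute inline: t1 = i + j, t2 = j.
  std::set<std::pair<int, int>> skewed;
  for (int t1 = 0; t1 < 2 * N - 1; ++t1)
    for (int t2 = std::max(0, t1 - N + 1); t2 < std::min(N, t1 + 1); ++t2) {
      int i = t1 - t2, j = t2;  // recover the original indices
      bool fresh = skewed.insert({i, j}).second;
      assert(fresh && "each point must be visited exactly once");
    }

  assert(original == skewed && "same iteration space, skewed order");
}
```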
### Commonality: compactness of IR

In the cases that are most relevant to us (hyper-rectangular spaces) these forms
are directly equivalent: a traditional instruction with a restricted domain (e.g.
the "reduced_domain_example" above) ends up having one level of ML 'if' inside
its loops, and the simplified form makes up for this by eliminating schedules
and domains from the IR. Both forms allow code duplication to reduce dynamic
branches in the IR: the traditional approach allows instruction splitting, the
simplified form supports instruction duplication.

It is important to point out that the traditional form wins on compactness in
the extreme cases: e.g. the loop skewing case. These cases will be rare in
practice for our workloads, and they are exactly the cases where downstream
transformations want the IR to be explicit about what is happening.

### Simplicity of code generation

A key final stage of an mlfunc is its conversion to a CFG function, which is
required as part of lowering to the target machine. The simplified form has a
clear advantage here: the IR has a direct correspondence to the structure of the
generated code.

In contrast, the traditional form has significant complexity in the lowering
process to a CFG function, because the loop structure that the IR leaves
implicit has to be materialized during code generation. Code generation from ISL
shows that it is possible to do this, but it is a non-trivial transformation.

### Ease of transformation

An advantage of the traditional form is that certain transformations are easier
to perform on it: skewing and tiling are just changes to the schedules of the
instructions in question; they don't require changing the loop structure.

In practice, the simplified form requires moving the complexity of code
generation into the transformations themselves - this is sometimes trivial,
sometimes involved. The author believes that this should be possible by making
the code generation algorithms themselves library functions that
transformations call into, instead of an opaque block that happens at the end of
mlfunc processing.

Also, the sorts of transformations performed today by XLA (including tiling,
padding, unrolling, and other rectangular transformations) should be easy enough
to implement on either representation. The only challenging cases are the more
advanced ones, like skewing, e.g. for DMA data movement generation.

### Ease of analysis: Cost models

The simplified form is much easier for analyses and transformations to build
cost models for (e.g. answering the question of "how much code bloat will be
caused by unrolling a loop at this level?"), because it is easier to predict
what target code will be generated. With the traditional form, these analyses
have to anticipate what polyhedral codegen will do to the set of instructions
under consideration: something that is non-trivial in the interesting cases in
question (see "Cost of code generation").

### Cost of code generation

State of the art polyhedral code generation is
[expensive and complicated](https://lirias.kuleuven.be/bitstream/123456789/497238/1/toplas-astgen.pdf),
sometimes with exponential time complexity. We expect that most machine learning
workloads will be hyper-rectangular, and thus it should be easy to specialize
code generation for these important cases.
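To see why the hyper-rectangular case is cheap to specialize: when every domain
is a product of per-dimension intervals, lowering reduces to a recursive walk
that emits one loop per dimension, with none of the general polyhedral scanning
machinery involved. A hypothetical sketch (the `Interval` and `emitLoops` names
are inventions for illustration; it prints the loops it would generate):

```c++
#include <cstdio>
#include <vector>

// Half-open per-dimension bounds: lb <= iv < ub.
struct Interval {
  int lb, ub;
};

// Emit (here: print) one loop per dimension, innermost last. This is the
// entire "code generation" problem for a hyper-rectangular domain; the
// exponential cases of general polyhedral scanning never arise.
void emitLoops(const std::vector<Interval> &domain, size_t d, int indent) {
  if (d == domain.size()) {
    std::printf("%*s<statement body>\n", indent, "");
    return;
  }
  std::printf("%*sfor (int i%zu = %d; i%zu < %d; ++i%zu) {\n", indent, "", d,
              domain[d].lb, d, domain[d].ub, d);
  emitLoops(domain, d + 1, indent + 2);
  std::printf("%*s}\n", indent, "");
}

int main() {
  // A 2-D hyper-rectangular domain: 0 <= i0 < 128, 4 <= i1 < 60.
  emitLoops({{0, 128}, {4, 60}}, 0, 0);
}
```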
That said, the traditional polyhedral representation makes it very easy to
introduce complicated and expensive schedules, and provides no way to understand
and project a cost model for using them. All downstream clients of the IR need
to be prepared to handle the full generality of IR that may come to them.

The simplified form defines this problem away: the concepts in the IR remain
simple, and the code much more directly reflects the cost model for lowering to
CFG functions and machine code. This is expected to be very important in the
late stages of a code generator for an accelerator.

### SSA in ML Functions

We already agree that values defined in an mlfunc can include scalar values, and
that they are defined based on traditional dominance. In the simplified form,
this is very simple: arguments and induction variables defined in for-loops are
live inside their lexical body, and linear series of instructions have the same
"top down" dominance relation that a basic block does.

In the traditional form though, this is not the case: it seems that a lot of
knowledge about how codegen will emit the code is necessary to determine if SSA
form is correct or not. For example, this is invalid code:

```mlir
%tmp = call @S1(%X, %0, %1)
  domain: (10 <= %i < %N), (0 <= %j < %N)
  schedule: (i, j)

call @S2(%tmp, %A, %0, %1)
  domain: (0 <= %i < %N), (0 <= %j < %N)
  schedule: (i, j)
```

It is invalid because `%tmp` isn't defined on the iterations `0 <= %i < 10` of
the `%i` loop, on which S2 nonetheless executes and reads it.

This matters because it makes the verifier more complicated, but more
significantly, it means that load promotion and other optimizations that
produce SSA form will need to be aware of this and be able to model what codegen
does.

An emergent property of this that we discussed recently is that PHI nodes in
mlfunc's (if we support them) will also have to have domains.

### Lack of redundancy in IR

The traditional form has multiple encodings for the same sorts of behavior: you
end up having bits on `affine.for` loops to specify whether codegen should use
"atomic/separate" policies, unroll loops, etc. Instructions can be split, or can
generate multiple copies of themselves, because of overlapping domains, etc.

This is a problem for analyses and cost models, because they each have to reason
about these additional forms in the IR.

### Suitability to purpose: lowering to machine code

One of the main drivers for this work is lowering to low-level accelerator code,
including two-dimensional vectorization, insertion of DMAs, and other
utilization of the matrix accelerator units. In the author's opinion, the extra
compactness of the traditional form is a negative for this purpose: reasoning
about the generated machine code requires understanding the mapping from an
mlfunc to the lowered code, which means that any such reasoning must model what
code generation will do.

In the simplified form, the effect of "code generation" is always obvious from
the IR itself, which should make it easier to perform vectorization to target
instructions and other analyses we need to perform.

## Third Alternative: two different levels of mlfunc

One hybrid alternative is to support both the traditional and simplified forms
of mlfunc in our IR.

The stages could look like this, for example:

1. Early performance transformations could be done on the traditional form.
1. Partial code generation lowers to the simplified form.
1. Target-specific lowering phases for tiling, vectorization, and other 2D
   transforms that don't benefit much from the traditional form could be run.
1. Final codegen to a cfg func can be done when all of the instructions are
   replaced with ones valid on the target.

While this is possible, it isn't clear what would justify the complexity of this
approach. Unless there is a super compelling reason for this, it would be nice
to not do this. **Update:** we discussed this as a design team and agreed that
this wouldn't be a good way to go.