1# MLIR Rationale 2 3This document is intended to capture some of the alternatives considered and 4open debates in the design of MLIR, along with the rationale for certain 5decisions we made. This is not intended to be a "finely groomed" document - we 6prefer the ability to dump in interesting tidbits without worrying too much 7about their consistency or readability. 8 9[TOC] 10 11## Abstract 12 13MLIR is a compiler intermediate representation with similarities to traditional 14three-address SSA representations (like 15[LLVM IR](http://llvm.org/docs/LangRef.html) or 16[SIL](https://github.com/apple/swift/blob/master/docs/SIL.rst)), but which 17introduces notions from the polyhedral loop optimization works as first class 18concepts. This hybrid design is optimized to represent, analyze, and transform 19high level dataflow graphs as well as target-specific code generated for high 20performance data parallel systems. Beyond its representational capabilities, its 21single continuous design provides a framework to lower from dataflow graphs to 22high performance target specific code. 23 24MLIR stands for one of "Multi-Level IR" or "Multi-dimensional Loop IR" or 25"Machine Learning IR" or "Mid Level IR", we prefer the first. This document only 26provides the rationale behind MLIR -- its actual 27[specification document](../LangRef.md) and other content is hosted elsewhere. 28 29## Introduction and Motivation 30 31The Multi-Level Intermediate Representation (MLIR) is intended for easy 32expression and optimization of computations involving deep loop nests and dense 33matrices of high dimensionality. It is thus well-suited to deep learning 34computations in particular. Yet it is general enough to also represent arbitrary 35sequential computation. The representation allows high-level optimization and 36parallelization for a wide range of parallel architectures including those with 37deep memory hierarchies --- general-purpose multicores, GPUs, and specialized 38neural network accelerators. 39 40MLIR uses ideas drawn from IRs of LLVM and Swift for lower level constructs 41while combining them with ideas from the polyhedral abstraction to represent 42loop nests, multidimensional data (tensors), and transformations on these 43entities as first class concepts in the IR. 44 45MLIR is a multi-level IR, i.e., it represents code at a domain-specific 46representation such as HLO or TensorFlow graphs, all the way down to the machine 47level. MLIR is able to represent arbitrary control flow and arbitrary data 48accesses, and is general enough to represent nearly all sequential computation. 49This is a key distinction from existing polyhedral representation 50implementations (such as LLVM [Polly](https://polly.llvm.org/)) that are able to 51use the polyhedral abstraction in a way isolated from the LLVM IR and only for 52affine loop nests, i.e., portions of the code where array accesses, loop bounds, 53and conditionals are regular (involve linear functions of loop iterators and 54constant symbols). The presence of statically unpredictable data accesses or 55control flow does not preclude representation in MLIR, but only limits to a 56certain extent the ability to reason about and apply transformations using the 57polyhedral abstraction. 58 59Maps, sets, and relations with affine constraints are the core structures 60underlying a polyhedral representation of high-dimensional loop nests and 61multidimensional arrays. These structures are represented as textual 62expressions in a form close to their mathematical form. These structures are 63used to capture loop nests, tensor data structures, and how they are reordered 64and mapped for a target architecture. All structured or "conforming" loops are 65captured as part of the polyhedral information, and so are tensor variables, 66their layouts, and subscripted accesses to these tensors in memory. 67 68The information captured in the IR allows a compact expression of all loop 69transformations, data remappings, explicit copying necessary for explicitly 70addressed memory in accelerators, mapping to pre-tuned expert-written 71primitives, and mapping to specialized vector instructions. Loop transformations 72that can be easily implemented include the body of affine transformations: these 73subsume all traditional loop transformations (unimodular and non-unimodular) 74such as loop tiling, interchange, permutation, skewing, scaling, relative 75shifting, reversal, fusion, and distribution/fission. Transformations on data 76layout such as padding and transforming to blocked layouts are also represented 77well via affine layout maps. 78 79MLIR's design allows a progressive lowering to target-specific forms. Besides 80high-level transformations for loop nests and data layouts that a typical 81mid-level optimizer is expected to deal with, MLIR is also designed to perform 82certain low-level scheduling and mapping decisions that a typical backend IR is 83entrusted with: these include mapping to specialized vector instructions, 84auto-vectorization, and software pipelining. The need to support these 85transformations stems from the fact that neural network accelerators have 86specialized units that deal with large chunks of data whose computation maps 87back to chunks of more than one loop of the loop nests as viewed by a program at 88a level closer to the original specification. Such specialized units or 89instructions operate on multidimensional data chunks from a programmer's 90viewpoint. It thus makes it hard or infeasible for a backend operating on a very 91low-level IR close to assembly to lift and reconstruct loops and perform such a 92mapping. This is in contrast to classic instruction selection and scheduling in 93today's compilers that primarily only deals with the body of the innermost loop. 94MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor 95libraries operating on data at higher levels (or at the highest level) of the 96memory hierarchy. 97 98In summary, MLIR is convenient for and closed under the kind of transformations 99needed to lower to general-purpose as well as specialized accelerators. It also 100allows one to build modular and reusable target independent and target dependent 101passes. 102 103## Design Decisions 104 105This section sheds light on some of the design decisions -- some of these are 106indirectly implied by the specification document. 107 108### Loads and stores 109 110The 'load' and 'store' instructions are specifically crafted to fully resolve to 111an element of a memref. These instructions take as arguments n+1 indices for an 112n-ranked tensor. This disallows the equivalent of pointer arithmetic or the 113ability to index into the same memref in other ways (something which C arrays 114allow for example). Furthermore, for the affine constructs, the compiler can 115follow use-def chains (e.g. through 116[affine.apply operations](../Dialects/Affine.md/#affineapply-affineapplyop)) or through 117the map attributes of [affine operations](../Dialects/Affine.md/#operations)) to 118precisely analyze references at compile-time using polyhedral techniques. This 119is possible because of the [restrictions on dimensions and symbols](../Dialects/Affine.md/#restrictions-on-dimensions-and-symbols). 120 121A scalar of element-type (a primitive type or a vector type) that is stored in 122memory is modeled as a 0-d memref. This is also necessary for scalars that are 123live out of for loops and if conditionals in a function, for which we don't yet 124have an SSA representation -- 125[an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to allow that is 126described later in this doc. 127 128### Symbols and types 129 130The current MLIR disallows use of symbols in types. For example, when a tensor 131or memref dimension is statically unknown, it is denoted in the type as '?'. An 132SSA symbol is then bound to it when a memref is created. The actual value of the 133unknown dimension can be queried using the "dim" builtin as shown below. 134 135Example: 136 137```mlir 138func foo(...) { 139 %A = alloc <8x?xf32, #lmap> (%N) 140 ... 141 call bar(%A) : (memref<8x?xf32, #lmap>) 142} 143 144func bar(%A : memref<8x?xf32, #lmap>) { 145 // Type of %A indicates that %A has dynamic shape with 8 rows 146 // and unknown number of columns. The number of columns is queried 147 // dynamically using dim instruction. 148 %N = dim %A, 1 : memref<8x?xf32, #lmap> 149 150 affine.for %i = 0 to 8 { 151 affine.for %j = 0 to %N { 152 // A[i,j] += 1 153 %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap> 154 %s2 = add %s1, 1 155 affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap> 156 } 157 } 158 return 159} 160 161``` 162 163An alternative design is to embed the reference to symbols directly in the 164type - memref<8x%Nxf32>. We went for the current approach in MLIR because it 165simplifies the design --- types remain immutable when the values of symbols 166change. 167 168### Block Arguments vs PHI nodes 169 170MLIR Regions represent SSA using "[block arguments](../LangRef.md/#blocks)" rather 171than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in LLVM. 172This choice is representationally identical (the same constructs can be 173represented in either form) but block arguments have several advantages: 174 1751. LLVM PHI nodes always have to be kept at the top of a block, and 176 transformations frequently have to manually skip over them. This is defined 177 away with BB arguments. 1781. LLVM has a separate function Argument node. This is defined away with BB 179 arguments, because the arguments to the entry block serve this purpose. 1801. Blocks of PHI nodes in LLVM execute atomically, which is surprising and 181 super confusing to compiler engineers and it is easy to introduce bugs with 182 this (very related to the 183 "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)" 184 problem in SSA lowering literature.) With the BB argument representation, 185 this confusion is defined away. 1861. The entry list of PHI nodes in LLVM are unordered, and some blocks have 187 thousands of predecessors (e.g. unwind blocks). This can cause long compile 188 time problems because transformations have to linearly scan this list. This 189 is defined away with BB argument representation. 1901. LLVM has no way to represent values that are available only in one successor 191 but not the other, e.g. its invoke instruction cannot produce the exception 192 value JUST on the exception edge. Instead, the 193 [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction) 194 is a hack used to represent this. MLIR doesn't make use of this capability, 195 but SIL uses it extensively, e.g. in the 196 [switch_enum instruction](https://github.com/apple/swift/blob/master/docs/SIL.rst#switch-enum). 197 198For more context, block arguments were previously used in the Swift 199[SIL Intermediate Representation](https://github.com/apple/swift/blob/master/docs/SIL.rst), 200and described in 201[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of 202interest 203[starts here](https://www.google.com/url?q=https://youtu.be/Ntj8ab-5cvE?t%3D596&sa=D&ust=1529450150971000&usg=AFQjCNFQHEWL7m8q3eO-1DiKw9zqC2v24Q). 204 205### Index type usage and limitations 206 207Index types are intended to be used for platform-specific "size" values and may 208appear in subscripts, sizes of aggregate types and affine expressions. They are 209also tightly coupled with `affine.apply` and affine.load/store operations; 210having `index` type is a necessary precondition of a value to be acceptable by 211these operations. 212 213We allow `index` types in tensors, vectors, and memrefs as a code generation 214strategy has to map `index` to an implementation type and hence needs to be able 215to materialize corresponding values. However, the target might lack support for 216`vector` values with the target specific equivalent of the `index` type. 217 218### Data layout of non-primitive types 219 220Data layout information such as the bit width or the alignment of types may be 221target and ABI-specific and thus should be configurable rather than imposed by 222the compiler. Especially, the layout of compound or `index` types may vary. MLIR 223specifies default bit widths for certain primitive _types_, in particular for 224integers and floats. It is equal to the number that appears in the type 225definition, e.g. the bit width of `i32` is `32`, so is the bit width of `f32`. 226The bit width is not _necessarily_ related to the amount of memory (in bytes) or 227the register size (in bits) that is necessary to store the value of the given 228type. For example, `vector<3xi57>` is likely to be lowered to a vector of four 22964-bit integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes, 230rather than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the 231bit width. MLIR makes such [data layout information](../DataLayout.md) 232configurable using attributes that can be queried during lowering, for example, 233when allocating a compound type. 234 235The data layout of dialect-specific types is undefined at MLIR level. Yet 236dialects are free to define their own quantities and make them available via the 237data layout infrastructure. 238 239### Integer signedness semantics 240 241Integers in the builtin MLIR type system have a bitwidth (note that the `index` 242type has a symbolic width equal to the machine word size), and they *may* 243additionally have signedness semantics. The purpose is to satisfy the needs of 244different dialects, which can model different levels of abstractions. Certain 245abstraction, especially closer to source language, might want to differentiate 246signedness with integer types; while others, especially closer to machine 247instruction, might want signless integers. Instead of forcing each abstraction 248to adopt the same integer modelling or develop its own one in house, Integer 249type provides this as an option to help code reuse and consistency. 250 251For the standard dialect, the choice is to have signless integer types. An 252integer value does not have an intrinsic sign, and it's up to the specific op 253for interpretation. For example, ops like `addi` and `muli` do two's complement 254arithmetic, but some other operations get a sign, e.g. `divis` vs `diviu`. 255 256LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type), 257which was introduced in a revamp rolled out 258[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived). 259Prior to that, from 260[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to 261[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM 262uses signed types like "sbyte" and "ubyte". This shift was important and has 263served LLVM well over the years. The reason this is important is that it is a 264good thing for an intermediate representation to represent the same computation 265with the same instruction. Signed types got in the way, because (e.g.) an "add 266of an sbyte" does the same computation as an "add of a ubyte", but the type 267system made them look artificially different. This split also required casts 268like "cast from sbyte to ubyte" which do nothing at the machine level. Removing 269signs from the type system eliminated these problems, making the compiler 270simpler. 271 272More information about this split is available in an old 273[talk on youtube](https://www.youtube.com/watch?v=VeRaLPupGks) talking about 274LLVM 2.0. 275 276Note that this rationale only applies to the "standard ops" dialect in which we 277can express an opinion about its design. Other dialects generally try to model 278an external system, and should aim to reflect its design as closely as possible. 279 280### Splitting floating point vs integer operations 281 282The MLIR "standard" operation set splits many integer and floating point 283operations into different categories, for example `addf` vs `addi` and `cmpf` vs 284`cmpi` 285([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)). 286These instructions _are_ polymorphic on the number of elements in the type 287though, for example `addf` is used with scalar floats, vectors of floats, and 288tensors of floats (LLVM does the same thing with its scalar/vector types). 289 290This split is important because floating point and integer operations are quite 291different in practice: for example, floating point values include NaN's, so 292[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and 293[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction) 294should use different comparison opcodes. On the arithmetic side of things, 295floating point operations support rounding modes, floating point contractions, 296["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers 297may want to have two's complement overflow behavior or be undefined on 298[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction) 299for performance. 300 301We are a long way from this sort of thing being a priority to care about in 302MLIR, but since we have experience and know the right way to do this, we'd 303rather design it in from the beginning. 304 305Note that this rationale only applies to the "standard ops" dialect in which we 306can express an opinion about its design. Other dialects generally try to model 307an external system, and should aim to reflect its design as closely as possible. 308 309### Specifying sign in integer comparison operations 310 311Since integers are [signless](#integer-signedness-semantics), it is necessary to define the 312sign for integer comparison operations. This sign indicates how to treat the 313foremost bit of the integer: as sign bit or as most significant bit. For 314example, comparing two `i4` values `0b1000` and `0b0010` yields different 315results for unsigned (`8 > 3`) and signed (`-8 < 3`) interpretations. This 316difference is only significant for _order_ comparisons, but not for _equality_ 317comparisons. Indeed, for the latter all bits must have the same value 318independently of the sign. Since both arguments have exactly the same bit width 319and cannot be padded by this operation, it is impossible to compare two values 320whose bit representations would differ while the values are interpreted as 321equal. 322 323### Specifying comparison kind as attribute 324 325Unlike arithmetic, comparison operators share several common properties, e.g. 326they cannot be considered associative. In practice, comparisons are sometimes 327implemented by the same instruction or its variants so it makes sense to group 328them together at the IR level. 329 330An alternative would be introducing ten distinct operators for all currently 331supported kinds of integer comparisons. These operators would have increased the 332number of "reserved" names used by standard operations as well as the size of 333the C++ API while their implementations would have been mostly identical. 334 335The comparison kind is internally an integer attribute. However, for the sake of 336readability by humans, custom assembly form accepts string literals that are 337mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies 338integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what 339gets compared to what else. This syntactic sugar is possible thanks to parser 340logic redefinitions for custom assembly form of non-builtin operations. 341Supporting it in the full notation would have required changing how the main 342parsing algorithm works and may have unexpected repercussions. While it had been 343possible to store the predicate as string attribute, it would have rendered 344impossible to implement switching logic based on the comparison kind and made 345attribute validity checks (one out of ten possible kinds) more complex. 346 347### 'select' operation to implement min/max 348 349Although `min` and `max` operations are likely to occur as a result of 350transforming affine loops in ML functions, we did not make them first-class 351operations. Instead, we provide the `select` operation that can be combined with 352`cmpi` to implement the minimum and maximum computation. Although they now 353require two operations, they are likely to be emitted automatically during the 354transformation inside MLIR. On the other hand, there are multiple benefits of 355introducing `select`: standalone min/max would concern themselves with the 356signedness of the comparison, already taken into account by `cmpi`; `select` can 357support floats transparently if used after a float-comparison operation; the 358lower-level targets provide `select`-like instructions making the translation 359trivial. 360 361This operation could have been implemented with additional control flow: `%r = 362select %cond, %t, %f` is equivalent to 363 364```mlir 365^bb0: 366 cond_br %cond, ^bb1(%t), ^bb1(%f) 367^bb1(%r): 368``` 369 370However, this control flow granularity is not available in the ML functions 371where min/max, and thus `select`, are likely to appear. In addition, simpler 372control flow may be beneficial for optimization in general. 373 374### Regions 375 376#### Attributes of type 'Block' 377 378We considered representing regions through `ArrayAttr`s containing a list of a 379special type `IRBlockAttr`, which in turn would contain a list of operations. 380All attributes in MLIR are unique’d within the context, which would make the IR 381inside the regions immortal for no good reason. 382 383#### Use "inlined" functions as regions 384 385We considered attaching a "force-inline" attribute on a function and/or a 386function `call` operation. Even the minimal region support (use cases in 387affine.for and affine.if existing before the regions) requires access to the 388values defined in the dominating block, which is not supported by functions. 389Conceptually, function bodies are instances of regions rather than the inverse; 390regions can also be device kernels, alternative sections, etc. 391 392#### Dedicated `region` operation 393 394This would mean we have a special kind of operation that is allowed to have 395regions while other operations are not. Such distinction is similar to the 396Stmt/Op difference we have had and chose to remove to make the IR simpler and 397more flexible. It would also require analyses and passes to consider the 398interplay between operations (e.g., an `affine.for` operation must be followed 399by a region operation). Finally, a region operation can be introduced using the 400current implementation, among other operations and without being special in any 401sense. 402 403#### Explicit capture of the values used in a region 404 405Being able to use values defined outside the region implies that use-def chains 406may contain uses from different nested regions. Consequently, IR transformations 407and analyses can pull the instruction defining the value across region 408boundaries, for example in case of TableGen-defined canonicalization patterns. 409This would not be the case if all used values had been passed as region 410arguments. One of the motivations for introducing regions in the IR is precisely 411to enable cross-region analyses and transformations that are simpler than 412inter-procedural transformations. Having uses from different regions appear in 413the same use-def chain, contrary to an additional data structure maintaining 414correspondence between function call arguments as uses of the original 415definitions and formal arguments as new definitions, enables such 416simplification. Since individual operations now belong to blocks, which belong 417to regions, it is always possible to check if the definition of the value 418belongs to the same region as its particular use. The risk is that any IR 419traversal will need to handle explicitly this situation and it is easy to forget 420a check (or conversely it isn’t easy to design the right check in a tablegen 421pattern for example): traversing use-def chains potentially crosses implicitly 422semantic barriers, making it possible to unknowingly break region semantics. 423This is expected to be caught in the verifier after the transformation. 424 425At the same time, one may choose to pass certain or all values as region 426arguments to explicitly break the use-def chains in the current proposal. This 427can be combined with an attribute-imposed semantic requirement disallowing the 428body of the region to refer to any value from outside it. 429 430### Dialect type extensions 431 432This section describes the design decisions that shaped the dialect extensible 433type system present in MLIR. 434 435#### Interactions between dialects 436 437There are two different interactions between dialects that are important to 438understand. When types of a dialect are: 439 440* In operations of other dialects 441 442 - For standard/builtin operations, only builtin types are allowed. This 443 restriction allows for operations to clearly understand the invariants 444 that they are working under. 445 - Outside of standard/builtin operations, dialects are expected to verify 446 the allowable operation types per operation. 447 448* In types of other dialects 449 450 - For builtin types, these types are allowed to contain types from other 451 dialects. This simplifies the type system and removes the need for 452 dialects to redefine all of the builtin aggregate types, e.g. tensor, as 453 well as the memref type. Dialects are expected to verify that a specific 454 type is valid within a builtin type, e.g. if a type can be an element of 455 a tensor. 456 - For dialect types, the dialect is expected to verify any type 457 invariants, e.g. if the tensor type can contain a specific type of that 458 dialect. 459 460#### Separating builtin and standard types 461 462Following the separation between the built-in and standard dialect, it makes 463sense to separate built-in types and standard dialect types. Built-in types are 464required for the validity of the IR itself, e.g. the function type (which 465appears in function signatures and generic assembly forms of operations). 466Integer, float, vector, memref and tensor types, while important, are not 467necessary for IR validity. 468 469#### Unregistered types 470 471MLIR supports unregistered operations in generic assembly form. MLIR also 472supports a similar concept for types. When parsing, if the dialect for dialect 473type has not been registered the type is modeled as an 'OpaqueType'. This allows 474for types to be round-tripped without needing to link in the dialect library 475that defined them. No additional information about opaque types, outside of 476parsing/printing, will be available. 477 478#### Dialect type syntax 479 480Dialect extended types are represented as string literals wrapped inside of the 481dialect namespace. This means that the parser delegates to the dialect for 482parsing specific type instances. This differs from the representation of dialect 483defined operations, of which have an identifier name that the parser uses to 484identify and parse them. 485 486This representation was chosen for several reasons: 487 488##### Dialects must provide custom type parsers 489 490Dialect type parsing cannot plug into the existing parser infrastructure as 491operations do with the OpAsmParser/Printer. Operations have a defined syntax 492structure that is the same across all dialects. Types, on the other hand, may 493have many different, and sometimes conflicting, parsing constraints that would 494be difficult/unmaintainable to provide within a single interface. 495 496This also has the added benefit of encouraging dialects to reuse existing 497external type parsers. For example, an LLVM dialect may provide an MLIR LLVM 498type that is simply a wrapper around LLVM types. The LLVM dialect would then use 499the existing LLVM type parsing infrastructure. 500 501Example: 502 503```mlir 504%s = "foo"() : () -> !llvm<"i32*"> 505``` 506 507##### Types do not always have canonical names 508 509Unlike operations, types generally do not have a formal canonical name. For 510example, function types have no defined keyword and integer types are defined by 511a regular expression to support arbitrary bitwidth. Dialects with existing type 512systems, e.g. LLVM, are likely to provide wrappers around their existing type 513systems. For these wrapper types there is no simple canonical name, it's logical 514to think of these types as existing within the namespace of the dialect. If a 515dialect wishes to assign a canonical name to a type, it can be done via 516[type aliases](../LangRef.md/#type-aliases). 517 518### Tuple types 519 520The MLIR type system provides first class support for defining 521[tuple types](../Dialects/Builtin/#tupletype). This is due to the fact that `Tuple` 522represents a universal concept that is likely to, and has already begun to, 523present itself in many different dialects. Though this type is first class in 524the type system, it merely serves to provide a common mechanism in which to 525represent this concept in MLIR. As such, MLIR provides no standard operations 526for interfacing with `tuple` types. It is up to dialect authors to provide 527operations, e.g. extract_tuple_element, to interpret and manipulate them. When 528possible, operations should prefer to use multiple results instead. These 529provide a myriad of benefits, such as alleviating any need for tuple-extract 530operations that merely get in the way of analysis and transformation. 531 532### Assembly forms 533 534MLIR decides to support both generic and custom assembly forms under the 535following considerations: 536 537MLIR is an open system; it is designed to support modular and pluggable 538dialects. Depending on whether there exists a corresponding dialect and whether 539the dialect is plugged in, operations may or may not be registered into MLIR 540system. Yet we still need a way to investigate these operations. So the generic 541assembly form is mandated by this aspect of MLIR system. It provides a default 542textual form for operations. 543 544On the other hand, an assembly form is for assisting developers to investigate 545the IR. The generic form serves as a safe fallback but it can be too verbose for 546certain ops. Therefore, MLIR gives each dialect the choice to define a custom 547assembly form for each operation according to the operation's semantics and 548specific needs. The custom assembly form can de-duplicate information from the 549operation to derive a more concise form, thus better facilitating the 550comprehension of the IR. 551 552## Examples 553 554This section describes a few very simple examples that help understand how MLIR 555represents computation. 556 557### Non-affine control flow 558 559```mlir 560// A simple linear search in every row of a matrix 561for (i = 0; i < N; i++) { 562 for (j = 0; j < N; j++) { 563 // dynamic control flow 564 if (a[i][j] == key) { 565 s[i] = j; 566 break; 567 } 568 } 569} 570``` 571 572The presence of dynamic control flow leads to an inner non-affine function 573nested in an outer function that uses affine loops. 574 575```mlir 576func @search(%A: memref<?x?xi32>, %S: <?xi32>, %key : i32) { 577 %ni = dim %A, 0 : memref<?x?xi32> 578 // This loop can be parallelized 579 affine.for %i = 0 to %ni { 580 call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32) 581 } 582 return 583} 584 585func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i : i32) { 586 %nj = dim %A, 1 : memref<?x?xi32> 587 br ^bb1(0) 588 589^bb1(%j: i32) 590 %p1 = cmpi "lt", %j, %nj : i32 591 cond_br %p1, ^bb2, ^bb5 592 593^bb2: 594 %v = affine.load %A[%i, %j] : memref<?x?xi32> 595 %p2 = cmpi "eq", %v, %key : i32 596 cond_br %p2, ^bb3(%j), ^bb4 597 598^bb3(%j: i32) 599 affine.store %j, %S[%i] : memref<?xi32> 600 br ^bb5 601 602^bb4: 603 %jinc = addi %j, 1 : i32 604 br ^bb1(%jinc) 605 606^bb5: 607 return 608} 609``` 610 611As per the [MLIR spec](../LangRef.md), the restrictions on dimensions and symbol 612identifiers to be used with the affine.apply operation only apply to accesses 613inside `affine.for` and `affine.if` operations. However, an analysis of accesses 614inside the called function (`@search_body`) is necessary to determine if the 615`%i` loop could be parallelized: such function access analysis is calling 616context sensitive. 617 618### Non-affine loop bounds 619 620Loop bounds that are not affine lead to a nesting of functions as shown below. 621 622```c 623for (i = 0; i < N; i++) 624 for (j = 0; j < N; j++) 625 // Non-affine loop bound for k loop. 626 for (k = 0; k < pow(2, j); k++) 627 for (l = 0; l < N; l++) { 628 // block loop body 629 ... 630 } 631``` 632 633```mlir 634func @outer_nest(%n : index) { 635 affine.for %i = 0 to %n { 636 affine.for %j = 0 to %n { 637 %pow = call @pow(2, %j) : (index, index) -> index 638 call @inner_nest(%pow, %n) : ... 639 } 640 } 641 return 642} 643 644func @inner_nest(%m : index, %n : index) { 645 affine.for %k = 0 to %m { 646 affine.for %l = 0 to %n { 647 ... 648 } 649 } 650 return 651} 652``` 653 654### Reference 2D Convolution 655 656The following example illustrates a reference implementation of a 2D 657convolution, which uses an integer set `#domain` to represent valid input data 658in a dilated convolution. 659 660```mlir 661// Dilation factors S0 and S1 can be constant folded if constant at compile time. 662#domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0, 663 S3 - d0 - 1 >= 0, S4 - d1 - 1 >= 0) 664// Identity map (shown here for illustration). 665#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6) 666 667// Affine map from output to input coordinate space. 668// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w 669// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation 670// S4 = h_pad_low, S5 = w_pad_low 671// %out0 = %0#1 * %h_stride + %0#4 * %h_kernel_dilation - %h_pad_low 672// %out1= %0#2 * %w_stride + %0#5 * %w_kernel_dilation - %w_pad_low 673#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - %S4) 674#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - %S5) 675 676// Semi-affine map to undilated input coordinate space. 677// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation. 678#map2_0 = (d0, d1) [S0, S1] -> (d0 / S0) 679#map2_1 = (d0, d1) [S0, S1] -> (d1 / S1) 680 681// Conv2D shapes: 682// input: [batch, input_height, input_width, input_feature] 683// kernel: [kernel_height, kernel_width, input_feature, output_feature] 684// output: [batch, output_height, output_width, output_feature] 685func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>, 686 %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>, 687 %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) { 688 affine.for %b = 0 to %batch { 689 affine.for %oh = 0 to %output_height { 690 affine.for %ow = 0 to %output_width { 691 affine.for %of = 0 to %output_feature { 692 affine.for %kh = 0 to %kernel_height { 693 affine.for %kw = 0 to %kernel_width { 694 affine.for %if = 0 to %input_feature { 695 // Calculate input indices. 696 %1_0 = affine.apply #map1_0 (%0#1, %0#2, %0#4, %0#5) 697 [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation, 698 %h_pad_low, %w_pad_low] 699 %1_1 = affine.apply #map1_1 (%0#1, %0#2, %0#4, %0#5) 700 [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation, 701 %h_pad_low, %w_pad_low] 702 703 // Check if access is not in padding. 704 affine.if #domain(%1_0, %1_1) 705 [%h_base_dilation, %w_kernel_dilation, %h_bound, %w_bound] { 706 %2_0 = affine.apply #map2 (%1_0, %1_1) 707 %2_1 = affine.apply #map2 (%1_0, %1_1) 708 // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices] 709 call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1) 710 } 711 } 712 } 713 } 714 } 715 } 716 } 717 } 718 return 719} 720``` 721 722TODO: (Add more examples showing the IR for a variety of interesting cases) 723 724## Design alternatives and extensions 725 726This is a list of some design alternatives and extensions that we discussed in 727detail but did not include in the spec or postponed them for future 728consideration on demand. We will revisit these discussions when we have more 729implementation experience and learn more about the challenges and limitations of 730our current design in practice. 731 732### Polyhedral code representation alternatives: schedule lists vs schedules trees vs affine loop/if forms 733 734The current MLIR uses a representation of polyhedral schedules using a tree of 735if/for loops. We extensively debated the tradeoffs involved in the typical 736unordered polyhedral instruction representation (where each instruction has 737multidimensional schedule information), discussed the benefits of schedule tree 738forms, and eventually decided to go with a syntactic tree of affine if/else 739conditionals and affine for loops. Discussion of the tradeoff was captured in 740this document: 741[ MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md). 742 743At a high level, we have two alternatives here: 744 7451. Schedule tree representation instead of an affine loop AST form: The current 746 proposal uses an affine loop and conditional tree form, which is syntactic 747 and with no separation of domains as sets and schedules as multidimensional 748 affine functions. A schedule tree form however makes polyhedral domains and 749 schedules a first class concept in the IR allowing compact expression of 750 transformations through the schedule tree without changing the domains of 751 instructions. Such a representation also hides prologues, epilogues, partial 752 tiles, complex loop bounds and conditionals making loop nests free of 753 "syntax". Cost models instead look at domains and schedules. In addition, if 754 necessary such a domain schedule representation can be normalized to 755 explicitly propagate the schedule into domains and model all the cleanup 756 code. An example and more detail on the schedule tree form is in the next 757 section. 7581. Having two different forms of "affine regions": an affine loop tree form 759 and a polyhedral schedule tree form. In the latter, ops could carry 760 attributes capturing domain, scheduling, and other polyhedral code 761 generation options with IntegerSet, AffineMap, and other attributes. 762 763#### Schedule Tree Representation for Affine Regions 764 765This representation is based on a simplified form of the domain/schedule 766representation used by the polyhedral compiler community. Domains represent what 767has to be executed while schedules represent the order in which domain elements 768are interleaved. We model domains as non-piece-wise convex integer sets, and 769schedules as affine functions; however, the former can be disjunctive, and the 770latter can be piece-wise affine relations. In the schedule tree representation, 771domain and schedules for instructions are represented in a tree-like structure 772which is called a schedule tree. Each non-leaf node of the tree is an abstract 773polyhedral dimension corresponding to an abstract fused loop for each ML 774instruction that appears in that branch. Each leaf node is an ML Instruction. 775 776```mlir 777// A tiled matmul code (128x128x128) represented in schedule tree form 778 779// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5) 780#intset_ij = (i, j) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N-1 >= 0 781#intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, 782 -j + M-1 >= 0, k >= 0, -k + N - 1 >= 0) 783func @matmul(%A, %B, %C, %M, %N, %K) : (...) { // %M, N, K are symbols 784 // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops 785 mldim %t1 : {S1,S2,S3,S4,S5} floordiv (i, 128) { 786 mldim %t2 : {S1,S2,S3,S4,S5} floordiv (j, 128) { 787 // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2) 788 call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K) 789 with @intset_ij(%i, %j) [%M, %N, %K] 790 mldim %t3 : {S2,S3,S4,S5} floordiv (k, 128) { 791 // (%i, %j, %k) = affine.apply (d0, d1, d2) 792 // -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3) 793 call dma_mem_to_scratchpad(%A, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K] 794 // (%i, %j, %k) = affine.apply (d0, d1, d2) 795 // -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3) 796 call dma_mem_to_scratchpad(%B, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K] 797 mldim %t4 : {S4} i mod 128 { 798 mldim %t5 : {S4} j mod 128 { 799 mldim %t6 : {S4} k mod 128 { 800 // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6) 801 call matmul_body(A, B, C, %i, %j, %k, %M, %N, %K) 802 with #inset_ijk(%i, %j, %k) [%M, %N, %K] 803 } // end mld4im t6 804 } // end mldim t5 805 } // end mldim t4 806 } // end mldim t3 807 // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2) 808 call $dma_scratchpad_to_mem_C ... with #intset(%i, %j) [%M, %N, %K] 809 } // end mldim t2 810 } // end mldim t1 811 return 812} 813 814``` 815 816### Affine Relations 817 818The current MLIR spec includes affine maps and integer sets, but not 819affine relations. Affine relations are a natural way to model read and 820write access information, which can be very useful to capture the 821behavior of external library calls where no implementation is 822available, high-performance vendor libraries, or user-provided / 823user-tuned routines. 824 825An affine relation is a relation between input and output dimension identifiers 826while being symbolic on a list of symbolic identifiers and with affine 827constraints on the identifiers. 828 829Syntax: 830 831``` 832// Affine relation definition at the top of file 833affine-rel-def ::= affine-rel-id `=` affine-relation-inline 834 835affine-rel-id ::= `##` prefixed-id 836 837affine-relation-inline ::= 838 `(` input-dims `)` (`[` symbols `]`)? `->` 839 `(` output-dims `)` : affine-constraint-conjunction 840 841input-dims ::= bare-id-list 842output-dims ::= bare-id-list 843symbols ::= bare-id-list 844 845affine-rel ::= affine-rel-id | affine-relation-inline 846 847// Usage 848affine-rel-spec ::= affine-rel dim-and-symbol-use-list 849``` 850 851All identifiers appearing in input-dims, output-dims, and symbol-dims are 852pairwise distinct. All affine-constraint non-terminals in the above syntax are 853allowed to contain identifiers only from input-dims, output-dims, and 854symbol-dims. 855 856Affine relations are used to model read, write, may_read, and may_write sets of 857functions in the IR. The output dimension identifiers correspond to the data 858dimensions. 859 860Example: 861 862```mlir 863// read relation: two elements ( d0 <= r0 <= d0+1 ) 864##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0 865 866func @count (%A : memref<128xf32>, %pos : i32) -> f32 867 reads: {%A ##aff_rel9 (%pos)} 868 writes: /* empty */ 869 may_reads: /* empty */ 870 may_writes: /* empty */ { 871bb0 (%0, %1: memref<128xf32>, i64): 872 %val = affine.load %A [%pos] 873 %val = affine.load %A [%pos + 1] 874 %p = mulf %val, %val : f32 875 return %p : f32 876} 877``` 878 879### Regions 880 881#### Making function definition an operation 882 883MLIR supports values of a Function type. Instead of having first-class IR 884concept for functions, one could define an operation with a body region that 885defines a function value. The particularity of functions is that their names are 886globally visible and can be referred to before being defined, unlike SSA values 887that must be defined first. Implementing a "function definition" operation would 888require to relax some of the SSA constraints in a region, and also make the IR 889Module a region as well. It would also affect the core infrastructure (e.g., 890function passes) only for the sake of concept unification. 891 892#### Having types on a region 893 894Instead of inspecting the types of arguments of the first block, one could give 895the region itself a type. This type would be redundant with block argument 896types, which must have values and create room for type mismatches. While 897functions do have types that are partly redundant with the arguments of the 898first block in the function, this is necessary to support function declarations 899that do not have a body which we can refer to in order to obtain the argument 900types. A region is always contained in an operation or a function that can be 901queried to obtain the “type” of the region if necessary. 902 903A type on a region can be justified if Regions were to be considered separately 904from the enclosing entity (operation or function) and had their own semantics 905that should be checked. 906 907#### Attaching attributes to regions 908 909Regions could be annotated with dialect attributes to use attribute verification 910hooks. An operation could take multiple regions as arguments, and each of them 911may require different attributes. However, there are currently very few 912practical cases where this would be necessary. Instead, one could simulate 913per-region attributes with array attributes attached to the entity containing 914the region (operation or function). This decreases the overall complexity of the 915IR and enables more concise and op-specific forms, e.g., when all regions of an 916op have the same attribute that can be only mentioned once. Since the semantics 917of the region is entirely defined by the enclosing entity, it also makes sense 918to have attributes attached to that entity rather than to the region itself. 919 920This can be reconsidered in the future if we see a non-neglectable amount of use 921cases. 922 923### Read/Write/May_Read/May_Write sets for External Functions 924 925Having read, write, may_read, and may_write sets for external functions which 926include opaque ones, high-performance vendor libraries such as CuDNN, CuB, MKL, 927FFT libraries, user-provided/optimized functions, or data movement runtimes such 928as DMA ones is a powerful feature. It allows the compiler to perform analysis, 929composition/transformation in the presence of such calls and with loops around 930such calls on sub-tensors. For user-provided or custom hand-tuned functions, the 931read/write/may_read/may_write sets could be provided a-priori by a user as part 932of the external function signature or they could be part of a database. 933 934TODO: Design this, and update to use function attribute syntax. 935 936Example: 937 938```mlir 939##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1 940 941func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>) 942 -> f32 [ 943 reads: {%M, ##rel9() } 944 writes: /* empty */ 945 may_reads: /* empty */ 946 may_writes: /* empty */ 947] 948 949func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>, 950 %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32, 951 #layout_map0>) [ 952 reads: {%M, ##rel9() } 953 writes: /* empty */ 954 may_reads: /* empty */ 955 may_writes: /* empty */ 956 ] 957 958``` 959 960### Memref Extensions 961 9621. Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor 963 dimensions where there is symmetry: use integer set (affine constraints) to 964 model tensor data space (instead of just extents). Requires some changes to 965 the IR and the in-memory form. 9661. Layout maps 967 968 1. Allow piece-wise affine maps for layouts: allows clean modeling of 969 boundary cases for images/tensors through padding, wrapping, mirroring, 970 padding where padded values are the results of computation as opposed to 971 data, padding in the interior as opposed to just boundaries. 972 1. Allow many-to-one layout maps: Index and layout maps in the current 973 proposal are bijective. Extending them to many-to-one layout maps allows 974 cleaner(?) modeling of broadcast/reduce style computations while reusing 975 memory. 976 977 Proposal 2(a) requires non-trivial changes to the IR and the in-memory 978 representation. 2(b) requires no change, but impacts how cost models look at 979 index and layout maps. 980 981### `affine.if` and `affine.for` Extensions for "Escaping Scalars" 982 983We considered providing a representation for SSA values that are live out of 984`if/else` conditional bodies and loop carried in `affine.for` loops. We 985ultimately abandoned this approach due to its complexity. In the current design 986of MLIR, scalar variables cannot escape for loops or if instructions. In 987situations, where escaping is necessary, we use zero-dimensional tensors and 988memrefs instead of scalars. 989 990**TODO**: This whole section is obsolete and should be updated to use block 991arguments and a yield like terminator in for/if instructions. 992 993The abandoned design of supporting escaping scalars is as follows: 994 995#### affine.for Instruction 996 997Syntax: 998 999``` 1000[<out-var-list> =] 1001for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step> 1002 [with <in-var-list>] { <loop-instruction-list> } 1003``` 1004 1005out-var-list is a comma separated list of SSA values defined in the loop body 1006and used outside the loop body. in-var-list is a comma separated list of SSA 1007values used inside the loop body and their initializers. loop-instruction-list 1008is a list of instructions that may also include a yield instruction. 1009 1010Example: 1011 1012```mlir 1013// Return sum of elements in 1-dimensional mref A 1014func i32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) { 1015 %init = 0 1016 %result = affine.for %i = 0 to N with %tmp(%init) { 1017 %value = affine.load %A[%i] 1018 %sum = %value + %tmp 1019 yield %sum 1020 } 1021 return %result : i32 1022} 1023``` 1024 1025#### affine.if/else Instruction 1026 1027Syntax: 1028 1029``` 1030<out-var-list> = affine.if (<cond-list>) {...} [else {...}] 1031``` 1032 1033Out-var-list is a list of SSA values defined by the if-instruction. The values 1034are arguments to the yield-instruction that occurs in both then and else clauses 1035when else clause is present. When if instruction contains only if clause, the 1036escaping value defined in the then clause should be merged with the value the 1037variable had before the if instruction. The design captured here does not handle 1038this situation. 1039 1040Example: 1041 1042```mlir 1043// Compute sum of half of the array 1044func i32 @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) { 1045 %s0 = 0 1046 %s1 = affine.for %i = 1 ... N step 1 with %s2 (%s0) { 1047 %s3 = if (%i >= %N / 2) { 1048 %v0 = affine.load %A[%i] 1049 %s4 = %s2 + %v0 1050 yield %s4 1051 } 1052 yield %s3 1053 } 1054 return %s1 : i32 1055} 1056``` 1057 1058### Multithreading the compiler 1059 1060People want compilers to go fast, and one simple way to do that is to 1061multi-thread them. There are multiple strategies for this, but a simple one is 1062to optimize and compile separate functions in parallel. LLVM's original pass 1063manager anticipated this demand, and the CallGraphSCCPass manager is even 1064designed to support this as well, but unfortunately, a few early design 1065decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO 1066are forced to split programs into separate LLVM modules/context and optimize 1067those chunks independently. 1068 1069The problem is that LLVM has several objects in its IR that are globally uniqued 1070and also mutable: notably constants like `i32 0`. In LLVM, these constants are 1071`Value`'s, which allow them to be used as operands to instructions, and that 1072they also have SSA use lists. Because these things are uniqued, every `i32 0` in 1073any function shares a use list. This means that optimizing multiple functions in 1074parallel won't work (at least without some sort of synchronization on the use 1075lists, which would be unbearably inefficient). 1076 1077MLIR now supports a multithreaded pass manager. We do this through several 1078design choices: 1079 10801. MLIR makes use of extensive uniqued immutable data structures (affine 1081 expressions, types, etc are all immutable, uniqued, and immortal). 10822. Constants are defined in per-function pools, instead of being globally 1083 uniqued. 10843. Functions themselves are not SSA values either, so they don't have the same 1085 problem as constants. 10864. FunctionPasses are copied (through their copy ctor) into one instance per 1087 thread, avoiding sharing of local state across threads. 1088 1089This allows MLIR function passes to support efficient multithreaded compilation 1090and code generation. 1091