1# MLIR Rationale
2
3This document is intended to capture some of the alternatives considered and
4open debates in the design of MLIR, along with the rationale for certain
5decisions we made. This is not intended to be a "finely groomed" document - we
6prefer the ability to dump in interesting tidbits without worrying too much
7about their consistency or readability.
8
9[TOC]
10
11## Abstract
12
13MLIR is a compiler intermediate representation with similarities to traditional
14three-address SSA representations (like
15[LLVM IR](http://llvm.org/docs/LangRef.html) or
16[SIL](https://github.com/apple/swift/blob/master/docs/SIL.rst)), but which
17introduces notions from the polyhedral loop optimization works as first class
18concepts. This hybrid design is optimized to represent, analyze, and transform
19high level dataflow graphs as well as target-specific code generated for high
20performance data parallel systems. Beyond its representational capabilities, its
21single continuous design provides a framework to lower from dataflow graphs to
22high performance target specific code.
23
24MLIR stands for one of "Multi-Level IR" or "Multi-dimensional Loop IR" or
25"Machine Learning IR" or "Mid Level IR", we prefer the first. This document only
26provides the rationale behind MLIR -- its actual
27[specification document](../LangRef.md) and other content is hosted elsewhere.
28
29## Introduction and Motivation
30
31The Multi-Level Intermediate Representation (MLIR) is intended for easy
32expression and optimization of computations involving deep loop nests and dense
33matrices of high dimensionality. It is thus well-suited to deep learning
34computations in particular. Yet it is general enough to also represent arbitrary
35sequential computation. The representation allows high-level optimization and
36parallelization for a wide range of parallel architectures including those with
37deep memory hierarchies --- general-purpose multicores, GPUs, and specialized
38neural network accelerators.
39
40MLIR uses ideas drawn from IRs of LLVM and Swift for lower level constructs
41while combining them with ideas from the polyhedral abstraction to represent
42loop nests, multidimensional data (tensors), and transformations on these
43entities as first class concepts in the IR.
44
45MLIR is a multi-level IR, i.e., it represents code at a domain-specific
46representation such as HLO or TensorFlow graphs, all the way down to the machine
47level. MLIR is able to represent arbitrary control flow and arbitrary data
48accesses, and is general enough to represent nearly all sequential computation.
49This is a key distinction from existing polyhedral representation
50implementations (such as LLVM [Polly](https://polly.llvm.org/)) that are able to
51use the polyhedral abstraction in a way isolated from the LLVM IR and only for
52affine loop nests, i.e., portions of the code where array accesses, loop bounds,
53and conditionals are regular (involve linear functions of loop iterators and
54constant symbols). The presence of statically unpredictable data accesses or
55control flow does not preclude representation in MLIR, but only limits to a
56certain extent the ability to reason about and apply transformations using the
57polyhedral abstraction.
58
59Maps, sets, and relations with affine constraints are the core structures
60underlying a polyhedral representation of high-dimensional loop nests and
61multidimensional arrays. These structures are represented as textual
62expressions in a form close to their mathematical form. These structures are
63used to capture loop nests, tensor data structures, and how they are reordered
64and mapped for a target architecture. All structured or "conforming" loops are
65captured as part of the polyhedral information, and so are tensor variables,
66their layouts, and subscripted accesses to these tensors in memory.
67
68The information captured in the IR allows a compact expression of all loop
69transformations, data remappings, explicit copying necessary for explicitly
70addressed memory in accelerators, mapping to pre-tuned expert-written
71primitives, and mapping to specialized vector instructions. Loop transformations
72that can be easily implemented include the body of affine transformations: these
73subsume all traditional loop transformations (unimodular and non-unimodular)
74such as loop tiling, interchange, permutation, skewing, scaling, relative
75shifting, reversal, fusion, and distribution/fission. Transformations on data
76layout such as padding and transforming to blocked layouts are also represented
77well via affine layout maps.
78
79MLIR's design allows a progressive lowering to target-specific forms. Besides
80high-level transformations for loop nests and data layouts that a typical
81mid-level optimizer is expected to deal with, MLIR is also designed to perform
82certain low-level scheduling and mapping decisions that a typical backend IR is
83entrusted with: these include mapping to specialized vector instructions,
84auto-vectorization, and software pipelining. The need to support these
85transformations stems from the fact that neural network accelerators have
86specialized units that deal with large chunks of data whose computation maps
87back to chunks of more than one loop of the loop nests as viewed by a program at
88a level closer to the original specification. Such specialized units or
89instructions operate on multidimensional data chunks from a programmer's
90viewpoint. It thus makes it hard or infeasible for a backend operating on a very
91low-level IR close to assembly to lift and reconstruct loops and perform such a
92mapping. This is in contrast to classic instruction selection and scheduling in
93today's compilers that primarily only deals with the body of the innermost loop.
94MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor
95libraries operating on data at higher levels (or at the highest level) of the
96memory hierarchy.
97
98In summary, MLIR is convenient for and closed under the kind of transformations
99needed to lower to general-purpose as well as specialized accelerators. It also
100allows one to build modular and reusable target independent and target dependent
101passes.
102
103## Design Decisions
104
105This section sheds light on some of the design decisions -- some of these are
106indirectly implied by the specification document.
107
108### Loads and stores
109
110The 'load' and 'store' instructions are specifically crafted to fully resolve to
111an element of a memref. These instructions take as arguments n+1 indices for an
112n-ranked tensor. This disallows the equivalent of pointer arithmetic or the
113ability to index into the same memref in other ways (something which C arrays
114allow for example). Furthermore, for the affine constructs, the compiler can
115follow use-def chains (e.g. through
116[affine.apply operations](../Dialects/Affine.md/#affineapply-affineapplyop)) or through
117the map attributes of [affine operations](../Dialects/Affine.md/#operations)) to
118precisely analyze references at compile-time using polyhedral techniques. This
119is possible because of the [restrictions on dimensions and symbols](../Dialects/Affine.md/#restrictions-on-dimensions-and-symbols).
120
121A scalar of element-type (a primitive type or a vector type) that is stored in
122memory is modeled as a 0-d memref. This is also necessary for scalars that are
123live out of for loops and if conditionals in a function, for which we don't yet
124have an SSA representation --
125[an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to allow that is
126described later in this doc.
127
128### Symbols and types
129
130The current MLIR disallows use of symbols in types. For example, when a tensor
131or memref dimension is statically unknown, it is denoted in the type as '?'. An
132SSA symbol is then bound to it when a memref is created. The actual value of the
133unknown dimension can be queried using the "dim" builtin as shown below.
134
135Example:
136
137```mlir
138func foo(...) {
139  %A = alloc <8x?xf32, #lmap> (%N)
140  ...
141  call bar(%A) : (memref<8x?xf32, #lmap>)
142}
143
144func bar(%A : memref<8x?xf32, #lmap>) {
145  // Type of %A indicates that %A has dynamic shape with 8 rows
146  // and unknown number of columns. The number of columns is queried
147  // dynamically using dim instruction.
148  %N = dim %A, 1 : memref<8x?xf32, #lmap>
149
150  affine.for %i = 0 to 8 {
151    affine.for %j = 0 to %N {
152      // A[i,j] += 1
153      %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
154      %s2 = add %s1, 1
155      affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
156    }
157  }
158  return
159}
160
161```
162
163An alternative design is to embed the reference to symbols directly in the
164type - memref<8x%Nxf32>. We went for the current approach in MLIR because it
165simplifies the design --- types remain immutable when the values of symbols
166change.
167
168### Block Arguments vs PHI nodes
169
170MLIR Regions represent SSA using "[block arguments](../LangRef.md/#blocks)" rather
171than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in LLVM.
172This choice is representationally identical (the same constructs can be
173represented in either form) but block arguments have several advantages:
174
1751.  LLVM PHI nodes always have to be kept at the top of a block, and
176    transformations frequently have to manually skip over them. This is defined
177    away with BB arguments.
1781.  LLVM has a separate function Argument node. This is defined away with BB
179    arguments, because the arguments to the entry block serve this purpose.
1801.  Blocks of PHI nodes in LLVM execute atomically, which is surprising and
181    super confusing to compiler engineers and it is easy to introduce bugs with
182    this (very related to the
183    "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
184    problem in SSA lowering literature.) With the BB argument representation,
185    this confusion is defined away.
1861.  The entry list of PHI nodes in LLVM are unordered, and some blocks have
187    thousands of predecessors (e.g. unwind blocks). This can cause long compile
188    time problems because transformations have to linearly scan this list. This
189    is defined away with BB argument representation.
1901.  LLVM has no way to represent values that are available only in one successor
191    but not the other, e.g. its invoke instruction cannot produce the exception
192    value JUST on the exception edge. Instead, the
193    [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
194    is a hack used to represent this. MLIR doesn't make use of this capability,
195    but SIL uses it extensively, e.g. in the
196    [switch_enum instruction](https://github.com/apple/swift/blob/master/docs/SIL.rst#switch-enum).
197
198For more context, block arguments were previously used in the Swift
199[SIL Intermediate Representation](https://github.com/apple/swift/blob/master/docs/SIL.rst),
200and described in
201[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
202interest
203[starts here](https://www.google.com/url?q=https://youtu.be/Ntj8ab-5cvE?t%3D596&sa=D&ust=1529450150971000&usg=AFQjCNFQHEWL7m8q3eO-1DiKw9zqC2v24Q).
204
205### Index type usage and limitations
206
207Index types are intended to be used for platform-specific "size" values and may
208appear in subscripts, sizes of aggregate types and affine expressions. They are
209also tightly coupled with `affine.apply` and affine.load/store operations;
210having `index` type is a necessary precondition of a value to be acceptable by
211these operations.
212
213We allow `index` types in tensors, vectors, and memrefs as a code generation
214strategy has to map `index` to an implementation type and hence needs to be able
215to materialize corresponding values. However, the target might lack support for
216`vector` values with the target specific equivalent of the `index` type.
217
218### Data layout of non-primitive types
219
220Data layout information such as the bit width or the alignment of types may be
221target and ABI-specific and thus should be configurable rather than imposed by
222the compiler. Especially, the layout of compound or `index` types may vary. MLIR
223specifies default bit widths for certain primitive _types_, in particular for
224integers and floats. It is equal to the number that appears in the type
225definition, e.g. the bit width of `i32` is `32`, so is the bit width of `f32`.
226The bit width is not _necessarily_ related to the amount of memory (in bytes) or
227the register size (in bits) that is necessary to store the value of the given
228type. For example, `vector<3xi57>` is likely to be lowered to a vector of four
22964-bit integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes,
230rather than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the
231bit width. MLIR makes such [data layout information](../DataLayout.md)
232configurable using attributes that can be queried during lowering, for example,
233when allocating a compound type.
234
235The data layout of dialect-specific types is undefined at MLIR level. Yet
236dialects are free to define their own quantities and make them available via the
237data layout infrastructure.
238
239### Integer signedness semantics
240
241Integers in the builtin MLIR type system have a bitwidth (note that the `index`
242type has a symbolic width equal to the machine word size), and they *may*
243additionally have signedness semantics. The purpose is to satisfy the needs of
244different dialects, which can model different levels of abstractions. Certain
245abstraction, especially closer to source language, might want to differentiate
246signedness with integer types; while others, especially closer to machine
247instruction, might want signless integers. Instead of forcing each abstraction
248to adopt the same integer modelling or develop its own one in house, Integer
249type provides this as an option to help code reuse and consistency.
250
251For the standard dialect, the choice is to have signless integer types. An
252integer value does not have an intrinsic sign, and it's up to the specific op
253for interpretation. For example, ops like `addi` and `muli` do two's complement
254arithmetic, but some other operations get a sign, e.g. `divis` vs `diviu`.
255
256LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
257which was introduced in a revamp rolled out
258[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
259Prior to that, from
260[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
261[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
262uses signed types like "sbyte" and "ubyte". This shift was important and has
263served LLVM well over the years. The reason this is important is that it is a
264good thing for an intermediate representation to represent the same computation
265with the same instruction. Signed types got in the way, because (e.g.) an "add
266of an sbyte" does the same computation as an "add of a ubyte", but the type
267system made them look artificially different. This split also required casts
268like "cast from sbyte to ubyte" which do nothing at the machine level. Removing
269signs from the type system eliminated these problems, making the compiler
270simpler.
271
272More information about this split is available in an old
273[talk on youtube](https://www.youtube.com/watch?v=VeRaLPupGks) talking about
274LLVM 2.0.
275
276Note that this rationale only applies to the "standard ops" dialect in which we
277can express an opinion about its design. Other dialects generally try to model
278an external system, and should aim to reflect its design as closely as possible.
279
280### Splitting floating point vs integer operations
281
282The MLIR "standard" operation set splits many integer and floating point
283operations into different categories, for example `addf` vs `addi` and `cmpf` vs
284`cmpi`
285([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
286These instructions _are_ polymorphic on the number of elements in the type
287though, for example `addf` is used with scalar floats, vectors of floats, and
288tensors of floats (LLVM does the same thing with its scalar/vector types).
289
290This split is important because floating point and integer operations are quite
291different in practice: for example, floating point values include NaN's, so
292[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
293[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
294should use different comparison opcodes. On the arithmetic side of things,
295floating point operations support rounding modes, floating point contractions,
296["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers
297may want to have two's complement overflow behavior or be undefined on
298[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
299for performance.
300
301We are a long way from this sort of thing being a priority to care about in
302MLIR, but since we have experience and know the right way to do this, we'd
303rather design it in from the beginning.
304
305Note that this rationale only applies to the "standard ops" dialect in which we
306can express an opinion about its design. Other dialects generally try to model
307an external system, and should aim to reflect its design as closely as possible.
308
309### Specifying sign in integer comparison operations
310
311Since integers are [signless](#integer-signedness-semantics), it is necessary to define the
312sign for integer comparison operations. This sign indicates how to treat the
313foremost bit of the integer: as sign bit or as most significant bit. For
314example, comparing two `i4` values `0b1000` and `0b0010` yields different
315results for unsigned (`8 > 3`) and signed (`-8 < 3`) interpretations. This
316difference is only significant for _order_ comparisons, but not for _equality_
317comparisons. Indeed, for the latter all bits must have the same value
318independently of the sign. Since both arguments have exactly the same bit width
319and cannot be padded by this operation, it is impossible to compare two values
320whose bit representations would differ while the values are interpreted as
321equal.
322
323### Specifying comparison kind as attribute
324
325Unlike arithmetic, comparison operators share several common properties, e.g.
326they cannot be considered associative. In practice, comparisons are sometimes
327implemented by the same instruction or its variants so it makes sense to group
328them together at the IR level.
329
330An alternative would be introducing ten distinct operators for all currently
331supported kinds of integer comparisons. These operators would have increased the
332number of "reserved" names used by standard operations as well as the size of
333the C++ API while their implementations would have been mostly identical.
334
335The comparison kind is internally an integer attribute. However, for the sake of
336readability by humans, custom assembly form accepts string literals that are
337mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies
338integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what
339gets compared to what else. This syntactic sugar is possible thanks to parser
340logic redefinitions for custom assembly form of non-builtin operations.
341Supporting it in the full notation would have required changing how the main
342parsing algorithm works and may have unexpected repercussions. While it had been
343possible to store the predicate as string attribute, it would have rendered
344impossible to implement switching logic based on the comparison kind and made
345attribute validity checks (one out of ten possible kinds) more complex.
346
347### 'select' operation to implement min/max
348
349Although `min` and `max` operations are likely to occur as a result of
350transforming affine loops in ML functions, we did not make them first-class
351operations. Instead, we provide the `select` operation that can be combined with
352`cmpi` to implement the minimum and maximum computation. Although they now
353require two operations, they are likely to be emitted automatically during the
354transformation inside MLIR. On the other hand, there are multiple benefits of
355introducing `select`: standalone min/max would concern themselves with the
356signedness of the comparison, already taken into account by `cmpi`; `select` can
357support floats transparently if used after a float-comparison operation; the
358lower-level targets provide `select`-like instructions making the translation
359trivial.
360
361This operation could have been implemented with additional control flow: `%r =
362select %cond, %t, %f` is equivalent to
363
364```mlir
365^bb0:
366  cond_br %cond, ^bb1(%t), ^bb1(%f)
367^bb1(%r):
368```
369
370However, this control flow granularity is not available in the ML functions
371where min/max, and thus `select`, are likely to appear. In addition, simpler
372control flow may be beneficial for optimization in general.
373
374### Regions
375
376#### Attributes of type 'Block'
377
378We considered representing regions through `ArrayAttr`s containing a list of a
379special type `IRBlockAttr`, which in turn would contain a list of operations.
380All attributes in MLIR are unique’d within the context, which would make the IR
381inside the regions immortal for no good reason.
382
383#### Use "inlined" functions as regions
384
385We considered attaching a "force-inline" attribute on a function and/or a
386function `call` operation. Even the minimal region support (use cases in
387affine.for and affine.if existing before the regions) requires access to the
388values defined in the dominating block, which is not supported by functions.
389Conceptually, function bodies are instances of regions rather than the inverse;
390regions can also be device kernels, alternative sections, etc.
391
392#### Dedicated `region` operation
393
394This would mean we have a special kind of operation that is allowed to have
395regions while other operations are not. Such distinction is similar to the
396Stmt/Op difference we have had and chose to remove to make the IR simpler and
397more flexible. It would also require analyses and passes to consider the
398interplay between operations (e.g., an `affine.for` operation must be followed
399by a region operation). Finally, a region operation can be introduced using the
400current implementation, among other operations and without being special in any
401sense.
402
403#### Explicit capture of the values used in a region
404
405Being able to use values defined outside the region implies that use-def chains
406may contain uses from different nested regions. Consequently, IR transformations
407and analyses can pull the instruction defining the value across region
408boundaries, for example in case of TableGen-defined canonicalization patterns.
409This would not be the case if all used values had been passed as region
410arguments. One of the motivations for introducing regions in the IR is precisely
411to enable cross-region analyses and transformations that are simpler than
412inter-procedural transformations. Having uses from different regions appear in
413the same use-def chain, contrary to an additional data structure maintaining
414correspondence between function call arguments as uses of the original
415definitions and formal arguments as new definitions, enables such
416simplification. Since individual operations now belong to blocks, which belong
417to regions, it is always possible to check if the definition of the value
418belongs to the same region as its particular use. The risk is that any IR
419traversal will need to handle explicitly this situation and it is easy to forget
420a check (or conversely it isn’t easy to design the right check in a tablegen
421pattern for example): traversing use-def chains potentially crosses implicitly
422semantic barriers, making it possible to unknowingly break region semantics.
423This is expected to be caught in the verifier after the transformation.
424
425At the same time, one may choose to pass certain or all values as region
426arguments to explicitly break the use-def chains in the current proposal. This
427can be combined with an attribute-imposed semantic requirement disallowing the
428body of the region to refer to any value from outside it.
429
430### Dialect type extensions
431
432This section describes the design decisions that shaped the dialect extensible
433type system present in MLIR.
434
435#### Interactions between dialects
436
437There are two different interactions between dialects that are important to
438understand. When types of a dialect are:
439
440*   In operations of other dialects
441
442    -   For standard/builtin operations, only builtin types are allowed. This
443        restriction allows for operations to clearly understand the invariants
444        that they are working under.
445    -   Outside of standard/builtin operations, dialects are expected to verify
446        the allowable operation types per operation.
447
448*   In types of other dialects
449
450    -   For builtin types, these types are allowed to contain types from other
451        dialects. This simplifies the type system and removes the need for
452        dialects to redefine all of the builtin aggregate types, e.g. tensor, as
453        well as the memref type. Dialects are expected to verify that a specific
454        type is valid within a builtin type, e.g. if a type can be an element of
455        a tensor.
456    -   For dialect types, the dialect is expected to verify any type
457        invariants, e.g. if the tensor type can contain a specific type of that
458        dialect.
459
460#### Separating builtin and standard types
461
462Following the separation between the built-in and standard dialect, it makes
463sense to separate built-in types and standard dialect types. Built-in types are
464required for the validity of the IR itself, e.g. the function type (which
465appears in function signatures and generic assembly forms of operations).
466Integer, float, vector, memref and tensor types, while important, are not
467necessary for IR validity.
468
469#### Unregistered types
470
471MLIR supports unregistered operations in generic assembly form. MLIR also
472supports a similar concept for types. When parsing, if the dialect for dialect
473type has not been registered the type is modeled as an 'OpaqueType'. This allows
474for types to be round-tripped without needing to link in the dialect library
475that defined them. No additional information about opaque types, outside of
476parsing/printing, will be available.
477
478#### Dialect type syntax
479
480Dialect extended types are represented as string literals wrapped inside of the
481dialect namespace. This means that the parser delegates to the dialect for
482parsing specific type instances. This differs from the representation of dialect
483defined operations, of which have an identifier name that the parser uses to
484identify and parse them.
485
486This representation was chosen for several reasons:
487
488##### Dialects must provide custom type parsers
489
490Dialect type parsing cannot plug into the existing parser infrastructure as
491operations do with the OpAsmParser/Printer. Operations have a defined syntax
492structure that is the same across all dialects. Types, on the other hand, may
493have many different, and sometimes conflicting, parsing constraints that would
494be difficult/unmaintainable to provide within a single interface.
495
496This also has the added benefit of encouraging dialects to reuse existing
497external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
498type that is simply a wrapper around LLVM types. The LLVM dialect would then use
499the existing LLVM type parsing infrastructure.
500
501Example:
502
503```mlir
504%s = "foo"() : () -> !llvm<"i32*">
505```
506
507##### Types do not always have canonical names
508
509Unlike operations, types generally do not have a formal canonical name. For
510example, function types have no defined keyword and integer types are defined by
511a regular expression to support arbitrary bitwidth. Dialects with existing type
512systems, e.g. LLVM, are likely to provide wrappers around their existing type
513systems. For these wrapper types there is no simple canonical name, it's logical
514to think of these types as existing within the namespace of the dialect. If a
515dialect wishes to assign a canonical name to a type, it can be done via
516[type aliases](../LangRef.md/#type-aliases).
517
518### Tuple types
519
520The MLIR type system provides first class support for defining
521[tuple types](../Dialects/Builtin/#tupletype). This is due to the fact that `Tuple`
522represents a universal concept that is likely to, and has already begun to,
523present itself in many different dialects. Though this type is first class in
524the type system, it merely serves to provide a common mechanism in which to
525represent this concept in MLIR. As such, MLIR provides no standard operations
526for interfacing with `tuple` types. It is up to dialect authors to provide
527operations, e.g. extract_tuple_element, to interpret and manipulate them. When
528possible, operations should prefer to use multiple results instead. These
529provide a myriad of benefits, such as alleviating any need for tuple-extract
530operations that merely get in the way of analysis and transformation.
531
532### Assembly forms
533
534MLIR decides to support both generic and custom assembly forms under the
535following considerations:
536
537MLIR is an open system; it is designed to support modular and pluggable
538dialects. Depending on whether there exists a corresponding dialect and whether
539the dialect is plugged in, operations may or may not be registered into MLIR
540system. Yet we still need a way to investigate these operations. So the generic
541assembly form is mandated by this aspect of MLIR system. It provides a default
542textual form for operations.
543
544On the other hand, an assembly form is for assisting developers to investigate
545the IR. The generic form serves as a safe fallback but it can be too verbose for
546certain ops. Therefore, MLIR gives each dialect the choice to define a custom
547assembly form for each operation according to the operation's semantics and
548specific needs. The custom assembly form can de-duplicate information from the
549operation to derive a more concise form, thus better facilitating the
550comprehension of the IR.
551
552## Examples
553
554This section describes a few very simple examples that help understand how MLIR
555represents computation.
556
557### Non-affine control flow
558
559```mlir
560// A simple linear search in every row of a matrix
561for (i = 0; i < N; i++) {
562  for (j = 0; j < N; j++) {
563    // dynamic control flow
564    if (a[i][j] == key) {
565      s[i] = j;
566      break;
567    }
568  }
569}
570```
571
572The presence of dynamic control flow leads to an inner non-affine function
573nested in an outer function that uses affine loops.
574
575```mlir
576func @search(%A: memref<?x?xi32>, %S: <?xi32>, %key : i32) {
577  %ni = dim %A, 0 : memref<?x?xi32>
578  // This loop can be parallelized
579  affine.for %i = 0 to %ni {
580    call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32)
581  }
582  return
583}
584
585func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i : i32) {
586  %nj = dim %A, 1 : memref<?x?xi32>
587  br ^bb1(0)
588
589^bb1(%j: i32)
590  %p1 = cmpi "lt", %j, %nj : i32
591  cond_br %p1, ^bb2, ^bb5
592
593^bb2:
594  %v = affine.load %A[%i, %j] : memref<?x?xi32>
595  %p2 = cmpi "eq", %v, %key : i32
596  cond_br %p2, ^bb3(%j), ^bb4
597
598^bb3(%j: i32)
599  affine.store %j, %S[%i] : memref<?xi32>
600  br ^bb5
601
602^bb4:
603  %jinc = addi %j, 1 : i32
604  br ^bb1(%jinc)
605
606^bb5:
607  return
608}
609```
610
611As per the [MLIR spec](../LangRef.md), the restrictions on dimensions and symbol
612identifiers to be used with the affine.apply operation only apply to accesses
613inside `affine.for` and `affine.if` operations. However, an analysis of accesses
614inside the called function (`@search_body`) is necessary to determine if the
615`%i` loop could be parallelized: such function access analysis is calling
616context sensitive.
617
618### Non-affine loop bounds
619
620Loop bounds that are not affine lead to a nesting of functions as shown below.
621
622```c
623for (i = 0; i < N; i++)
624  for (j = 0; j < N; j++)
625    // Non-affine loop bound for k loop.
626    for (k = 0; k < pow(2, j); k++)
627       for (l = 0; l < N; l++) {
628        // block loop body
629        ...
630       }
631```
632
633```mlir
634func @outer_nest(%n : index) {
635  affine.for %i = 0 to %n {
636    affine.for %j = 0 to %n {
637      %pow = call @pow(2, %j) : (index, index) ->  index
638      call @inner_nest(%pow, %n) : ...
639    }
640  }
641  return
642}
643
644func @inner_nest(%m : index, %n : index) {
645  affine.for %k = 0 to %m {
646    affine.for %l = 0 to %n {
647      ...
648    }
649  }
650  return
651}
652```
653
654### Reference 2D Convolution
655
656The following example illustrates a reference implementation of a 2D
657convolution, which uses an integer set `#domain` to represent valid input data
658in a dilated convolution.
659
660```mlir
661// Dilation factors S0 and S1 can be constant folded if constant at compile time.
662#domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0,
663                                   S3 - d0 - 1 >= 0, S4 - d1 - 1 >= 0)
664// Identity map (shown here for illustration).
665#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)
666
667// Affine map from output to input coordinate space.
668// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
669// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
670// S4 = h_pad_low, S5 = w_pad_low
671//     %out0 =  %0#1 * %h_stride + %0#4 * %h_kernel_dilation - %h_pad_low
672//     %out1=  %0#2 * %w_stride + %0#5 * %w_kernel_dilation - %w_pad_low
673#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - %S4)
674#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - %S5)
675
676// Semi-affine map to undilated input coordinate space.
677// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
678#map2_0 = (d0, d1) [S0, S1] -> (d0 / S0)
679#map2_1 = (d0, d1) [S0, S1] -> (d1 / S1)
680
681// Conv2D shapes:
682// input:   [batch, input_height, input_width, input_feature]
683// kernel: [kernel_height, kernel_width, input_feature, output_feature]
684// output: [batch, output_height, output_width, output_feature]
685func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
686             %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
687             %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
688  affine.for %b = 0 to %batch {
689    affine.for %oh = 0 to %output_height {
690      affine.for %ow = 0 to %output_width {
691        affine.for %of = 0 to %output_feature {
692          affine.for %kh = 0 to %kernel_height {
693            affine.for %kw = 0 to %kernel_width {
694              affine.for %if = 0 to %input_feature {
695                // Calculate input indices.
696                %1_0 = affine.apply #map1_0 (%0#1, %0#2, %0#4, %0#5)
697                  [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
698                   %h_pad_low, %w_pad_low]
699                %1_1 = affine.apply #map1_1 (%0#1, %0#2, %0#4, %0#5)
700                  [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
701                   %h_pad_low, %w_pad_low]
702
703                // Check if access is not in padding.
704                affine.if #domain(%1_0, %1_1)
705                                       [%h_base_dilation, %w_kernel_dilation, %h_bound, %w_bound] {
706                  %2_0 = affine.apply #map2 (%1_0, %1_1)
707                  %2_1 = affine.apply #map2 (%1_0, %1_1)
708                  // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
709                  call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
710                }
711              }
712            }
713          }
714        }
715      }
716    }
717  }
718  return
719}
720```
721
722TODO: (Add more examples showing the IR for a variety of interesting cases)
723
724## Design alternatives and extensions
725
726This is a list of some design alternatives and extensions that we discussed in
727detail but did not include in the spec or postponed them for future
728consideration on demand. We will revisit these discussions when we have more
729implementation experience and learn more about the challenges and limitations of
730our current design in practice.
731
732### Polyhedral code representation alternatives: schedule lists vs schedules trees vs affine loop/if forms
733
734The current MLIR uses a representation of polyhedral schedules using a tree of
735if/for loops. We extensively debated the tradeoffs involved in the typical
736unordered polyhedral instruction representation (where each instruction has
737multidimensional schedule information), discussed the benefits of schedule tree
738forms, and eventually decided to go with a syntactic tree of affine if/else
739conditionals and affine for loops. Discussion of the tradeoff was captured in
740this document:
741[ MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).
742
743At a high level, we have two alternatives here:
744
7451.  Schedule tree representation instead of an affine loop AST form: The current
746    proposal uses an affine loop and conditional tree form, which is syntactic
747    and with no separation of domains as sets and schedules as multidimensional
748    affine functions. A schedule tree form however makes polyhedral domains and
749    schedules a first class concept in the IR allowing compact expression of
750    transformations through the schedule tree without changing the domains of
751    instructions. Such a representation also hides prologues, epilogues, partial
752    tiles, complex loop bounds and conditionals making loop nests free of
753    "syntax". Cost models instead look at domains and schedules. In addition, if
754    necessary such a domain schedule representation can be normalized to
755    explicitly propagate the schedule into domains and model all the cleanup
756    code. An example and more detail on the schedule tree form is in the next
757    section.
7581.  Having two different forms of "affine regions": an affine loop tree form
759    and a polyhedral schedule tree form. In the latter, ops could carry
760    attributes capturing domain, scheduling, and other polyhedral code
761    generation options with IntegerSet, AffineMap, and other attributes.
762
763#### Schedule Tree Representation for Affine Regions
764
765This representation is based on a simplified form of the domain/schedule
766representation used by the polyhedral compiler community. Domains represent what
767has to be executed while schedules represent the order in which domain elements
768are interleaved. We model domains as non-piece-wise convex integer sets, and
769schedules as affine functions; however, the former can be disjunctive, and the
770latter can be piece-wise affine relations. In the schedule tree representation,
771domain and schedules for instructions are represented in a tree-like structure
772which is called a schedule tree. Each non-leaf node of the tree is an abstract
773polyhedral dimension corresponding to an abstract fused loop for each ML
774instruction that appears in that branch. Each leaf node is an ML Instruction.
775
776```mlir
777// A tiled matmul code (128x128x128) represented in schedule tree form
778
779// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
780#intset_ij = (i, j) [M, N, K]  : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N-1 >= 0
781#intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
782                                     -j +  M-1 >= 0, k >= 0, -k + N - 1 >= 0)
783func @matmul(%A, %B, %C, %M, %N, %K) : (...)  { // %M, N, K are symbols
784  // t1, t2, t3, t4, t5, t6  are abstract polyhedral loops
785  mldim %t1 : {S1,S2,S3,S4,S5}  floordiv (i, 128) {
786    mldim %t2 : {S1,S2,S3,S4,S5}  floordiv (j, 128) {
787      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
788      call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
789          with @intset_ij(%i, %j) [%M, %N, %K]
790      mldim %t3 :   {S2,S3,S4,S5} floordiv (k, 128) {
791        // (%i, %j, %k) = affine.apply (d0, d1, d2)
792        //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
793        call dma_mem_to_scratchpad(%A, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
794        // (%i, %j, %k) = affine.apply (d0, d1, d2)
795        //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
796        call dma_mem_to_scratchpad(%B, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
797        mldim %t4 : {S4} i mod 128 {
798          mldim %t5 : {S4} j mod 128 {
799            mldim %t6 : {S4} k mod 128 {
800              // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
801              call matmul_body(A, B, C, %i, %j, %k, %M, %N, %K)
802                  with #inset_ijk(%i, %j, %k) [%M, %N, %K]
803            } // end mld4im t6
804          } // end mldim t5
805        } // end mldim t4
806      } // end mldim t3
807      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
808      call $dma_scratchpad_to_mem_C ... with #intset(%i, %j) [%M, %N, %K]
809    }  // end mldim t2
810  } // end mldim t1
811  return
812}
813
814```
815
816### Affine Relations
817
818The current MLIR spec includes affine maps and integer sets, but not
819affine relations. Affine relations are a natural way to model read and
820write access information, which can be very useful to capture the
821behavior of external library calls where no implementation is
822available, high-performance vendor libraries, or user-provided /
823user-tuned routines.
824
825An affine relation is a relation between input and output dimension identifiers
826while being symbolic on a list of symbolic identifiers and with affine
827constraints on the identifiers.
828
829Syntax:
830
831```
832// Affine relation definition at the top of file
833affine-rel-def ::= affine-rel-id `=` affine-relation-inline
834
835affine-rel-id ::= `##` prefixed-id
836
837affine-relation-inline ::=
838       `(` input-dims `)` (`[` symbols `]`)? `->`
839       `(` output-dims `)` :  affine-constraint-conjunction
840
841input-dims ::= bare-id-list
842output-dims ::= bare-id-list
843symbols ::= bare-id-list
844
845affine-rel ::= affine-rel-id | affine-relation-inline
846
847// Usage
848affine-rel-spec ::= affine-rel dim-and-symbol-use-list
849```
850
851All identifiers appearing in input-dims, output-dims, and symbol-dims are
852pairwise distinct. All affine-constraint non-terminals in the above syntax are
853allowed to contain identifiers only from input-dims, output-dims, and
854symbol-dims.
855
856Affine relations are used to model read, write, may_read, and may_write sets of
857functions in the IR. The output dimension identifiers correspond to the data
858dimensions.
859
860Example:
861
862```mlir
863// read relation: two elements ( d0 <= r0 <= d0+1 )
864##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0
865
866func @count (%A : memref<128xf32>, %pos : i32) -> f32
867  reads: {%A ##aff_rel9 (%pos)}
868  writes: /* empty */
869  may_reads: /* empty */
870  may_writes: /* empty */ {
871bb0 (%0, %1: memref<128xf32>, i64):
872  %val = affine.load %A [%pos]
873  %val = affine.load %A [%pos + 1]
874  %p = mulf %val, %val : f32
875  return %p : f32
876}
877```
878
879### Regions
880
881#### Making function definition an operation
882
883MLIR supports values of a Function type. Instead of having first-class IR
884concept for functions, one could define an operation with a body region that
885defines a function value. The particularity of functions is that their names are
886globally visible and can be referred to before being defined, unlike SSA values
887that must be defined first. Implementing a "function definition" operation would
888require to relax some of the SSA constraints in a region, and also make the IR
889Module a region as well. It would also affect the core infrastructure (e.g.,
890function passes) only for the sake of concept unification.
891
892#### Having types on a region
893
894Instead of inspecting the types of arguments of the first block, one could give
895the region itself a type. This type would be redundant with block argument
896types, which must have values and create room for type mismatches. While
897functions do have types that are partly redundant with the arguments of the
898first block in the function, this is necessary to support function declarations
899that do not have a body which we can refer to in order to obtain the argument
900types. A region is always contained in an operation or a function that can be
901queried to obtain the “type” of the region if necessary.
902
903A type on a region can be justified if Regions were to be considered separately
904from the enclosing entity (operation or function) and had their own semantics
905that should be checked.
906
907#### Attaching attributes to regions
908
909Regions could be annotated with dialect attributes to use attribute verification
910hooks. An operation could take multiple regions as arguments, and each of them
911may require different attributes. However, there are currently very few
912practical cases where this would be necessary. Instead, one could simulate
913per-region attributes with array attributes attached to the entity containing
914the region (operation or function). This decreases the overall complexity of the
915IR and enables more concise and op-specific forms, e.g., when all regions of an
916op have the same attribute that can be only mentioned once. Since the semantics
917of the region is entirely defined by the enclosing entity, it also makes sense
918to have attributes attached to that entity rather than to the region itself.
919
920This can be reconsidered in the future if we see a non-neglectable amount of use
921cases.
922
923### Read/Write/May_Read/May_Write sets for External Functions
924
925Having read, write, may_read, and may_write sets for external functions which
926include opaque ones, high-performance vendor libraries such as CuDNN, CuB, MKL,
927FFT libraries, user-provided/optimized functions, or data movement runtimes such
928as DMA ones is a powerful feature. It allows the compiler to perform analysis,
929composition/transformation in the presence of such calls and with loops around
930such calls on sub-tensors. For user-provided or custom hand-tuned functions, the
931read/write/may_read/may_write sets could be provided a-priori by a user as part
932of the external function signature or they could be part of a database.
933
934TODO: Design this, and update to use function attribute syntax.
935
936Example:
937
938```mlir
939##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1
940
941func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
942  -> f32 [
943  reads: {%M, ##rel9() }
944  writes: /* empty */
945  may_reads: /* empty */
946  may_writes: /* empty */
947]
948
949func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
950    %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32,
951    #layout_map0>) [
952  reads: {%M, ##rel9() }
953  writes: /* empty */
954  may_reads: /* empty */
955  may_writes: /* empty */
956 ]
957
958```
959
960### Memref Extensions
961
9621.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
963    dimensions where there is symmetry: use integer set (affine constraints) to
964    model tensor data space (instead of just extents). Requires some changes to
965    the IR and the in-memory form.
9661.  Layout maps
967
968    1.  Allow piece-wise affine maps for layouts: allows clean modeling of
969        boundary cases for images/tensors through padding, wrapping, mirroring,
970        padding where padded values are the results of computation as opposed to
971        data, padding in the interior as opposed to just boundaries.
972    1.  Allow many-to-one layout maps: Index and layout maps in the current
973        proposal are bijective. Extending them to many-to-one layout maps allows
974        cleaner(?) modeling of broadcast/reduce style computations while reusing
975        memory.
976
977    Proposal 2(a) requires non-trivial changes to the IR and the in-memory
978    representation. 2(b) requires no change, but impacts how cost models look at
979    index and layout maps.
980
981### `affine.if` and `affine.for` Extensions for "Escaping Scalars"
982
983We considered providing a representation for SSA values that are live out of
984`if/else` conditional bodies and loop carried in `affine.for` loops. We
985ultimately abandoned this approach due to its complexity. In the current design
986of MLIR, scalar variables cannot escape for loops or if instructions. In
987situations, where escaping is necessary, we use zero-dimensional tensors and
988memrefs instead of scalars.
989
990**TODO**: This whole section is obsolete and should be updated to use block
991arguments and a yield like terminator in for/if instructions.
992
993The abandoned design of supporting escaping scalars is as follows:
994
995#### affine.for Instruction
996
997Syntax:
998
999```
1000[<out-var-list> =]
1001for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
1002   [with <in-var-list>] { <loop-instruction-list> }
1003```
1004
1005out-var-list is a comma separated list of SSA values defined in the loop body
1006and used outside the loop body. in-var-list is a comma separated list of SSA
1007values used inside the loop body and their initializers. loop-instruction-list
1008is a list of instructions that may also include a yield instruction.
1009
1010Example:
1011
1012```mlir
1013// Return sum of elements in 1-dimensional mref A
1014func i32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
1015   %init = 0
1016   %result = affine.for %i = 0 to N with %tmp(%init) {
1017      %value = affine.load %A[%i]
1018      %sum = %value + %tmp
1019      yield %sum
1020   }
1021   return %result : i32
1022}
1023```
1024
1025#### affine.if/else Instruction
1026
1027Syntax:
1028
1029```
1030<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
1031```
1032
1033Out-var-list is a list of SSA values defined by the if-instruction. The values
1034are arguments to the yield-instruction that occurs in both then and else clauses
1035when else clause is present. When if instruction contains only if clause, the
1036escaping value defined in the then clause should be merged with the value the
1037variable had before the if instruction. The design captured here does not handle
1038this situation.
1039
1040Example:
1041
1042```mlir
1043// Compute sum of half of the array
1044func i32 @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
1045   %s0 = 0
1046   %s1 = affine.for %i = 1 ... N step 1 with %s2 (%s0) {
1047       %s3 = if (%i >= %N / 2) {
1048          %v0 = affine.load %A[%i]
1049          %s4 = %s2 + %v0
1050          yield %s4
1051       }
1052       yield %s3
1053   }
1054   return %s1 : i32
1055}
1056```
1057
1058### Multithreading the compiler
1059
1060People want compilers to go fast, and one simple way to do that is to
1061multi-thread them. There are multiple strategies for this, but a simple one is
1062to optimize and compile separate functions in parallel. LLVM's original pass
1063manager anticipated this demand, and the CallGraphSCCPass manager is even
1064designed to support this as well, but unfortunately, a few early design
1065decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO
1066are forced to split programs into separate LLVM modules/context and optimize
1067those chunks independently.
1068
1069The problem is that LLVM has several objects in its IR that are globally uniqued
1070and also mutable: notably constants like `i32 0`. In LLVM, these constants are
1071`Value`'s, which allow them to be used as operands to instructions, and that
1072they also have SSA use lists. Because these things are uniqued, every `i32 0` in
1073any function shares a use list. This means that optimizing multiple functions in
1074parallel won't work (at least without some sort of synchronization on the use
1075lists, which would be unbearably inefficient).
1076
1077MLIR now supports a multithreaded pass manager. We do this through several
1078design choices:
1079
10801.  MLIR makes use of extensive uniqued immutable data structures (affine
1081    expressions, types, etc are all immutable, uniqued, and immortal).
10822.  Constants are defined in per-function pools, instead of being globally
1083    uniqued.
10843.  Functions themselves are not SSA values either, so they don't have the same
1085    problem as constants.
10864.  FunctionPasses are copied (through their copy ctor) into one instance per
1087    thread, avoiding sharing of local state across threads.
1088
1089This allows MLIR function passes to support efficient multithreaded compilation
1090and code generation.
1091