1# Linalg Dialect Rationale: The Case For Compiler-Friendly Custom Operations
2
3[TOC]
4
5## Introduction<a name="introduction"></a>
6
7### Positioning
8
9<img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
10
11This document describes the key design principles
12that led to the existing implementation of Linalg and aims at exposing
13the tradeoffs involved when building higher-level Intermediate
14Representations (IR) and Dialects to facilitate code
15generation. Consider the simplified schema describing codegen in MLIR.
16Linalg is designed to solve the High-level Hierarchical Optimization
17(HHO box) and to interoperate nicely within a
18*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
19This work is inspired by a wealth of [prior art](#prior-art) in
20the field, from which it seeks to learn key lessons. This documentation
21and introspection effort also comes in the context of the proposal for a
22working group for discussing the [Development of high-level Tensor Compute
23Primitives dialect(s) and
24transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
25We hope that the lessons from prior art, the design principles outlined in
26this doc and the architecture of Linalg can help inform the community on a
27path to defining these High-Level Tensor Compute Primitives.
28
29### Inception
30
31Linalg started as a pragmatic dialect to bootstrap code generation in MLIR, by
32*defining away* complex code generation problems like precise dependence
33analysis or polyhedral code generation and by introducing the ability to call
34into fast library implementations when available. Linalg **defines ops and
35transformations declaratively**  and was originally restricted to ops with
36*linear-algebra like* semantics (`pointwise`, `matmul`, `conv`...). This
37approach enables building a high-level productivity-first codegen solution that
38leverages *both* compiler optimizations *and* efficient library implementations
39so as not to miss out on simple performance benefits. For example, if
40one's favorite HPC library or ISA has a `matmul` primitive running at 95% of
41the achievable peak performance, for operands stored in some memory, one should
42be able to **use the primitive** when possible *and* generate code otherwise.
43
44However, as the design of Linalg co-evolved with the design of MLIR, it became
45apparent that it could extend to larger application domains than just machine
46learning on dense tensors.
47
48The design and evolution of Linalg follow a *codegen-friendly* approach where
49the IR and the transformations evolve hand-in-hand.
50The key idea is that op semantics *declare* and transport information that is
51traditionally obtained by compiler analyses.
52This information captures the legality and applicability of transformations and
53is **not lost by lowering prematurely to loop or CFG form**. The key
54transformations are designed so as to **preserve this information** as long as
55necessary. For example, `linalg.matmul` remains `linalg.matmul` after tiling
56and fusion.
57
58Furthermore, Linalg decouples transformation validity from profitability
59considerations and voluntarily leaves the latter aside in the first iteration
60(see the [suitability for search](#suitability_for_search) guiding principle).
61
62The first incarnation of these ideas was presented as an example at the
63EuroLLVM 2019 developer's meeting as part of the
64[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
65of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).
66
67### Evolution
68Since the initial implementation, the design has evolved with, and partially
69driven the evolution of the core MLIR infrastructure to use
70[Regions](../LangRef.md/#regions),
71[OpInterfaces](../Interfaces.md),
72[ODS](../OpDefinitions.md) and
73[Declarative Rewrite Rules](../DeclarativeRewrites.md)
74among others. The approach adopted by Linalg was extended to become
75[StructuredOps abstractions](
76https://drive.google.com/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
77with Linalg becoming its incarnation on tensors and buffers.
78It is complemented by the
79[Vector dialect](../Dialects/Vector.md),
80which defines structured operations on vectors, following the same rationale and
81design principles as Linalg. (Vector dialect includes the higher-level
82operations on multi-dimensional vectors and abstracts away the lowering to
83single-dimensional vectors).
84
85The Linalg dialect itself grew beyond linear algebra-like operations to become
86more expressive, in particular by providing an abstraction of a loop nest
87supporting parallelism, reductions and sliding windows around arbitrary MLIR
88[regions](../LangRef.md/#regions). It also has the
89potential of growing beyond *dense* linear-algebra to support richer data
90types, such as sparse and ragged tensors and buffers.
91
92Linalg design remains open to evolution and cross-pollination with other
93dialects and approaches. It has been successfully used as the staging ground
94for code generation-related abstractions, spinning off the generalization of
95the following:
96- the `!linalg.view` type folded into the *"Strided MemRef"* type while
97preserving structure to allow calling into external C++ libraries with
98unsurprising ABI conventions;
99- the `linalg.view` and `linalg.subview` ops evolved into the standard dialect;
100- the `linalg.for`, `linalg.load` and `linalg.store` ops evolved into a prelude
101to the *structured control flow* dialect (named `LoopOps`).
102More components can be extracted, redesigned and generalized when new uses or
103requirements arise.
104
105Several [design questions](../Dialects/Linalg/_index.md/#open_issues) remain
106open in Linalg, which does not claim to be a general solution to all compilation
107problems. It does aim at driving thinking and implementations of domain-specific
108abstractions where programmer's intent can be captured at a very high level,
109directly in the IR.
110
111Given the evolution of the scope, it becomes apparent that a better name than
112"Linalg" could remove some of the confusions related to the dialect (and the
113underlying approach), its goals and limitations.
114
115## Prior Art
116Linalg draws inspiration from decades of prior art to design a modern a
117pragmatic solution. The following non-exhaustive list refers to some of the
118projects that influenced Linalg design:
119
120- [ONNX](https://onnx.ai/),
121- [LIFT](https://www.lift-project.org/),
122- [XLA](https://www.tensorflow.org/xla/architecture),
123- [Halide](https://halide-lang.org/) and [TVM](https://tvm.apache.org/),
124- [TACO](http://tensor-compiler.org/),
125- [Darkroom](http://darkroom-lang.org/) and [Terra](http://terralang.org/),
126- [Sigma-LL](http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/cgo16-preprint_248.pdf),
127- [Tensor Comprehensions](https://arxiv.org/abs/1802.04730),
128- [Polyhedral Compilers](https://en.wikipedia.org/wiki/Polytope_model),
129- the [Affine dialect](https://mlir.llvm.org/docs/Dialects/Affine/) in MLIR,
130- Generic Loop Transformations (see Ken Kennedy's
131[Optimizing Compilers for Modern Architectures](
132https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9))
133- Traditional compiler CFGs with SSA forms.
134
135Additionally, experience with the following tools proved very valuable when
136thinking holistically about how all these components interplay all the way
137up to the user and down to the hardware:
138
139- the [Torch](http://torch.ch/) machine-learning framework,
140- the LLVM compiler, specifically in JIT mode,
141- high-performance libraries (MKL, CUBLAS, FBFFT)
142- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler
143- current and potentially upcoming hardware ISAs.
144
145The novelty of MLIR's code base and its unprecedented support for defining and
146mixing abstractions, enabling one to reflect on and integrate the key elements
147of the prior art success as well as avoid the common pitfalls in the area of
148code generation. Thus, instead of diverging into a discussion about the
149implications of adopting any of the existing solutions, Linalg had the
150possibility to build on all of them and learn from their experience while
151leveraging the benefit of hindsight.
152
153The following reflections on prior art have influenced the design of Linalg.
154The discussion is by no means exhaustive but should capture the key motivations
155behind Linalg.
156
157### Lessons from ONNX<a name="lessonsonnx"></a>
158ONNX is a specification of operations that appear in Machine Learning
159workloads. As such, it is predominantly driven by the expressiveness requirements
160of ML, and less by the considerations of IR design for HPC code generation.
161
162Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
163But it also considers *transformations on these ops* as a key component and
164defines the IR to support the transformations, preferring transformations over
165expressiveness if necessary.
166
167Linalg hopes to additionally address the following:
168- facilitate frontend-compiler co-design by taking into account compiler
169  transformations and lowerings in op definition;
170- minimize the set of available ops by making them non-overlapping with each
171  other, thus simplifying the intermediate representation.
172
173### Lessons from LIFT<a name="lessonslift"></a>
174[LIFT](https://www.lift-project.org/) is a system to write computational
175kernels based on functional abstractions. Transformations are
176represented by additional nodes in the IR, whose semantics are at the
177level of the algorithm (e.g. `partialReduce`).
178LIFT applies and composes transformations by using [local rewrite
179rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
180embed these additional nodes directly in the functional abstraction.
181
182Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
183[Declarative Rewrite Rules](../DeclarativeRewrites.md)
184mechanisms.
185
186Linalg builds on, and helps separate concerns in the LIFT approach as follows:
187- transformations are either separated from the representation or expressed as
188  composable attributes that are independent of the actual computation,
189  avoiding intricate effects on performance;
190- abstractions are split into smaller components (e.g., control flow and data
191  structure abstractions) potentially reusable across different dialects in the
192  MLIR's open ecosystem.
193
194LIFT is expected to further influence the design of Linalg as it evolves. In
195particular, extending the data structure abstractions to support non-dense
196tensors can use the experience of LIFT abstractions for
197[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
198and [position-dependent
199arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
200
201### Lessons from XLA<a name="lessonsxla"></a>
202[XLA](https://www.tensorflow.org/xla/architecture) is one of the first
203post-Theano ML compilers that was introduced as a pragmatic compilation
204solution for TensorFlow. It shines on Google's xPU
205hardware and is an important piece of the puzzle. It is particularly good at
206(1) transforming code back and forth between the scalar and the vector
207worlds, (2) passing function boundaries for handling both host and device
208code, and (3) complying to stringent requirements imposed by energy-efficient
209xPUs.
210XLA followed a pragmatic design process where the compiler is given perfect
211knowledge of each op's semantic, all starting from the mighty `conv` and
212`matmul` ops. XLA transformations consist of writing emitters that compose, as C++
213functions. Perfect op semantics knowledge has 2 big benefits: (1) transformations are
214correct by construction (2) very strong performance on difficult xPU targets.
215
216Similarly, Linalg ops *"know their semantics"* and *"know how to transform and
217lower themselves"*. The means by which this information is made available and
218how it is used in MLIR are, however, very different.
219
220Linalg hopes to additionally address the following:
221- HLOs are expressive as a whole, but each op has very limited and fixed
222semantics: ops are not configurable. As a consequence, HLOs have evolved into
223a too large set of ops whose semantics intersect.
224This echoes the ops proliferation problem also exhibited by ONNX.
225- Reliance on perfect op knowledge leads to situations where transformations and
226ops end up needing to know about each other's semantics (e.g. during fusion).
227Since the transformations themselves are not simple local rewrite patterns
228(unlike LIFT), code complexity grows quickly.
229- XLA lacks an independent IR that can be inspected, unit tested and used
230independently. This monolithic design makes the system not portable: xPU passes
231and GPU passes do not share much code.
232
233### Lessons from Halide and TVM<a name="lessonshalide"></a>
234[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
235way of metaprogramming the HalideIR and applying transformations declaratively
236to let the expert user transform and optimize the program in tailored ways.
237Halide, initially targeted the SIGGRAPH community but is now more generally
238applicable. [TVM](https://tvm.apache.org/) is an evolution of Halide into the
239machine learning and deep-neural network space, based on HalideIR.
240
241The Halide transformation methodology follows similar principles to the
242[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
243and
244[CHiLL](https://pdfs.semanticscholar.org/6a46/20589f63f3385707d2d590f7b7dc8ee4d74f.pdf)
245compiler transformation frameworks, but without the strengths (and especially
246complexity) of the polyhedral model.
247
248Halide particularly shines at making the HPC transformation methodology
249accessible to $\Omega$(10-100) users, at a time when polyhedral tools are
250still only accessible to $\Omega$(1-10) users. Halide makes heavy usage of
251canonicalization rules that are also very prevalent in MLIR.
252
253Linalg hopes to additionally address the following:
254- Halide scheduling is powerful and explores a large swath of possible
255transformations. But it's still too hard for newcomers to use or extend. The
256level of performance you get from Halide is very different depending on
257whether one is a seasoned veteran or a newcomer. This is especially true as
258the number of transformations grows.
259- Halide raises rather than lowers in two ways, going counter-current to the
260design goals we set for high-level codegen abstractions in MLIR. First,
261canonical Halide front-end code uses explicit indexing and math on scalar
262values, so to target BLAS/DNN libraries one needs to add pattern matching
263which is similarly brittle as in the affine case. While Halide's performance
264is on par with the libraries on programmable targets (CPU/GPU), that
265approach doesn't work on mobile accelerators or on xPUs, where the framework
266ingests whole-tensor operations.
267Second, reductions and scans are expressed using serial iteration, again
268requiring pattern matching before they can be transformed (e.g. to do a
269reduction using atomics, or hierarchically). The lesson to draw is that we
270should start with higher-level primitives than Halide.
271
272### Lessons from Tensor Comprehensions<a name="lessonstc"></a>
273[Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a
274high-level language to express tensor computations with a syntax
275generalizing the Einstein notation, coupled to an end-to-end
276compilation flow capable of lowering to efficient GPU code. It was
277integrated with 2 ML frameworks: Caffe2 and PyTorch.
278
279<img width="600" alt="MLIR Codegen Flow"
280src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">
281
282The compilation flow combines [Halide](#lessonshalide) and a Polyhedral Compiler
283derived from [ISL](https://en.wikipedia.org/wiki/Integer_set_library)
284and uses both HalideIR and the ISL *schedule-tree* IR.
285The compiler provides a collection of polyhedral compilation
286algorithms to perform fusion and favor multi-level parallelism and
287promotion to deeper levels of the memory hierarchy.
288Tensor Comprehensions showed that, fixing a few predefined strategies
289with parametric transformations and tuning knobs, can already provide
290great results. In that previous work, simple
291genetic search combined with an autotuning framework was sufficient
292to find good implementations in the ***non-compute bound regime***.
293This requires code versions obtainable by the
294various transformations to encompass versions that get close to the
295roofline limit.
296The ultimate goal of Tensor Comprehensions was to concretely mix
297Halide high-level transformations with polyhedral mid-level
298transformations and build a pragmatic system that could take advantage
299of both styles of compilation.
300
301Linalg hopes to additionally address the following:
302- Halide was never properly used in Tensor Comprehensions beyond shape
303inference. Most of the investment went into simplifying polyhedral
304transformations and building a usable end-to-end system. MLIR was
305deemed a better infrastructure to mix these types of compilation.
306- The early gains provided by reusing established infrastructures
307(HalideIR and ISL schedule trees) turned into more impedance mismatch
308problems than could be solved with a small tactical investment.
309- Tensor Comprehensions emitted CUDA code which was then JIT compiled
310with NVCC from a textual representation. While this was a pragmatic
311short-term solution it made it hard to perform low-level rewrites that
312would have helped with register reuse in the ***compute-bound regime***.
313- The same reliance on emitting CUDA code made it difficult to
314create cost models when time came. This made it artificially harder to
315prune out bad solutions than necessary. This resulted in excessive
316runtime evaluation, as reported in the paper [Machine Learning Systems
317are Stuck in a Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).
318
319Many of those issues are naturally addressed by implementing these ideas
320in the MLIR infrastructure.
321
322### Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
323The polyhedral model has been on the cutting edge of loop-level optimization for
324decades, with several incarnations in production compilers such as
325[GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
326[Polly](https://polly.llvm.org) for LLVM. Although it has proved crucial to
327generate efficient code from domain-specific languages such as
328[PolyMage](http://mcl.csa.iisc.ac.in/polymage.html) and [Tensor
329Comprehensions](https://dl.acm.org/doi/abs/10.1145/3355606), it has never been
330fully included into mainstream general-purpose optimization pipelines. Detailed
331analysis of the role of polyhedral transformations is provided in the
332[simplified polyhedral
333form](RationaleSimplifiedPolyhedralForm.md) document
334dating back to the inception of MLIR.
335
336In particular, polyhedral abstractions have proved challenging to integrate with
337a more conventional compiler due to the following.
338- The transformed code (or IR) quickly gets complex and thus hard to analyze and
339  understand.
340- Code generation from the mathematical form used in the polyhedral model relies
341  on non-trivial exponentially complex algorithms.
342- The mathematical form is rarely composable with the SSA representation and
343  related algorithms, on which most mainstream compilers are built today.
344- Expressiveness limitations, although addressed in the scientific literature
345  through, e.g., summary functions, often remain present in actual
346  implementations.
347
348The Affine dialect in MLIR was specifically designed to address the integration
349problems mention above. In particular, it maintains the IR in the same form
350(loops with additional constraints on how the bounds are expressed) throughout
351the transformation, decreasing the need for one-shot conversion between
352drastically different representations. It also embeds the polyhedral
353representation into the SSA form by using MLIR regions and thus allows one to
354combine polyhedral and SSA-based transformations.
355
356### Lessons from the Affine dialect<a name="lessonsaffine"></a>
357The Affine dialect in MLIR brings the polyhedral abstraction closer to the
358conventional SSA representation. It addresses several long-standing integration
359challenges as described above and is likely to be more suitable when compiling
360from a C language-level abstraction.
361
362MLIR makes it possible to start from a higher-level abstraction than C, for
363example in machine learning workloads. In such cases, it may be possible to
364avoid complex analyses (data-flow analysis across loop iterations is
365exponentially complex) required for polyhedral transformation by leveraging the
366information available at higher levels of abstractions, similarly to DSL
367compilers. Linalg intends to use this information when available and ensure
368*legality of transformations by construction*, by integrating legality
369preconditions in the op semantics (for example, loop tiling can be applied to
370the loop nest computing a matrix multiplication, no need to additionally rely on
371affine dependence analysis to check this). This information is not readily
372available in the Affine dialect, and can only be derived using potentially
373expensive pattern-matching algorithms.
374
375Informed by the practical experience in polyhedral compilation and with the
376Affine dialects in particular, Linalg takes the following decisions.
377- **Discourage loop skewing**: the loop skewing transformation, that is
378  sometimes used to enable parallelization, often has surprising (negative)
379  effects on performance. In particular, polyhedral auto-transformation can be
380  expressed in a simpler way without loop skewing; skewing often leads to
381  complex control flow hampering performance on accelerators such as GPUs.
382  Moreover, the problems loop skewing addresses can be better addressed by other
383  approaches, e.g., diamond tiling. In the more restricted case of ML workloads,
384  multi-for loops with induction variables independent of each other (referred
385  to as hyper-rectangular iteration domains in the literature) such as the
386  proposed
387  [affine.parallel]((https://llvm.discourse.group/t/rfc-add-affine-parallel/350)
388  are sufficient in the majority of cases.
389- **Declarative Tiling**: the *tiling* transformation is ubiquitous in HPC code
390  generation. It can be seen as a decomposition of either the iteration space or
391  the data space into smaller regular parts, referred to as tiles. Polyhedral
392  approaches, including the Affine dialect, mostly opt for iteration space
393  tiling, which introduces additional control flow and complex address
394  expressions. If the tile sizes are not known during the transformation (so
395  called parametric tiling), the address expressions and conditions quickly
396  become non-affine or require exponentially complex algorithms to reason about
397  them. Linalg focuses tiling on the data space instead, creating views into the
398  buffers that leverage MLIR's strided `memref` abstraction. These views compose
399  and the complexity of access expressions remains predictable.
400- **Preserve high-level information**: Linalg maintains the information provided
401  by the op semantics as long as necessary for transformations. For example, the
402  result of tiling a matrix multiplication is loops around a smaller matrix
403  multiplication. Even with pattern-matching on top of the Affine dialect, this
404  would have required another step of pattern-matching after the transformation.
405
406Given these choices, Linalg intends to be a better fit for **high-level
407compilation** were significantly more information is readily available in the
408input representation and should be leveraged before lowering to other
409abstractions. Affine remains a strong abstraction for mid-level transformation
410and is used as a lowering target for Linalg, enabling further transformations
411and combination of semantically-loaded and lower-level inputs. As such, Linalg
412is intended to complement Affine rather than replace it.
413
414## Core Guiding Principles<a name="guiding_principles"></a>
415
416### Transformations and Simplicity First<a name="transformations_first"></a>
417The purpose of the Linalg IR and its operations is primarily to:
418- develop a set of key transformations, and
419- make them correct by construction by carefully curating the set of
420generic operation properties that drive applicability, and
421- make them very simple to implement, apply, verify and especially
422maintain.
423
424The problem at hand is fundamentally driven by compilation of domain-specific
425workloads for high-performance and parallel hardware architectures: **this is
426an HPC compilation problem**.
427
428The selection of relevant transformations follows a co-design approach and
429involves considerations related to:
430- concrete current and future needs of the application domain,
431- concrete current and future hardware properties and ISAs,
432- understanding of strengths and limitations of [existing approaches](#prior-art),
433- taking advantage of the coexistence of multiple levels of IR in MLIR,
434
435One needs to be methodical to avoid proliferation and redundancy. A given
436transformation could exist at multiple levels of abstraction but **just
437because one can write transformation X at level Y absolutely does not mean
438one should**. This is where evaluation of existing
439systems and acknowledgement of their strengths and weaknesses is crucial:
440simplicity and maintainability aspects must be first-order concerns. Without
441this additional effort of introspection, a design will not stand the test of
442time. At the same time, complexity is very hard to ward off. It seems one needs
443to suffer complexity to be prompted to take a step back and rethink
444abstractions.
445
446This is not merely a reimplementation of idea X in system Y: simplicity
447**must be the outcome** of this introspection effort.
448
449### Preservation of Information<a name="information_preservation"></a>
450The last two decades have seen a proliferation of Domain-Specific Languages
451(DSLs) that have been very successful at limited application domains.
452The main commonality between these systems is their use of a significantly
453richer structural information than CFGs or loops.
454Still, another commonality of existing systems is to lower to LLVM very quickly,
455and cross a wide abstraction gap in a single step. This process often drops
456semantic information that later needs to be reconstructed later,
457when it is not irremediably lost.
458
459These remarks, coupled with MLIR's suitability for defining IR at multiple
460levels of abstraction led to the following 2 principles.
461
462#### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
463
464Compiler transformations need static structural information (e.g. loop-nests,
465graphs of basic blocks, pure functions, etc). When that structural information
466is lost, it needs to be reconstructed.
467
468A good illustration of this phenomenon is the notion of *raising* in polyhedral
469compilers: multiple polyhedral tools start by raising from a simplified C
470form or from SSA IR into a higher-level representation that is more amenable
471to loop transformations.
472
473In advanced polyhedral compilers, a second type of raising
474may typically exist to detect particular patterns (often variations of
475BLAS). Such patterns may be broken by transformations making their detection
476very fragile or even just impossible (incorrect).
477
478MLIR makes it easy to define op semantics declaratively thanks to the use of
479regions and attributes. This is an ideal opportunity to define new abstractions
480to convey user-intent directly into the proper abstraction.
481
482#### Progressive Lowering: Don't Lose Information too Quickly<a name="#progressive_lowering"></a>
483
484Lowering too quickly to affine, generic loops or CFG form reduces the
485amount of structure available to derive transformations from. While
486manipulating loops is a net gain compared to CFG form for a certain class of
487transformations, important information is still lost (e.g. parallel loops, or
488mapping of a loop nest to an external implementation).
489
490This creates non-trivial phase ordering issues. For instance, loop fusion may
491easily destroy the ability to detect a BLAS pattern. One possible alternative
492is to perform loop fusion, tiling, intra-tile loop distribution and then hope to
493detect the BLAS pattern. Such a scheme presents difficult phase-ordering
494constraints that will likely interfere with other decisions and passes.
495Instead, certain Linalg ops are designed to maintain high-level information
496across transformations such as tiling and fusion.
497
498MLIR is designed as an infrastructure for ***progressive lowering***.
499Linalg fully embraces this notion and thinks of codegen in terms of
500*reducing a potential function*. That potential function is loosely
501defined in terms of number of low-level instructions in a particular
502Linalg ops (i.e. how heavy or lightweight the Linalg op is).
503Linalg-based codegen and transformations start from higher-level IR
504ops and dialects. Then each transformation application reduces the
505potential by introducing lower-level IR ops and *smaller* Linalg ops.
506This gradually reduces the potential, all the way to Loops + VectorOps
507and LLVMIR.
508
509### Composable and Declarative Transformations<a name="declarative_transformations"></a>
510Complex and impactful transformations need not be hard to manipulate, write or
511maintain. Mixing XLA-style high-level op semantics knowledge with generic
512properties to describe these semantics, directly in MLIR, is a promising way to:
513- Design transformations that are correct by construction, easy to
514write, easy to verify and easy to maintain.
515- Provide a way to specify transformations and the units of IR they manipulate
516declaratively. In turn this allows using local pattern rewrite rules in MLIR
517(i.e. [DRR](../DeclarativeRewrites.md)).
518- Allow creating customizable passes declaratively by simply selecting rewrite
519rules. This allows mixing transformations, canonicalizations, constant folding
520and other enabling rewrites in a single pass. The result is a system where pass
521fusion is very simple to obtain and gives hope for solving certain
522[phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).
523
524### Suitability for Search and Machine Learning<a name="ml"></a>
525Compiler heuristics are hand-crafted human-engineered features: it is
526ripe for disruption by machine-learning  techniques.
527To enable search, compiler transformations should be fine-grained,
528[composable](#declarative_transformations) and expose tuning parameters that
529can modify their behavior, guided by lessons from previous experience
530with [Tensor Comprehensions](#lessonstc).
531
532Of course, we are not advocating for using ML everywhere in the stack
533immediately: low-level compilation and machine models are still quite performant
534in LLVM. However, for the high-level and mid-level optimization problems,
535models need to be conditioned (probabilistically) on the low-level
536compiler which acts as a blackbox. For these reasons we prioritize the
537design of IR and transformations with search-friendly properties over
538building cost models.
539Still, this  does not mean Linalg refuses cost models: instead we
540prefer to invest in infrastructure that will enable [ML-based
541techniques to automatically build cost
542models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
543
544### Extensibility and Future-Proofness<a name="future"></a>
545MLIR allows defining IR for structured control flow and structured
546data types. We choose to take advantage of these properties for the
547reasons described above.
548In particular, the `MemRefType` represents dense non-contiguous memory regions.
549This structure should extend beyond simple dense data types and generalize to
550ragged, sparse and mixed dens/sparse tensors as well as to trees, hash tables,
551tables of records and maybe even graphs.
552
553For such more advanced data types, the control-flow required to traverse the
554data structures, termination conditions, etc are much less simple to analyze and
555characterize statically. As a consequence we need to also design solutions that
556stand a chance of evolving into runtime-adaptive computations (e.g.
557inspector-executor in which an *inspector* runs a cheap runtime
558analysis on the data to configure how the *executor* should run).
559While there is no concrete solution
560today to solve these problems in MLIR, it is pretty clear that perfect
561static knowledge and analyses will not be serious contenders for these problems.
562
563## Key Observations<a name="keyobservation"></a>
564The following key observations have influenced the design of Linalg and helped
565reconcile [core guiding principles](#guiding_principles) with real-world
566requirements when producing an implementation based on MLIR.
567
568### Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
569
570This is a twist on Niklaus Wirth's formulation but captures the essence of the
571design of Linalg: control-flow does not exist in a vacuum, independently of
572data. On the contrary, there is a very strong relationship between control-flow
573and data structures: one cannot exist without the other. This has multiple
574implications on the
575[semantics of Linalg Ops](../Dialects/Linalg/_index.md/#linalg_ops) and their
576transformations. In particular, this observation influences whether certain
577transformations are better done: - as control flow or data structure
578manipulation, - on Linalg ops attributes or on loops after some partial lowering
579occurred, - as extensions to the Linalg dialect in terms of new ops or
580attributes.
581
582### The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
583This is probably the most surprising and counter-intuitive
584observation. When one designs IR for transformations, closed-ness is
585often a non-negotiable property.
586This is a key design principle of polyhedral IRs such as
587[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
588and
589[ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
590they are closed under affine transformations.
591In MLIR, multiple dialects coexist and form a coherent whole. After
592experimenting with different alternatives, it became clear that strict
593dialect closed-ness wasn't necessary and could be relaxed. Previous
594systems did not have simple and principled means of building new IR
595and probably suffered from this limitation. We conjecture this is a
596key reason they required the IR to be closed under transformations.
597
598Despite the fact that Linalg ops only allow perfectly nested
599semantics, once tiling and fusion kick in, imperfectly nested loops
600are gradually introduced.
601In other words, imperfectly nested control flow appears as ***the result of
602applying key transformations***.
603
604Considering the *potential* described during the discussion on
605[Progressive Lowering](#progressive_lowering), closed-ness under
606transformation would dictate that the potential remains constant.
607In contrast, Linalg advocates for ***monotonicity*** under
608transformations.
609
610### Summary of Existing Alternatives a Picture<a name="observationssummary"></a>
611Lastly, we summarize our observations of lessons from [Prior
612Art](#prior-art)---when viewed under the lense of our [Core Guiding
613Principles](#guiding_principles)---with the following picture.
614
615<img width="1200" alt="MLIR Codegen Flow"
616src="https://user-images.githubusercontent.com/10148468/73613904-2f720a00-45c8-11ea-8265-1c856c02525b.png">
617
618This figure is not meant to be perfectly accurate but a rough map of how we view
619the distribution of structural information in existing systems, from a
620codegen-friendly angle. Unsurprisingly, the
621[Linalg Dialect](../Dialects/Linalg/_index.md) and its future evolutions aspire
622to a position in the top-right of this map.
623