# Linalg Dialect Rationale: The Case For Compiler-Friendly Custom Operations

[TOC]

## Introduction<a name="introduction"></a>

### Positioning

<img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">

This document describes the key design principles that led to the existing
implementation of Linalg and aims at exposing the tradeoffs involved when
building higher-level Intermediate Representations (IR) and Dialects to
facilitate code generation. Consider the simplified schema describing codegen
in MLIR. Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) and to interoperate nicely within a *Mixture Of Expert Compilers*
environment (i.e. the *CGSel* box). This work is inspired by a wealth of
[prior art](#prior_art) in the field, from which it seeks to learn key
lessons. This documentation and introspection effort also comes in the
context of the proposal for a working group for discussing the
[Development of high-level Tensor Compute Primitives dialect(s) and
transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
We hope that the lessons from prior art, the design principles outlined in
this doc and the architecture of Linalg can help inform the community on a
path to defining these High-Level Tensor Compute Primitives.

### Inception

Linalg started as a pragmatic dialect to bootstrap code generation in MLIR,
by *defining away* complex code generation problems like precise dependence
analysis or polyhedral code generation and by introducing the ability to call
into fast library implementations when available. Linalg **defines ops and
transformations declaratively** and was originally restricted to ops with
*linear-algebra like* semantics (`pointwise`, `matmul`, `conv`...). This
approach enables building a high-level productivity-first codegen solution
that leverages *both* compiler optimizations *and* efficient library
implementations so as not to miss out on simple performance benefits. For
example, if one's favorite HPC library or ISA has a `matmul` primitive
running at 95% of the achievable peak performance, for operands stored in
some memory, one should be able to **use the primitive** when possible *and*
generate code otherwise.

However, as the design of Linalg co-evolved with the design of MLIR, it
became apparent that it could extend to larger application domains than just
machine learning on dense tensors.

The design and evolution of Linalg follow a *codegen-friendly* approach where
the IR and the transformations evolve hand-in-hand.
The key idea is that op semantics *declare* and transport information that is
traditionally obtained by compiler analyses.
This information captures the legality and applicability of transformations
and is **not lost by lowering prematurely to loop or CFG form**. The key
transformations are designed so as to **preserve this information** as long
as necessary. For example, `linalg.matmul` remains `linalg.matmul` after
tiling and fusion.

Furthermore, Linalg decouples transformation validity from profitability
considerations and voluntarily leaves the latter aside in the first iteration
(see the [suitability for search](#ml) guiding principle).
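As a concrete illustration of this preservation property, the sketch below
shows, in present-day MLIR syntax, the kind of IR that tiling a
`linalg.matmul` by 4 produces. The function name, the tile size and the
assumption that problem sizes divide the tile size evenly are illustrative
only; the actual transformation also handles partial tiles.

```mlir
func.func @matmul_tiled(%A: memref<?x?xf32>, %B: memref<?x?xf32>,
                        %C: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %M = memref.dim %C, %c0 : memref<?x?xf32>
  %N = memref.dim %C, %c1 : memref<?x?xf32>
  %K = memref.dim %A, %c1 : memref<?x?xf32>
  // Tiling introduces loops over tiles, but the loop body remains a
  // *smaller* linalg.matmul on subviews rather than scalar loads and stores.
  scf.for %i = %c0 to %M step %c4 {
    scf.for %j = %c0 to %N step %c4 {
      scf.for %k = %c0 to %K step %c4 {
        %sA = memref.subview %A[%i, %k] [4, 4] [1, 1]
          : memref<?x?xf32> to memref<4x4xf32, strided<[?, 1], offset: ?>>
        %sB = memref.subview %B[%k, %j] [4, 4] [1, 1]
          : memref<?x?xf32> to memref<4x4xf32, strided<[?, 1], offset: ?>>
        %sC = memref.subview %C[%i, %j] [4, 4] [1, 1]
          : memref<?x?xf32> to memref<4x4xf32, strided<[?, 1], offset: ?>>
        linalg.matmul
          ins(%sA, %sB : memref<4x4xf32, strided<[?, 1], offset: ?>>,
                         memref<4x4xf32, strided<[?, 1], offset: ?>>)
          outs(%sC : memref<4x4xf32, strided<[?, 1], offset: ?>>)
      }
    }
  }
  return
}
```

Because the body is still `linalg.matmul`, subsequent rewrites (e.g. mapping
the tile computation to a library call, or vectorizing it) keep operating on
op semantics instead of recovering patterns from loops.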
The first incarnation of these ideas was presented as an example at the
EuroLLVM 2019 developer's meeting as part of the
[Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).

### Evolution

Since the initial implementation, the design has evolved with, and partially
driven, the evolution of the core MLIR infrastructure to use
[Regions](https://mlir.llvm.org/docs/LangRef/#regions),
[OpInterfaces](https://mlir.llvm.org/docs/Interfaces/),
[ODS](https://mlir.llvm.org/docs/OpDefinitions/) and
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
among others. The approach adopted by Linalg was extended to become
[StructuredOps abstractions](https://drive.google.com/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
with Linalg becoming its incarnation on tensors and buffers.
It is complemented by the
[Vector dialect](https://mlir.llvm.org/docs/Dialects/Vector/),
which defines structured operations on vectors, following the same rationale
and design principles as Linalg. (The Vector dialect provides the
higher-level operations on multi-dimensional vectors and abstracts away the
lowering to single-dimensional vectors.)

The Linalg dialect itself grew beyond linear algebra-like operations to
become more expressive, in particular by providing an abstraction of a loop
nest supporting parallelism, reductions and sliding windows around arbitrary
MLIR [regions](https://mlir.llvm.org/docs/LangRef/#regions). It also has the
potential of growing beyond *dense* linear-algebra to support richer data
types, such as sparse and ragged tensors and buffers.

The Linalg design remains open to evolution and cross-pollination with other
dialects and approaches. It has been successfully used as the staging ground
for code generation-related abstractions, spinning off the generalization of
the following:
- the `!linalg.view` type folded into the *"Strided MemRef"* type while
  preserving structure to allow calling into external C++ libraries with
  unsurprising ABI conventions;
- the `linalg.view` and `linalg.subview` ops evolved into the standard
  dialect;
- the `linalg.for`, `linalg.load` and `linalg.store` ops evolved into a
  prelude to the *structured control flow* dialect (named `LoopOps`).

More components can be extracted, redesigned and generalized when new uses
or requirements arise.

Several [design questions](#open_issues) remain open in Linalg, which does
not claim to be a general solution to all compilation problems.
It does aim at driving thinking and implementations of domain-specific
abstractions where programmer's intent can be captured at a very high level,
directly in the IR.

Given the evolution of the scope, it becomes apparent that a better name than
"Linalg" could remove some of the confusion around the dialect (and the
underlying approach), its goals and limitations.

## Prior Art<a name="prior_art"></a>

Linalg draws inspiration from decades of prior art to design a modern,
pragmatic solution.
The following non-exhaustive list refers to some of the projects that
influenced the design of Linalg:

- [ONNX](https://onnx.ai/),
- [LIFT](https://www.lift-project.org/),
- [XLA](https://www.tensorflow.org/xla/architecture),
- [Halide](https://halide-lang.org/) and [TVM](https://tvm.apache.org/),
- [TACO](http://tensor-compiler.org/),
- [Darkroom](http://darkroom-lang.org/) and [Terra](http://terralang.org/),
- [Sigma-LL](http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/cgo16-preprint_248.pdf),
- [Tensor Comprehensions](https://arxiv.org/abs/1802.04730),
- [Polyhedral Compilers](https://en.wikipedia.org/wiki/Polytope_model),
- the [Affine dialect](https://mlir.llvm.org/docs/Dialects/Affine/) in MLIR,
- Generic Loop Transformations (see Ken Kennedy's
[Optimizing Compilers for Modern Architectures](https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9)),
- Traditional compiler CFGs with SSA forms.

Additionally, experience with the following tools proved very valuable when
thinking holistically about how all these components interplay all the way
up to the user and down to the hardware:

- the [Torch](http://torch.ch/) machine-learning framework,
- the LLVM compiler, specifically in JIT mode,
- high-performance libraries (MKL, CUBLAS, FBFFT),
- the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler,
- current and potentially upcoming hardware ISAs.

The novelty of MLIR's code base and its unprecedented support for defining
and mixing abstractions enable one to reflect on and integrate the key
elements of prior art successes, as well as to avoid the common pitfalls in
the area of code generation. Thus, instead of diverging into a discussion
about the implications of adopting any of the existing solutions, Linalg had
the possibility to build on all of them and learn from their experience while
leveraging the benefit of hindsight.

The following reflections on prior art have influenced the design of Linalg.
The discussion is by no means exhaustive but should capture the key
motivations behind Linalg.

### Lessons from ONNX<a name="lessonsonnx"></a>

ONNX is a specification of operations that appear in Machine Learning
workloads. As such, it is predominantly driven by the expressiveness
requirements of ML, and less by the considerations of IR design for HPC code
generation.

Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
But it also considers *transformations on these ops* as a key component and
defines the IR to support the transformations, preferring transformations
over expressiveness if necessary.

Linalg hopes to additionally address the following:
- facilitate frontend-compiler co-design by taking into account compiler
  transformations and lowerings in op definition;
- minimize the set of available ops by making them non-overlapping with each
  other, thus simplifying the intermediate representation.

### Lessons from LIFT<a name="lessonslift"></a>

[LIFT](https://www.lift-project.org/) is a system to write computational
kernels based on functional abstractions. Transformations are represented by
additional nodes in the IR, whose semantics are at the level of the algorithm
(e.g. `partialReduce`).
LIFT applies and composes transformations by using [local rewrite
rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
embed these additional nodes directly in the functional abstraction.

Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
[Declarative Rewrite Rules](https://mlir.llvm.org/docs/DeclarativeRewrites/)
mechanism.

Linalg builds on, and helps separate concerns in, the LIFT approach as
follows:
- transformations are either separated from the representation or expressed
  as composable attributes that are independent of the actual computation,
  avoiding intricate effects on performance;
- abstractions are split into smaller components (e.g., control flow and data
  structure abstractions) potentially reusable across different dialects in
  MLIR's open ecosystem.

LIFT is expected to further influence the design of Linalg as it evolves. In
particular, extending the data structure abstractions to support non-dense
tensors can use the experience of LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).

### Lessons from XLA<a name="lessonsxla"></a>

[XLA](https://www.tensorflow.org/xla/architecture) is one of the first
post-Theano ML compilers, introduced as a pragmatic compilation solution for
TensorFlow. It shines on Google's xPU hardware and is an important piece of
the puzzle. It is particularly good at (1) transforming code back and forth
between the scalar and the vector worlds, (2) passing function boundaries for
handling both host and device code, and (3) complying with stringent
requirements imposed by energy-efficient xPUs.
XLA followed a pragmatic design process where the compiler is given perfect
knowledge of each op's semantics, all starting from the mighty `conv` and
`matmul` ops. XLA transformations consist of writing emitters that compose as
C++ functions. Perfect knowledge of op semantics has two big benefits: (1)
transformations are correct by construction, and (2) very strong performance
on difficult xPU targets.

Similarly, Linalg ops *"know their semantics"* and *"know how to transform
and lower themselves"*. The means by which this information is made available
and how it is used in MLIR are, however, very different.

Linalg hopes to additionally address the following:
- HLOs are expressive as a whole, but each op has very limited and fixed
  semantics: ops are not configurable. As a consequence, HLOs have evolved
  into a too large set of ops whose semantics intersect. This echoes the op
  proliferation problem also exhibited by ONNX.
- Reliance on perfect op knowledge leads to situations where transformations
  and ops end up needing to know about each other's semantics (e.g. during
  fusion). Since the transformations themselves are not simple local rewrite
  patterns (unlike LIFT), code complexity grows quickly.
- XLA lacks an independent IR that can be inspected, unit tested and used
  independently. This monolithic design hurts portability: xPU passes and GPU
  passes do not share much code.
### Lessons from Halide and TVM<a name="lessonshalide"></a>

[Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
way of metaprogramming the HalideIR and applying transformations
declaratively to let the expert user transform and optimize the program in
tailored ways. Halide initially targeted the SIGGRAPH community but is now
more generally applicable. [TVM](https://tvm.apache.org/) is an evolution of
Halide into the machine learning and deep-neural network space, based on
HalideIR.

The Halide transformation methodology follows similar principles to the
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
and
[CHiLL](https://pdfs.semanticscholar.org/6a46/20589f63f3385707d2d590f7b7dc8ee4d74f.pdf)
compiler transformation frameworks, but without the strengths (and especially
complexity) of the polyhedral model.

Halide particularly shines at making the HPC transformation methodology
accessible to $\Omega$(10-100) users, at a time when polyhedral tools are
still only accessible to $\Omega$(1-10) users. Halide makes heavy usage of
canonicalization rules that are also very prevalent in MLIR.

Linalg hopes to additionally address the following:
- Halide scheduling is powerful and explores a large swath of possible
  transformations, but it is still too hard for newcomers to use or extend.
  The level of performance one gets from Halide is very different depending
  on whether one is a seasoned veteran or a newcomer, and this gap widens as
  the number of transformations grows.
- Halide raises rather than lowers in two ways, going counter-current to the
  design goals we set for high-level codegen abstractions in MLIR. First,
  canonical Halide front-end code uses explicit indexing and math on scalar
  values, so targeting BLAS/DNN libraries requires pattern matching that is
  as brittle as in the affine case. While Halide's performance is on par with
  the libraries on programmable targets (CPU/GPU), that approach does not
  work on mobile accelerators or on xPUs, where the framework ingests
  whole-tensor operations. Second, reductions and scans are expressed using
  serial iteration, again requiring pattern matching before they can be
  transformed (e.g. to perform a reduction using atomics, or hierarchically).
  The lesson to draw is that we should start with higher-level primitives
  than Halide.

### Lessons from Tensor Comprehensions<a name="lessonstc"></a>

[Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a high-level
language to express tensor computations with a syntax generalizing the
Einstein notation, coupled to an end-to-end compilation flow capable of
lowering to efficient GPU code. It was integrated with two ML frameworks:
Caffe2 and PyTorch.

<img width="600" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">

The compilation flow combines [Halide](#lessonshalide) and a Polyhedral
Compiler derived from
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) and uses both
HalideIR and the ISL *schedule-tree* IR.
The compiler provides a collection of polyhedral compilation algorithms to
perform fusion and favor multi-level parallelism and promotion to deeper
levels of the memory hierarchy.
Tensor Comprehensions showed that fixing a few predefined strategies with
parametric transformations and tuning knobs can already provide great
results. In that previous work, simple genetic search combined with an
autotuning framework was sufficient to find good implementations in the
***non-compute bound regime***. This requires the code versions obtainable
by the various transformations to encompass versions that get close to the
roofline limit.
The ultimate goal of Tensor Comprehensions was to concretely mix Halide
high-level transformations with polyhedral mid-level transformations and
build a pragmatic system that could take advantage of both styles of
compilation.

Linalg hopes to additionally address the following:
- Halide was never properly used in Tensor Comprehensions beyond shape
  inference. Most of the investment went into simplifying polyhedral
  transformations and building a usable end-to-end system. MLIR was deemed a
  better infrastructure to mix these types of compilation.
- The early gains provided by reusing established infrastructures (HalideIR
  and ISL schedule trees) turned into more impedance mismatch problems than
  could be solved with a small tactical investment.
- Tensor Comprehensions emitted CUDA code which was then JIT-compiled with
  NVCC from a textual representation. While this was a pragmatic short-term
  solution, it made it hard to perform low-level rewrites that would have
  helped with register reuse in the ***compute-bound regime***.
- The same reliance on emitting CUDA code made it difficult to create cost
  models when the time came. This made pruning out bad solutions artificially
  harder than necessary and resulted in excessive runtime evaluation, as
  reported in the paper [Machine Learning Systems are Stuck in a
  Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).

Many of those issues are naturally addressed by implementing these ideas in
the MLIR infrastructure.

### Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>

The polyhedral model has been on the cutting edge of loop-level optimization
for decades, with several incarnations in production compilers such as
[GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
[Polly](https://polly.llvm.org) for LLVM. Although it has proved crucial to
generate efficient code from domain-specific languages such as
[PolyMage](http://mcl.csa.iisc.ac.in/polymage.html) and [Tensor
Comprehensions](https://dl.acm.org/doi/abs/10.1145/3355606), it has never
been fully included in mainstream general-purpose optimization pipelines.
Detailed analysis of the role of polyhedral transformations is provided in
the [simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md)
document dating back to the inception of MLIR.

In particular, polyhedral abstractions have proved challenging to integrate
with a more conventional compiler due to the following:
- The transformed code (or IR) quickly gets complex and thus hard to analyze
  and understand.
- Code generation from the mathematical form used in the polyhedral model
  relies on non-trivial, exponentially complex algorithms.
- The mathematical form is rarely composable with the SSA representation and
  related algorithms, on which most mainstream compilers are built today.
- Expressiveness limitations, although addressed in the scientific literature
  through, e.g., summary functions, often remain present in actual
  implementations.

The Affine dialect in MLIR was specifically designed to address the
integration problems mentioned above. In particular, it maintains the IR in
the same form (loops with additional constraints on how the bounds are
expressed) throughout the transformation, decreasing the need for one-shot
conversion between drastically different representations. It also embeds the
polyhedral representation into the SSA form by using MLIR regions and thus
allows one to combine polyhedral and SSA-based transformations.

### Lessons from the Affine dialect<a name="lessonsaffine"></a>

The Affine dialect in MLIR brings the polyhedral abstraction closer to the
conventional SSA representation. It addresses several long-standing
integration challenges as described above and is likely to be more suitable
when compiling from a C language-level abstraction.

MLIR makes it possible to start from a higher-level abstraction than C, for
example in machine learning workloads. In such cases, it may be possible to
avoid complex analyses (data-flow analysis across loop iterations is
exponentially complex) required for polyhedral transformations by leveraging
the information available at higher levels of abstraction, similarly to DSL
compilers. Linalg intends to use this information when available and ensure
*legality of transformations by construction*, by integrating legality
preconditions in the op semantics (for example, loop tiling can be applied
to the loop nest computing a matrix multiplication without additionally
relying on affine dependence analysis to check it). This information is not
readily available in the Affine dialect, and can only be derived using
potentially expensive pattern-matching algorithms.

Informed by the practical experience in polyhedral compilation and with the
Affine dialect in particular, Linalg takes the following decisions.
- **Discourage loop skewing**: the loop skewing transformation, which is
  sometimes used to enable parallelization, often has surprising (negative)
  effects on performance. In particular, polyhedral auto-transformation can
  be expressed in a simpler way without loop skewing; skewing often leads to
  complex control flow hampering performance on accelerators such as GPUs.
  Moreover, the problems loop skewing addresses can be better addressed by
  other approaches, e.g., diamond tiling. In the more restricted case of ML
  workloads, multi-for loops with induction variables independent of each
  other (referred to as hyper-rectangular iteration domains in the
  literature), such as the proposed
  [affine.parallel](https://llvm.discourse.group/t/rfc-add-affine-parallel/350),
  are sufficient in the majority of cases.
- **Declarative Tiling**: the *tiling* transformation is ubiquitous in HPC
  code generation. It can be seen as a decomposition of either the iteration
  space or the data space into smaller regular parts, referred to as tiles.
  Polyhedral approaches, including the Affine dialect, mostly opt for
  iteration space tiling, which introduces additional control flow and
  complex address expressions. If the tile sizes are not known during the
  transformation (so-called parametric tiling), the address expressions and
  conditions quickly become non-affine or require exponentially complex
  algorithms to reason about them. Linalg focuses tiling on the data space
  instead, creating views into the buffers that leverage MLIR's strided
  `memref` abstraction. These views compose, and the complexity of access
  expressions remains predictable (see the sketch after this list).
- **Preserve high-level information**: Linalg maintains the information
  provided by the op semantics as long as necessary for transformations. For
  example, the result of tiling a matrix multiplication is loops around a
  smaller matrix multiplication. Even with pattern-matching on top of the
  Affine dialect, this would have required another step of pattern-matching
  after the transformation.
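The following minimal sketch illustrates such views in present-day MLIR
syntax, where `memref.subview` is the descendant of the original
`linalg.subview`; the function name and the fixed 4x4 and 2x2 sizes are
illustrative assumptions.

```mlir
func.func @tile_views(%buf: memref<?x?xf32>, %i: index, %j: index) {
  // A view into a 4x4 tile of a dynamically sized buffer. The strided
  // layout keeps the address computation in the predictable
  // offset + i * stride0 + j * stride1 form, whatever the tile position.
  %tile = memref.subview %buf[%i, %j] [4, 4] [1, 1]
    : memref<?x?xf32> to memref<4x4xf32, strided<[?, 1], offset: ?>>
  // Views compose: a subview of a subview is again a strided view, so
  // repeated tiling does not make access expressions grow in complexity.
  %quad = memref.subview %tile[0, 0] [2, 2] [1, 1]
    : memref<4x4xf32, strided<[?, 1], offset: ?>>
      to memref<2x2xf32, strided<[?, 1], offset: ?>>
  return
}
```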
Given these choices, Linalg intends to be a better fit for **high-level
compilation** where significantly more information is readily available in
the input representation and should be leveraged before lowering to other
abstractions. Affine remains a strong abstraction for mid-level
transformation and is used as a lowering target for Linalg, enabling further
transformations and the combination of semantically loaded and lower-level
inputs. As such, Linalg is intended to complement Affine rather than replace
it.

## Core Guiding Principles<a name="guiding_principles"></a>

### Transformations and Simplicity First<a name="transformations_first"></a>

The purpose of the Linalg IR and its operations is primarily to:
- develop a set of key transformations;
- make them correct by construction by carefully curating the set of generic
  operation properties that drive applicability; and
- make them very simple to implement, apply, verify and, especially,
  maintain.

The problem at hand is fundamentally driven by compilation of
domain-specific workloads for high-performance and parallel hardware
architectures: **this is an HPC compilation problem**.

The selection of relevant transformations follows a co-design approach and
involves considerations related to:
- concrete current and future needs of the application domain,
- concrete current and future hardware properties and ISAs,
- understanding of strengths and limitations of
  [existing approaches](#prior_art),
- taking advantage of the coexistence of multiple levels of IR in MLIR.

One needs to be methodical to avoid proliferation and redundancy. A given
transformation could exist at multiple levels of abstraction but **just
because one can write transformation X at level Y absolutely does not mean
one should**. This is where evaluation of existing systems and
acknowledgement of their strengths and weaknesses is crucial: simplicity and
maintainability aspects must be first-order concerns. Without this
additional effort of introspection, a design will not stand the test of
time. At the same time, complexity is very hard to ward off. It seems one
needs to suffer complexity to be prompted to take a step back and rethink
abstractions.

This is not merely a reimplementation of idea X in system Y: simplicity
**must be the outcome** of this introspection effort.
### Preservation of Information<a name="information_preservation"></a>

The last two decades have seen a proliferation of Domain-Specific Languages
(DSLs) that have been very successful in limited application domains.
The main commonality between these systems is their use of significantly
richer structural information than CFGs or loops. Another commonality of
existing systems is that they lower to LLVM very quickly, crossing a wide
abstraction gap in a single step. This process often drops semantic
information that needs to be reconstructed later, when it is not
irremediably lost.

These remarks, coupled with MLIR's suitability for defining IR at multiple
levels of abstraction, led to the following two principles.

#### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>

Compiler transformations need static structural information (e.g.
loop nests, graphs of basic blocks, pure functions, etc.). When that
structural information is lost, it needs to be reconstructed.

A good illustration of this phenomenon is the notion of *raising* in
polyhedral compilers: multiple polyhedral tools start by raising from a
simplified C form or from SSA IR into a higher-level representation that is
more amenable to loop transformations.

In advanced polyhedral compilers, a second type of raising may typically
exist to detect particular patterns (often variations of BLAS). Such
patterns may be broken by transformations, making their detection very
fragile or even impossible (incorrect).

MLIR makes it easy to define op semantics declaratively thanks to the use of
regions and attributes. This is an ideal opportunity to define new
abstractions to convey user intent directly into the proper abstraction.

#### Progressive Lowering: Don't Lose Information too Quickly<a name="progressive_lowering"></a>

Lowering too quickly to affine, generic loops or CFG form reduces the amount
of structure available to derive transformations from. While manipulating
loops is a net gain compared to CFG form for a certain class of
transformations, important information is still lost (e.g. parallel loops,
or mapping of a loop nest to an external implementation).

This creates non-trivial phase ordering issues. For instance, loop fusion
may easily destroy the ability to detect a BLAS pattern. One possible
alternative is to perform loop fusion, tiling, intra-tile loop distribution
and then hope to detect the BLAS pattern. Such a scheme presents difficult
phase-ordering constraints that will likely interfere with other decisions
and passes. Instead, certain Linalg ops are designed to maintain high-level
information across transformations such as tiling and fusion.

MLIR is designed as an infrastructure for ***progressive lowering***.
Linalg fully embraces this notion and thinks of codegen in terms of
*reducing a potential function*. That potential function is loosely defined
in terms of the number of low-level instructions in a particular Linalg op
(i.e. how heavy or lightweight the Linalg op is). Linalg-based codegen and
transformations start from higher-level IR ops and dialects. Then each
transformation application reduces the potential by introducing lower-level
IR ops and *smaller* Linalg ops. This gradually reduces the potential, all
the way to Loops + VectorOps and LLVMIR.
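To ground these two principles, here is a small sketch of a declaratively
specified op in present-day Linalg syntax: a pointwise addition written as
`linalg.generic`, whose attributes carry, by construction, the structural
information (parallel iterators, identity indexing) that transformations
would otherwise need to recover by analysis. The function name is an
illustrative assumption.

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>

func.func @pointwise_add(%A: memref<?x?xf32>, %B: memref<?x?xf32>,
                         %C: memref<?x?xf32>) {
  // The semantics are declared, not lowered: the attributes state that both
  // loops are parallel and that all operands use the identity indexing.
  linalg.generic
    {indexing_maps = [#map, #map, #map],
     iterator_types = ["parallel", "parallel"]}
    ins(%A, %B : memref<?x?xf32>, memref<?x?xf32>)
    outs(%C : memref<?x?xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %0 = arith.addf %a, %b : f32
    linalg.yield %0 : f32
  }
  return
}
```

A transformation such as tiling, fusion or vectorization reads
`iterator_types` and `indexing_maps` directly; nothing needs to be raised
back from loop nests.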
### Composable and Declarative Transformations<a name="declarative_transformations"></a>

Complex and impactful transformations need not be hard to manipulate, write
or maintain. Mixing XLA-style high-level op semantics knowledge with generic
properties to describe these semantics, directly in MLIR, is a promising way
to:
- Design transformations that are correct by construction, easy to write,
  easy to verify and easy to maintain.
- Provide a way to specify transformations and the units of IR they
  manipulate declaratively. In turn, this allows using local pattern rewrite
  rules in MLIR (i.e. [DRR](../DeclarativeRewrites.md)).
- Allow creating customizable passes declaratively by simply selecting
  rewrite rules. This allows mixing transformations, canonicalizations,
  constant folding and other enabling rewrites in a single pass. The result
  is a system where pass fusion is very simple to obtain and gives hope for
  solving certain
  [phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).

### Suitability for Search and Machine Learning<a name="ml"></a>

Compiler heuristics are hand-crafted, human-engineered features: they are
ripe for disruption by machine-learning techniques. To enable search,
compiler transformations should be fine-grained,
[composable](#declarative_transformations) and expose tuning parameters that
can modify their behavior, guided by lessons from previous experience with
[Tensor Comprehensions](#lessonstc).

Of course, we are not advocating for using ML everywhere in the stack
immediately: low-level compilation and machine models are still quite
performant in LLVM. However, for the high-level and mid-level optimization
problems, models need to be conditioned (probabilistically) on the low-level
compiler, which acts as a black box. For these reasons we prioritize the
design of IR and transformations with search-friendly properties over
building cost models. Still, this does not mean Linalg refuses cost models:
instead we prefer to invest in infrastructure that will enable [ML-based
techniques to automatically build cost
models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).

### Extensibility and Future-Proofness<a name="future"></a>

MLIR allows defining IR for structured control flow and structured data
types. We choose to take advantage of these properties for the reasons
described above. In particular, the `MemRefType` represents dense,
potentially non-contiguous memory regions. This structure should extend
beyond simple dense data types and generalize to ragged, sparse and mixed
dense/sparse tensors as well as to trees, hash tables, tables of records and
maybe even graphs.

For such more advanced data types, the control flow required to traverse the
data structures, the termination conditions, etc. are much less simple to
analyze and characterize statically. As a consequence, we need to also
design solutions that stand a chance of evolving into runtime-adaptive
computations (e.g. inspector-executor, in which an *inspector* runs a cheap
runtime analysis on the data to configure how the *executor* should run).
While there is no concrete solution today to solve these problems in MLIR,
it is pretty clear that perfect static knowledge and analyses will not be
serious contenders for these problems.
## Key Observations<a name="keyobservation"></a>

The following key observations have influenced the design of Linalg and
helped reconcile [core guiding principles](#guiding_principles) with
real-world requirements when producing an implementation based on MLIR.

### Algorithms + Data Structures = Programs<a name="data_and_compute"></a>

This is a twist on Niklaus Wirth's formulation but it captures the essence
of the design of Linalg: control flow does not exist in a vacuum,
independently of data. On the contrary, there is a very strong relationship
between control flow and data structures: one cannot exist without the
other. This has multiple implications on the
[semantics of Linalg Ops](#linalg_ops) and their transformations. In
particular, this observation influences whether certain transformations are
better done:
- as control flow or data structure manipulation,
- on Linalg ops attributes or on loops after some partial lowering occurred,
- as extensions to the Linalg dialect in terms of new ops or attributes.

### The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>

This is probably the most surprising and counter-intuitive observation. When
one designs IR for transformations, closed-ness is often a non-negotiable
property. This is a key design principle of polyhedral IRs such as
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
and [ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
they are closed under affine transformations. In MLIR, multiple dialects
coexist and form a coherent whole. After experimenting with different
alternatives, it became clear that strict dialect closed-ness wasn't
necessary and could be relaxed. Previous systems did not have simple and
principled means of building new IR and probably suffered from this
limitation. We conjecture this is a key reason they required the IR to be
closed under transformations.

Despite the fact that Linalg ops only allow perfectly nested semantics, once
tiling and fusion kick in, imperfectly nested loops are gradually
introduced. In other words, imperfectly nested control flow appears as
***the result of applying key transformations***.

Considering the *potential* described during the discussion on
[Progressive Lowering](#progressive_lowering), closed-ness under
transformation would dictate that the potential remains constant. In
contrast, Linalg advocates for ***monotonicity*** under transformations.

### Summary of Existing Alternatives in a Picture<a name="observationssummary"></a>

Lastly, we summarize our observations of lessons from
[Prior Art](#prior_art)---when viewed under the lens of our
[Core Guiding Principles](#guiding_principles)---with the following picture.

<img width="1200" alt="MLIR Codegen Flow"
src="https://user-images.githubusercontent.com/10148468/73613904-2f720a00-45c8-11ea-8265-1c856c02525b.png">

This figure is not meant to be perfectly accurate but rather a rough map of
how we view the distribution of structural information in existing systems,
from a codegen-friendly angle. Unsurprisingly, the
[Linalg Dialect](../Dialects/Linalg/) and its future evolutions aspire to a
position in the top-right of this map.