================
The Architecture
================

Polly is a loop optimizer for LLVM. Starting from LLVM-IR it detects and
extracts interesting loop kernels. For each kernel a mathematical model is
derived which precisely describes the individual computations and memory
accesses in the kernel. Within Polly a variety of analyses and code
transformations are performed on this mathematical model. After all
optimizations have been derived and applied, optimized LLVM-IR is regenerated
and inserted into the LLVM-IR module.

.. image:: images/architecture.png
    :align: center

Polly in the LLVM pass pipeline
-------------------------------

The standard LLVM pass pipeline as used in the -O1/-O2/-O3 modes of clang/opt
consists of a sequence of passes that can be grouped into different conceptual
phases. The first phase, which we call **Canonicalization** here, is a scalar
canonicalization phase containing passes like -mem2reg, -instcombine,
-cfgsimplify, or early loop unrolling. Its goal is to remove and simplify the
given IR as much as possible, focusing mostly on scalar optimizations. The
second phase consists of three conceptual groups that are executed in the
so-called **Inliner cycle**: a set of **Scalar Simplification** passes, a set
of **Simple Loop Optimizations**, and the **Inliner** itself. Even though these
passes make up the majority of the LLVM pass pipeline, their primary goal is
still canonicalization without losing semantic information that would
complicate later analysis. As part of the inliner cycle, the LLVM inliner
step-by-step tries to inline functions, runs canonicalization passes to exploit
newly exposed simplification opportunities, and then tries to inline the
further simplified functions. Some simple loop optimizations are executed as
part of the inliner cycle.
Even though these loop passes perform some optimizations, their primary goal is
still the simplification of the program code. Loop-invariant code motion is one
such optimization: besides being beneficial for program performance, it allows
us to move computation out of loops and in the best case enables us to
eliminate certain loops completely. Only after the inliner cycle has finished
is a final **Target Specialization** phase run, in which IR complexity is
deliberately increased to take advantage of target-specific features that
maximize execution performance on the device we target. One of the principal
optimizations in this phase is vectorization, but it also includes
target-specific loop unrolling and some loop transformations (e.g.,
distribution) that expose more vectorization opportunities.

.. image:: images/LLVM-Passes-only.png
    :align: center

Polly can conceptually be run at three different positions in the pass
pipeline: as an early optimizer before the standard LLVM pass pipeline, as a
late optimizer as part of the target specialization sequence, and theoretically
also together with the loop optimizations in the inliner cycle. We only discuss
the first two options, as running Polly in the inliner cycle is likely to
disturb the inliner and is consequently not a good idea.

.. image:: images/LLVM-Passes-all.png
    :align: center

Running Polly early, before the standard pass pipeline, has the benefit that
the LLVM-IR processed by Polly is still very close to the original input code.
Hence, it is less likely that transformations applied by LLVM change the IR in
ways not easily understandable for the programmer. As a result, user feedback
is likely better, and it is less likely that kernels that in C seem a perfect
fit for Polly have been transformed such that Polly can no longer handle them.
On the other hand, code that requires inlining to be optimized won't benefit if
Polly is scheduled at this position.
In addition, the set of canonicalization passes Polly requires at this position
results in a small but general compile-time increase and some random run-time
performance changes due to slightly different IR being passed through the
optimizers. To force Polly to run early in the pass pipeline, use the option
*-polly-position=early* (the default today).

.. image:: images/LLVM-Passes-early.png
    :align: center

Running Polly right before the vectorizer has the benefit that the full
inlining cycle has been run, so that even heavily templated C++ code could
theoretically benefit from Polly (more work is necessary to make Polly really
effective at this position). As the IR that is passed to Polly has already been
canonicalized, there is also no need to run additional canonicalization passes.
General compile time is almost unaffected by Polly, as the detection of loop
kernels is generally very fast and the actual optimization and cleanup passes
are only run on functions which contain loop kernels that are worth optimizing.
However, due to the many optimizations that LLVM runs before Polly, the IR that
reaches Polly often contains additional scalar dependences that make Polly a
lot less effective. To force Polly to run before the vectorizer in the pass
pipeline, use the option *-polly-position=before-vectorizer*. This position is
not yet the default for Polly, but work is under way to make Polly effective
even in the presence of scalar dependences. After this work has been completed,
Polly will likely use this position by default.

.. image:: images/LLVM-Passes-late.png
    :align: center