109467b48Spatrick================== 209467b48SpatrickVectorization Plan 309467b48Spatrick================== 409467b48Spatrick 509467b48Spatrick.. contents:: 609467b48Spatrick :local: 709467b48Spatrick 809467b48SpatrickAbstract 909467b48Spatrick======== 1009467b48SpatrickThe vectorization transformation can be rather complicated, involving several 1109467b48Spatrickpotential alternatives, especially for outer-loops [1]_ but also possibly for 1209467b48Spatrickinnermost loops. These alternatives may have significant performance impact, 1309467b48Spatrickboth positive and negative. A cost model is therefore employed to identify the 1409467b48Spatrickbest alternative, including the alternative of avoiding any transformation 1509467b48Spatrickaltogether. 1609467b48Spatrick 1709467b48SpatrickThe Vectorization Plan is an explicit model for describing vectorization 1809467b48Spatrickcandidates. It serves for both optimizing candidates including estimating their 1909467b48Spatrickcost reliably, and for performing their final translation into IR. This 2009467b48Spatrickfacilitates dealing with multiple vectorization candidates. 2109467b48Spatrick 2209467b48SpatrickHigh-level Design 2309467b48Spatrick================= 2409467b48Spatrick 2509467b48SpatrickVectorization Workflow 2609467b48Spatrick---------------------- 2709467b48SpatrickVPlan-based vectorization involves three major steps, taking a "scenario-based 2809467b48Spatrickapproach" to vectorization planning: 2909467b48Spatrick 3009467b48Spatrick1. Legal Step: check if a loop can be legally vectorized; encode constraints and 3109467b48Spatrick artifacts if so. 3209467b48Spatrick2. Plan Step: 3309467b48Spatrick 3409467b48Spatrick a. Build initial VPlans following the constraints and decisions taken by 3509467b48Spatrick Legal Step 1, and compute their cost. 3609467b48Spatrick b. Apply optimizations to the VPlans, possibly forking additional VPlans. 3709467b48Spatrick Prune sub-optimal VPlans having relatively high cost. 3809467b48Spatrick3. Execute Step: materialize the best VPlan. Note that this is the only step 3909467b48Spatrick that modifies the IR. 4009467b48Spatrick 4109467b48SpatrickDesign Guidelines 4209467b48Spatrick----------------- 4309467b48SpatrickIn what follows, the term "input IR" refers to code that is fed into the 4409467b48Spatrickvectorizer whereas the term "output IR" refers to code that is generated by the 4509467b48Spatrickvectorizer. The output IR contains code that has been vectorized or "widened" 4609467b48Spatrickaccording to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed 4709467b48Spatrickaccording to an Unroll Factor (UF). 4809467b48SpatrickThe design of VPlan follows several high-level guidelines: 4909467b48Spatrick 5009467b48Spatrick1. Analysis-like: building and manipulating VPlans must not modify the input IR. 5109467b48Spatrick In particular, if the best option is not to vectorize at all, the 5209467b48Spatrick vectorization process terminates before reaching Step 3, and compilation 5309467b48Spatrick should proceed as if VPlans had not been built. 5409467b48Spatrick 5509467b48Spatrick2. Align Cost & Execute: each VPlan must support both estimating the cost and 5609467b48Spatrick generating the output IR code, such that the cost estimation evaluates the 5709467b48Spatrick to-be-generated code reliably. 5809467b48Spatrick 5909467b48Spatrick3. Support vectorizing additional constructs: 6009467b48Spatrick 6109467b48Spatrick a. Outer-loop vectorization. In particular, VPlan must be able to model the 6209467b48Spatrick control-flow of the output IR which may include multiple basic-blocks and 6309467b48Spatrick nested loops. 6409467b48Spatrick b. SLP vectorization. 6509467b48Spatrick c. Combinations of the above, including nested vectorization: vectorizing 6609467b48Spatrick both an inner loop and an outer-loop at the same time (each with its own 6709467b48Spatrick VF and UF), mixed vectorization: vectorizing a loop with SLP patterns 6809467b48Spatrick inside [4]_, (re)vectorizing input IR containing vector code. 6909467b48Spatrick d. Function vectorization [2]_. 7009467b48Spatrick 7109467b48Spatrick4. Support multiple candidates efficiently. In particular, similar candidates 7209467b48Spatrick related to a range of possible VF's and UF's must be represented efficiently. 7309467b48Spatrick Potential versioning needs to be supported efficiently. 7409467b48Spatrick 7509467b48Spatrick5. Support vectorizing idioms, such as interleaved groups of strided loads or 7609467b48Spatrick stores. This is achieved by modeling a sequence of output instructions using 7709467b48Spatrick a "Recipe", which is responsible for computing its cost and generating its 7809467b48Spatrick code. 7909467b48Spatrick 8009467b48Spatrick6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization 8109467b48Spatrick such regions may need to be, for example, predicated and linearized, or 8209467b48Spatrick replicated VF*UF times to handle scalarized and predicated instructions. 8309467b48Spatrick Innerloops are also modelled as SESE regions. 8409467b48Spatrick 8509467b48Spatrick7. Support instruction-level analysis and transformation, as part of Planning 8609467b48Spatrick Step 2.b: During vectorization instructions may need to be traversed, moved, 8709467b48Spatrick replaced by other instructions or be created. For example, vector idiom 8809467b48Spatrick detection and formation involves searching for and optimizing instruction 8909467b48Spatrick patterns. 9009467b48Spatrick 9109467b48SpatrickDefinitions 9209467b48Spatrick=========== 9309467b48SpatrickThe low-level design of VPlan comprises of the following classes. 9409467b48Spatrick 9509467b48Spatrick:LoopVectorizationPlanner: 9609467b48Spatrick A LoopVectorizationPlanner is designed to handle the vectorization of a loop 9709467b48Spatrick or a loop nest. It can construct, optimize and discard one or more VPlans, 9809467b48Spatrick each VPlan modelling a distinct way to vectorize the loop or the loop nest. 9909467b48Spatrick Once the best VPlan is determined, including the best VF and UF, this VPlan 10009467b48Spatrick drives the generation of output IR. 10109467b48Spatrick 10209467b48Spatrick:VPlan: 10309467b48Spatrick A model of a vectorized candidate for a given input IR loop or loop nest. This 10409467b48Spatrick candidate is represented using a Hierarchical CFG. VPlan supports estimating 10509467b48Spatrick the cost and driving the generation of the output IR code it represents. 10609467b48Spatrick 10709467b48Spatrick:Hierarchical CFG: 10809467b48Spatrick A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The 10909467b48Spatrick Hierarchical CFG data structure is similar to the Tile Tree [5]_, where 11009467b48Spatrick cross-Tile edges are lifted to connect Tiles instead of the original 11109467b48Spatrick basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms 11209467b48Spatrick Region and Block are used rather than Tile [5]_ to avoid confusion with loop 11309467b48Spatrick tiling. 11409467b48Spatrick 11509467b48Spatrick:VPBlockBase: 11609467b48Spatrick The building block of the Hierarchical CFG. A pure-virtual base-class of 11709467b48Spatrick VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical 11809467b48Spatrick control-flow relations with other VPBlocks. Note that in contrast to the IR 11909467b48Spatrick BasicBlock, a VPBlockBase models its control-flow successors and predecessors 12009467b48Spatrick directly, rather than through a Terminator branch or through predecessor 12109467b48Spatrick branches that "use" the VPBlockBase. 12209467b48Spatrick 12309467b48Spatrick:VPBasicBlock: 12409467b48Spatrick VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the 12509467b48Spatrick Hierarchical CFG. It represents a sequence of output IR instructions that will 12609467b48Spatrick appear consecutively in an output IR basic-block. The instructions of this 12709467b48Spatrick basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a 12809467b48Spatrick sequence of zero or more VPRecipes that model the cost and generation of the 12909467b48Spatrick output IR instructions. 13009467b48Spatrick 13109467b48Spatrick:VPRegionBlock: 13209467b48Spatrick VPRegionBlock is a subclass of VPBlockBase. It models a collection of 13309467b48Spatrick VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR 13409467b48Spatrick CFG. A VPRegionBlock may indicate that its contents are to be replicated a 13509467b48Spatrick constant number of times when output IR is generated, effectively representing 13609467b48Spatrick a loop with constant trip-count that will be completely unrolled. This is used 13709467b48Spatrick to support scalarized and predicated instructions with a single model for 13809467b48Spatrick multiple candidate VF's and UF's. 13909467b48Spatrick 14009467b48Spatrick:VPRecipeBase: 14109467b48Spatrick A pure-virtual base class modeling a sequence of one or more output IR 14209467b48Spatrick instructions, possibly based on one or more input IR instructions. These 14309467b48Spatrick input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe 14409467b48Spatrick may specify how its ingredients are to be transformed to produce the output IR 14509467b48Spatrick instructions; e.g., cloned once, replicated multiple times or widened 14609467b48Spatrick according to selected VF. 14709467b48Spatrick 14809467b48Spatrick:VPValue: 14909467b48Spatrick The base of VPlan's def-use relations class hierarchy. When instantiated, it 15009467b48Spatrick models a constant or a live-in Value in VPlan. It has users, which are of type 15109467b48Spatrick VPUser, but no operands. 15209467b48Spatrick 15309467b48Spatrick:VPUser: 154*73471bf0Spatrick A VPUser represents an entity that uses a number of VPValues as operands. 155*73471bf0Spatrick VPUser is similar in some aspects to LLVM's User class. 156*73471bf0Spatrick 157*73471bf0Spatrick:VPDef: 158*73471bf0Spatrick A VPDef represents an entity that defines zero, one or multiple VPValues. 159*73471bf0Spatrick It is used to model the fact that recipes in VPlan can define multiple 160*73471bf0Spatrick VPValues. 16109467b48Spatrick 16209467b48Spatrick:VPInstruction: 16309467b48Spatrick A VPInstruction is both a VPRecipe and a VPUser. It models a single 16409467b48Spatrick VPlan-level instruction to be generated if the VPlan is executed, including 16509467b48Spatrick its opcode and possibly additional characteristics. It is the basis for 16609467b48Spatrick writing instruction-level analyses and optimizations in VPlan as creating, 16709467b48Spatrick replacing or moving VPInstructions record both def-use and scheduling 16809467b48Spatrick decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic 16909467b48Spatrick operations that enrich the Vectorizer's semantics. 17009467b48Spatrick 17109467b48Spatrick:VPTransformState: 17209467b48Spatrick Stores information used for generating output IR, passed from 17309467b48Spatrick LoopVectorizationPlanner to its selected VPlan for execution, and used to pass 17409467b48Spatrick additional information down to VPBlocks and VPRecipes. 17509467b48Spatrick 17609467b48SpatrickThe Planning Process and VPlan Roadmap 17709467b48Spatrick====================================== 17809467b48Spatrick 17909467b48SpatrickTransforming the Loop Vectorizer to use VPlan follows a staged approach. First, 18009467b48SpatrickVPlan is used to record the final vectorization decisions, and to execute them: 18109467b48Spatrickthe Hierarchical CFG models the planned control-flow, and Recipes capture 18209467b48Spatrickdecisions taken inside basic-blocks. Next, VPlan will be used also as the basis 18309467b48Spatrickfor taking these decisions, effectively turning them into a series of 18409467b48SpatrickVPlan-to-VPlan algorithms. Finally, VPlan will support the planning process 18509467b48Spatrickitself including cost-based analyses for making these decisions, to fully 18609467b48Spatricksupport compositional and iterative decision making. 18709467b48Spatrick 18809467b48SpatrickSome decisions are local to an instruction in the loop, such as whether to widen 18909467b48Spatrickit into a vector instruction or replicate it, keeping the generated instructions 19009467b48Spatrickin place. Other decisions, however, involve moving instructions, replacing them 19109467b48Spatrickwith other instructions, and/or introducing new instructions. For example, a 19209467b48Spatrickcast may sink past a later instruction and be widened to handle first-order 19309467b48Spatrickrecurrence; an interleave group of strided gathers or scatters may effectively 19409467b48Spatrickmove to one place where they are replaced with shuffles and a common wide vector 19509467b48Spatrickload or store; new instructions may be introduced to compute masks, shuffle the 19609467b48Spatrickelements of vectors, and pack scalar values into vectors or vice-versa. 19709467b48Spatrick 19809467b48SpatrickIn order for VPlan to support making instruction-level decisions and analyses, 19909467b48Spatrickit needs to model the relevant instructions along with their def/use relations. 20009467b48SpatrickThis too follows a staged approach: first, the new instructions that compute 20109467b48Spatrickmasks are modeled as VPInstructions, along with their induced def/use subgraph. 20209467b48SpatrickThis effectively models masks in VPlan, facilitating VPlan-based predication. 20309467b48SpatrickNext, the logic embedded within each Recipe for generating its instructions at 20409467b48SpatrickVPlan execution time, will instead take part in the planning process by modeling 20509467b48Spatrickthem as VPInstructions. Finally, only logic that applies to instructions as a 20609467b48Spatrickgroup will remain in Recipes, such as interleave groups and potentially other 20709467b48Spatrickidiom groups having synergistic cost. 20809467b48Spatrick 20909467b48SpatrickRelated LLVM components 21009467b48Spatrick----------------------- 21109467b48Spatrick1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP 21209467b48Spatrick tree, where TSLP [3]_ adds Plan Step 2.b. 21309467b48Spatrick 21409467b48Spatrick2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by 21509467b48Spatrick Polly [7]_. 21609467b48Spatrick 21709467b48Spatrick3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of 21809467b48Spatrick the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_. 21909467b48Spatrick 22009467b48SpatrickReferences 22109467b48Spatrick---------- 22209467b48Spatrick.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit 22309467b48Spatrick Nuzman and Ayal Zaks, PACT 2008. 22409467b48Spatrick 22509467b48Spatrick.. [2] "Proposal for function vectorization and loop vectorization with function 22609467b48Spatrick calls", Xinmin Tian, [`cfe-dev 22709467b48Spatrick <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., 22809467b48Spatrick March 2, 2016. 22909467b48Spatrick See also `review <https://reviews.llvm.org/D22792>`_. 23009467b48Spatrick 23109467b48Spatrick.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios 23209467b48Spatrick Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. 23309467b48Spatrick 23409467b48Spatrick.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization 23509467b48Spatrick overhead", Hao Zhou and Jingling Xue, CGO 2016. 23609467b48Spatrick 23709467b48Spatrick.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and 23809467b48Spatrick Brian Koblenz, PLDI 1991 23909467b48Spatrick 24009467b48Spatrick.. [6] "Structural analysis: A new approach to flow analysis in optimizing 24109467b48Spatrick compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 24209467b48Spatrick 24309467b48Spatrick.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma 24409467b48Spatrick thesis, 2011. 24509467b48Spatrick 24609467b48Spatrick.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, 24709467b48Spatrick European LLVM Developers' Meeting 2017. 24809467b48Spatrick 24909467b48Spatrick.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop 25009467b48Spatrick Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016. 251