109467b48Spatrick==================
209467b48SpatrickVectorization Plan
309467b48Spatrick==================
409467b48Spatrick
509467b48Spatrick.. contents::
609467b48Spatrick   :local:
709467b48Spatrick
809467b48SpatrickAbstract
909467b48Spatrick========
1009467b48SpatrickThe vectorization transformation can be rather complicated, involving several
1109467b48Spatrickpotential alternatives, especially for outer-loops [1]_ but also possibly for
1209467b48Spatrickinnermost loops. These alternatives may have significant performance impact,
1309467b48Spatrickboth positive and negative. A cost model is therefore employed to identify the
1409467b48Spatrickbest alternative, including the alternative of avoiding any transformation
1509467b48Spatrickaltogether.
1609467b48Spatrick
1709467b48SpatrickThe Vectorization Plan is an explicit model for describing vectorization
1809467b48Spatrickcandidates. It serves for both optimizing candidates including estimating their
1909467b48Spatrickcost reliably, and for performing their final translation into IR. This
2009467b48Spatrickfacilitates dealing with multiple vectorization candidates.
2109467b48Spatrick
2209467b48SpatrickHigh-level Design
2309467b48Spatrick=================
2409467b48Spatrick
2509467b48SpatrickVectorization Workflow
2609467b48Spatrick----------------------
2709467b48SpatrickVPlan-based vectorization involves three major steps, taking a "scenario-based
2809467b48Spatrickapproach" to vectorization planning:
2909467b48Spatrick
3009467b48Spatrick1. Legal Step: check if a loop can be legally vectorized; encode constraints and
3109467b48Spatrick   artifacts if so.
3209467b48Spatrick2. Plan Step:
3309467b48Spatrick
3409467b48Spatrick   a. Build initial VPlans following the constraints and decisions taken by
3509467b48Spatrick      Legal Step 1, and compute their cost.
3609467b48Spatrick   b. Apply optimizations to the VPlans, possibly forking additional VPlans.
3709467b48Spatrick      Prune sub-optimal VPlans having relatively high cost.
3809467b48Spatrick3. Execute Step: materialize the best VPlan. Note that this is the only step
3909467b48Spatrick   that modifies the IR.
4009467b48Spatrick
4109467b48SpatrickDesign Guidelines
4209467b48Spatrick-----------------
4309467b48SpatrickIn what follows, the term "input IR" refers to code that is fed into the
4409467b48Spatrickvectorizer whereas the term "output IR" refers to code that is generated by the
4509467b48Spatrickvectorizer. The output IR contains code that has been vectorized or "widened"
4609467b48Spatrickaccording to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
4709467b48Spatrickaccording to an Unroll Factor (UF).
4809467b48SpatrickThe design of VPlan follows several high-level guidelines:
4909467b48Spatrick
5009467b48Spatrick1. Analysis-like: building and manipulating VPlans must not modify the input IR.
5109467b48Spatrick   In particular, if the best option is not to vectorize at all, the
5209467b48Spatrick   vectorization process terminates before reaching Step 3, and compilation
5309467b48Spatrick   should proceed as if VPlans had not been built.
5409467b48Spatrick
5509467b48Spatrick2. Align Cost & Execute: each VPlan must support both estimating the cost and
5609467b48Spatrick   generating the output IR code, such that the cost estimation evaluates the
5709467b48Spatrick   to-be-generated code reliably.
5809467b48Spatrick
5909467b48Spatrick3. Support vectorizing additional constructs:
6009467b48Spatrick
6109467b48Spatrick   a. Outer-loop vectorization. In particular, VPlan must be able to model the
6209467b48Spatrick      control-flow of the output IR which may include multiple basic-blocks and
6309467b48Spatrick      nested loops.
6409467b48Spatrick   b. SLP vectorization.
6509467b48Spatrick   c. Combinations of the above, including nested vectorization: vectorizing
6609467b48Spatrick      both an inner loop and an outer-loop at the same time (each with its own
6709467b48Spatrick      VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
6809467b48Spatrick      inside [4]_, (re)vectorizing input IR containing vector code.
6909467b48Spatrick   d. Function vectorization [2]_.
7009467b48Spatrick
7109467b48Spatrick4. Support multiple candidates efficiently. In particular, similar candidates
7209467b48Spatrick   related to a range of possible VF's and UF's must be represented efficiently.
7309467b48Spatrick   Potential versioning needs to be supported efficiently.
7409467b48Spatrick
7509467b48Spatrick5. Support vectorizing idioms, such as interleaved groups of strided loads or
7609467b48Spatrick   stores. This is achieved by modeling a sequence of output instructions using
7709467b48Spatrick   a "Recipe", which is responsible for computing its cost and generating its
7809467b48Spatrick   code.
7909467b48Spatrick
8009467b48Spatrick6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
8109467b48Spatrick   such regions may need to be, for example, predicated and linearized, or
8209467b48Spatrick   replicated VF*UF times to handle scalarized and predicated instructions.
8309467b48Spatrick   Innerloops are also modelled as SESE regions.
8409467b48Spatrick
8509467b48Spatrick7. Support instruction-level analysis and transformation, as part of Planning
8609467b48Spatrick   Step 2.b: During vectorization instructions may need to be traversed, moved,
8709467b48Spatrick   replaced by other instructions or be created. For example, vector idiom
8809467b48Spatrick   detection and formation involves searching for and optimizing instruction
8909467b48Spatrick   patterns.
9009467b48Spatrick
9109467b48SpatrickDefinitions
9209467b48Spatrick===========
9309467b48SpatrickThe low-level design of VPlan comprises of the following classes.
9409467b48Spatrick
9509467b48Spatrick:LoopVectorizationPlanner:
9609467b48Spatrick  A LoopVectorizationPlanner is designed to handle the vectorization of a loop
9709467b48Spatrick  or a loop nest. It can construct, optimize and discard one or more VPlans,
9809467b48Spatrick  each VPlan modelling a distinct way to vectorize the loop or the loop nest.
9909467b48Spatrick  Once the best VPlan is determined, including the best VF and UF, this VPlan
10009467b48Spatrick  drives the generation of output IR.
10109467b48Spatrick
10209467b48Spatrick:VPlan:
10309467b48Spatrick  A model of a vectorized candidate for a given input IR loop or loop nest. This
10409467b48Spatrick  candidate is represented using a Hierarchical CFG. VPlan supports estimating
10509467b48Spatrick  the cost and driving the generation of the output IR code it represents.
10609467b48Spatrick
10709467b48Spatrick:Hierarchical CFG:
10809467b48Spatrick  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
10909467b48Spatrick  Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
11009467b48Spatrick  cross-Tile edges are lifted to connect Tiles instead of the original
11109467b48Spatrick  basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
11209467b48Spatrick  Region and Block are used rather than Tile [5]_ to avoid confusion with loop
11309467b48Spatrick  tiling.
11409467b48Spatrick
11509467b48Spatrick:VPBlockBase:
11609467b48Spatrick  The building block of the Hierarchical CFG. A pure-virtual base-class of
11709467b48Spatrick  VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
11809467b48Spatrick  control-flow relations with other VPBlocks. Note that in contrast to the IR
11909467b48Spatrick  BasicBlock, a VPBlockBase models its control-flow successors and predecessors
12009467b48Spatrick  directly, rather than through a Terminator branch or through predecessor
12109467b48Spatrick  branches that "use" the VPBlockBase.
12209467b48Spatrick
12309467b48Spatrick:VPBasicBlock:
12409467b48Spatrick  VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
12509467b48Spatrick  Hierarchical CFG. It represents a sequence of output IR instructions that will
12609467b48Spatrick  appear consecutively in an output IR basic-block. The instructions of this
12709467b48Spatrick  basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
12809467b48Spatrick  sequence of zero or more VPRecipes that model the cost and generation of the
12909467b48Spatrick  output IR instructions.
13009467b48Spatrick
13109467b48Spatrick:VPRegionBlock:
13209467b48Spatrick  VPRegionBlock is a subclass of VPBlockBase. It models a collection of
13309467b48Spatrick  VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
13409467b48Spatrick  CFG. A VPRegionBlock may indicate that its contents are to be replicated a
13509467b48Spatrick  constant number of times when output IR is generated, effectively representing
13609467b48Spatrick  a loop with constant trip-count that will be completely unrolled. This is used
13709467b48Spatrick  to support scalarized and predicated instructions with a single model for
13809467b48Spatrick  multiple candidate VF's and UF's.
13909467b48Spatrick
14009467b48Spatrick:VPRecipeBase:
14109467b48Spatrick  A pure-virtual base class modeling a sequence of one or more output IR
14209467b48Spatrick  instructions, possibly based on one or more input IR instructions. These
14309467b48Spatrick  input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
14409467b48Spatrick  may specify how its ingredients are to be transformed to produce the output IR
14509467b48Spatrick  instructions; e.g., cloned once, replicated multiple times or widened
14609467b48Spatrick  according to selected VF.
14709467b48Spatrick
14809467b48Spatrick:VPValue:
14909467b48Spatrick  The base of VPlan's def-use relations class hierarchy. When instantiated, it
15009467b48Spatrick  models a constant or a live-in Value in VPlan. It has users, which are of type
15109467b48Spatrick  VPUser, but no operands.
15209467b48Spatrick
15309467b48Spatrick:VPUser:
154*73471bf0Spatrick  A VPUser represents an entity that uses a number of VPValues as operands.
155*73471bf0Spatrick  VPUser is similar in some aspects to LLVM's User class.
156*73471bf0Spatrick
157*73471bf0Spatrick:VPDef:
158*73471bf0Spatrick  A VPDef represents an entity that defines zero, one or multiple VPValues.
159*73471bf0Spatrick  It is used to model the fact that recipes in VPlan can define multiple
160*73471bf0Spatrick  VPValues.
16109467b48Spatrick
16209467b48Spatrick:VPInstruction:
16309467b48Spatrick  A VPInstruction is both a VPRecipe and a VPUser. It models a single
16409467b48Spatrick  VPlan-level instruction to be generated if the VPlan is executed, including
16509467b48Spatrick  its opcode and possibly additional characteristics. It is the basis for
16609467b48Spatrick  writing instruction-level analyses and optimizations in VPlan as creating,
16709467b48Spatrick  replacing or moving VPInstructions record both def-use and scheduling
16809467b48Spatrick  decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
16909467b48Spatrick  operations that enrich the Vectorizer's semantics.
17009467b48Spatrick
17109467b48Spatrick:VPTransformState:
17209467b48Spatrick  Stores information used for generating output IR, passed from
17309467b48Spatrick  LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
17409467b48Spatrick  additional information down to VPBlocks and VPRecipes.
17509467b48Spatrick
17609467b48SpatrickThe Planning Process and VPlan Roadmap
17709467b48Spatrick======================================
17809467b48Spatrick
17909467b48SpatrickTransforming the Loop Vectorizer to use VPlan follows a staged approach. First,
18009467b48SpatrickVPlan is used to record the final vectorization decisions, and to execute them:
18109467b48Spatrickthe Hierarchical CFG models the planned control-flow, and Recipes capture
18209467b48Spatrickdecisions taken inside basic-blocks. Next, VPlan will be used also as the basis
18309467b48Spatrickfor taking these decisions, effectively turning them into a series of
18409467b48SpatrickVPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
18509467b48Spatrickitself including cost-based analyses for making these decisions, to fully
18609467b48Spatricksupport compositional and iterative decision making.
18709467b48Spatrick
18809467b48SpatrickSome decisions are local to an instruction in the loop, such as whether to widen
18909467b48Spatrickit into a vector instruction or replicate it, keeping the generated instructions
19009467b48Spatrickin place. Other decisions, however, involve moving instructions, replacing them
19109467b48Spatrickwith other instructions, and/or introducing new instructions. For example, a
19209467b48Spatrickcast may sink past a later instruction and be widened to handle first-order
19309467b48Spatrickrecurrence; an interleave group of strided gathers or scatters may effectively
19409467b48Spatrickmove to one place where they are replaced with shuffles and a common wide vector
19509467b48Spatrickload or store; new instructions may be introduced to compute masks, shuffle the
19609467b48Spatrickelements of vectors, and pack scalar values into vectors or vice-versa.
19709467b48Spatrick
19809467b48SpatrickIn order for VPlan to support making instruction-level decisions and analyses,
19909467b48Spatrickit needs to model the relevant instructions along with their def/use relations.
20009467b48SpatrickThis too follows a staged approach: first, the new instructions that compute
20109467b48Spatrickmasks are modeled as VPInstructions, along with their induced def/use subgraph.
20209467b48SpatrickThis effectively models masks in VPlan, facilitating VPlan-based predication.
20309467b48SpatrickNext, the logic embedded within each Recipe for generating its instructions at
20409467b48SpatrickVPlan execution time, will instead take part in the planning process by modeling
20509467b48Spatrickthem as VPInstructions. Finally, only logic that applies to instructions as a
20609467b48Spatrickgroup will remain in Recipes, such as interleave groups and potentially other
20709467b48Spatrickidiom groups having synergistic cost.
20809467b48Spatrick
20909467b48SpatrickRelated LLVM components
21009467b48Spatrick-----------------------
21109467b48Spatrick1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
21209467b48Spatrick   tree, where TSLP [3]_ adds Plan Step 2.b.
21309467b48Spatrick
21409467b48Spatrick2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
21509467b48Spatrick   Polly [7]_.
21609467b48Spatrick
21709467b48Spatrick3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
21809467b48Spatrick   the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.
21909467b48Spatrick
22009467b48SpatrickReferences
22109467b48Spatrick----------
22209467b48Spatrick.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
22309467b48Spatrick    Nuzman and Ayal Zaks, PACT 2008.
22409467b48Spatrick
22509467b48Spatrick.. [2] "Proposal for function vectorization and loop vectorization with function
22609467b48Spatrick    calls", Xinmin Tian, [`cfe-dev
22709467b48Spatrick    <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
22809467b48Spatrick    March 2, 2016.
22909467b48Spatrick    See also `review <https://reviews.llvm.org/D22792>`_.
23009467b48Spatrick
23109467b48Spatrick.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
23209467b48Spatrick    Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
23309467b48Spatrick
23409467b48Spatrick.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
23509467b48Spatrick    overhead", Hao Zhou and Jingling Xue, CGO 2016.
23609467b48Spatrick
23709467b48Spatrick.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
23809467b48Spatrick    Brian Koblenz, PLDI 1991
23909467b48Spatrick
24009467b48Spatrick.. [6] "Structural analysis: A new approach to flow analysis in optimizing
24109467b48Spatrick    compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
24209467b48Spatrick
24309467b48Spatrick.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
24409467b48Spatrick    thesis, 2011.
24509467b48Spatrick
24609467b48Spatrick.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
24709467b48Spatrick    European LLVM Developers' Meeting 2017.
24809467b48Spatrick
24909467b48Spatrick.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
25009467b48Spatrick    Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
251