==============================================
Using ARM NEON instructions in big endian mode
==============================================

.. contents::
    :local:

Introduction
============

Generating code for big endian ARM processors is for the most part straightforward. NEON loads and stores, however, have some interesting properties that make code generation decisions less obvious in big endian mode.

The aim of this document is to explain the problem with NEON loads and stores, and the solution that has been implemented in LLVM.

In this document the term "vector" refers to what the ARM ABI calls a "short vector", which is a sequence of items that can fit in a NEON register. This sequence can be 64 or 128 bits in length, and can consist of 8, 16, 32 or 64 bit items. This document refers to A64 instructions throughout, but almost all of it also applies to the A32/ARMv7 instruction sets. The ABI format for passing vectors in A32 is slightly different from that in A64; apart from that, the same concepts apply.

Example: C-level intrinsics -> assembly
---------------------------------------

It may be helpful first to illustrate how C-level ARM NEON intrinsics are lowered to instructions.

This trivial C function takes a vector of four ints and sets the zero'th lane to the value "42"::

    #include <arm_neon.h>
    int32x4_t f(int32x4_t p) {
        return vsetq_lane_s32(42, p, 0);
    }

arm_neon.h intrinsics generate "generic" IR where possible (that is, normal IR instructions not ``llvm.arm.neon.*`` intrinsic calls). The above generates::

    define <4 x i32> @f(<4 x i32> %p) {
      %vset_lane = insertelement <4 x i32> %p, i32 42, i32 0
      ret <4 x i32> %vset_lane
    }

Which then becomes the following trivial assembly::

    f:                                      // @f
            movz	w8, #0x2a
            ins 	v0.s[0], w8
            ret

Problem
=======

The main problem is how vectors are represented in memory and in registers.

First, a recap. The "endianness" of an item affects its representation in memory only. In a register, a number is just a sequence of bits - 64 bits in the case of AArch64 general purpose registers. Memory, however, is a sequence of addressable units of 8 bits in size. Any number greater than 8 bits must therefore be split up into 8-bit chunks, and endianness describes the order in which these chunks are laid out in memory.

A "little endian" layout has the least significant byte first (lowest in memory address). A "big endian" layout has the *most* significant byte first. This means that when loading an item from big endian memory, the byte at the lowest memory address must go into the most significant 8 bits of the register, and so forth.
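
The recap above can be illustrated with a small, portable C sketch. It is not taken from LLVM; the helper names (``store_le32``, ``store_be32``, ``load_be32``, ``endianness_demo``) are hypothetical and exist only for this illustration.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Little endian: least significant byte at the lowest address. */
    void store_le32(uint8_t *mem, uint32_t value) {
        for (int i = 0; i < 4; ++i)
            mem[i] = (uint8_t)(value >> (8 * i));
    }

    /* Big endian: most significant byte at the lowest address. */
    void store_be32(uint8_t *mem, uint32_t value) {
        for (int i = 0; i < 4; ++i)
            mem[i] = (uint8_t)(value >> (8 * (3 - i)));
    }

    /* Big endian load: the lowest-addressed byte becomes the most
       significant 8 bits of the result. */
    uint32_t load_be32(const uint8_t *mem) {
        return ((uint32_t)mem[0] << 24) | ((uint32_t)mem[1] << 16) |
               ((uint32_t)mem[2] << 8) | (uint32_t)mem[3];
    }

    /* Hypothetical self-check for the helpers above. */
    void endianness_demo(void) {
        uint8_t mem[4];
        store_le32(mem, 0x11223344u);
        assert(mem[0] == 0x44 && mem[3] == 0x11);
        store_be32(mem, 0x11223344u);
        assert(mem[0] == 0x11 && mem[3] == 0x44);
        assert(load_be32(mem) == 0x11223344u);
    }

The value in the "register" (the ``uint32_t``) is the same in both cases; only the byte order in ``mem`` differs.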

``LDR`` and ``LD1``
===================

.. figure:: ARM-BE-ldr.png
    :align: right

    Big endian vector load using ``LDR``.


A vector is a consecutive sequence of items that are operated on simultaneously. To load a 64-bit vector, 64 bits need to be read from memory. In little endian mode, we can do this by just performing a 64-bit load - ``LDR d0, [foo]``. However, if we try this in big endian mode, the byte swapping causes the lane indices to be swapped as well! The zero'th item as laid out in memory becomes the highest-numbered lane in the vector.

.. figure:: ARM-BE-ld1.png
    :align: right

    Big endian vector load using ``LD1``. Note that the lanes retain the correct ordering.


Because of this, the ``LD1`` instruction performs a vector load but swaps bytes within each individual item of the vector rather than across the entire 64 bits. This means that the register content is the same as it would have been on a little endian system.
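
The difference between the two loads can be modelled in plain C. The following is a sketch under stated assumptions, not LLVM code: a register is modelled as a byte array in which ``reg[0]`` is the least significant byte, and the names ``be_ldr``, ``be_ld1``, ``lane32`` and ``ldr_vs_ld1_demo`` are hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Big endian LDR model: the whole n-byte register is one number, so
       the lowest-addressed byte becomes the most significant byte. */
    void be_ldr(uint8_t *reg, const uint8_t *mem, size_t n) {
        for (size_t i = 0; i < n; ++i)
            reg[i] = mem[n - 1 - i];
    }

    /* Big endian LD1 model: bytes are swapped within each lane, but
       lane k is still filled from the k'th element in memory. */
    void be_ld1(uint8_t *reg, const uint8_t *mem, size_t n, size_t lane_bytes) {
        for (size_t lane = 0; lane < n / lane_bytes; ++lane)
            for (size_t j = 0; j < lane_bytes; ++j)
                reg[lane * lane_bytes + j] =
                    mem[lane * lane_bytes + (lane_bytes - 1 - j)];
    }

    /* Read 32-bit lane k from the register model. */
    uint32_t lane32(const uint8_t *reg, size_t k) {
        uint32_t v = 0;
        for (size_t j = 0; j < 4; ++j)
            v |= (uint32_t)reg[4 * k + j] << (8 * j);
        return v;
    }

    void ldr_vs_ld1_demo(void) {
        /* Big endian memory image of the array {0x11223344, 0x55667788}. */
        const uint8_t mem[8] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88};
        uint8_t reg[8];

        be_ldr(reg, mem, 8);
        assert(lane32(reg, 0) == 0x55667788u); /* element 1 landed in lane 0 */
        assert(lane32(reg, 1) == 0x11223344u);

        be_ld1(reg, mem, 8, 4);
        assert(lane32(reg, 0) == 0x11223344u); /* element 0 stays in lane 0 */
        assert(lane32(reg, 1) == 0x55667788u);
    }

In this model both loads produce correctly byte-swapped 32-bit values; they differ only in which lane each value ends up in.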

It may seem that ``LD1`` should suffice to perform vector loads on a big endian machine. However, each approach has pros and cons, which makes the choice of register format less than simple.

There are two options:

    1. The content of a vector register is the same *as if* it had been loaded with an ``LDR`` instruction.
    2. The content of a vector register is the same *as if* it had been loaded with an ``LD1`` instruction.

Because ``LD1 == LDR + REV`` and similarly ``LDR == LD1 + REV`` (on a big endian system), we can simulate either type of load with the other type of load plus a ``REV`` instruction. So we're not deciding which instructions to use, but which format to use (which will then influence which instruction is best to use).
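
The two identities can be checked with a small C model of a 64-bit register. This is only a sketch: ``reg[0]`` is assumed to be the least significant byte of the register, ``REV`` is modelled as reversing the order of the lanes, and all function names are hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    enum { REG_BYTES = 8 }; /* one 64-bit D register */

    /* Big endian LDR model: the register is a single 64-bit number. */
    void be_ldr(uint8_t *reg, const uint8_t *mem) {
        for (int i = 0; i < REG_BYTES; ++i)
            reg[i] = mem[REG_BYTES - 1 - i];
    }

    /* Big endian LD1 model: bytes are swapped within each lane only. */
    void be_ld1(uint8_t *reg, const uint8_t *mem, int lane_bytes) {
        for (int i = 0; i < REG_BYTES; ++i) {
            int lane = i / lane_bytes, byte = i % lane_bytes;
            reg[i] = mem[lane * lane_bytes + (lane_bytes - 1 - byte)];
        }
    }

    /* REV model: reverse the lane order, leaving each lane's bytes intact. */
    void rev_lanes(uint8_t *reg, int lane_bytes) {
        uint8_t tmp[REG_BYTES];
        int lanes = REG_BYTES / lane_bytes;
        for (int lane = 0; lane < lanes; ++lane)
            memcpy(tmp + lane * lane_bytes,
                   reg + (lanes - 1 - lane) * lane_bytes, (size_t)lane_bytes);
        memcpy(reg, tmp, REG_BYTES);
    }

    /* Returns 1 if LDR + REV matches LD1 and LD1 + REV matches LDR. */
    int loads_are_interchangeable(const uint8_t *mem, int lane_bytes) {
        uint8_t ldr[REG_BYTES], ld1[REG_BYTES], a[REG_BYTES], b[REG_BYTES];
        be_ldr(ldr, mem);
        be_ld1(ld1, mem, lane_bytes);
        memcpy(a, ldr, REG_BYTES);
        rev_lanes(a, lane_bytes);
        memcpy(b, ld1, REG_BYTES);
        rev_lanes(b, lane_bytes);
        return memcmp(a, ld1, REG_BYTES) == 0 && memcmp(b, ldr, REG_BYTES) == 0;
    }

    void interchange_demo(void) {
        const uint8_t mem[REG_BYTES] = {1, 2, 3, 4, 5, 6, 7, 8};
        assert(loads_are_interchangeable(mem, 1));
        assert(loads_are_interchangeable(mem, 2));
        assert(loads_are_interchangeable(mem, 4));
    }

The ``REV`` needed depends on the lane size, which is why the choice of format also determines which instruction is cheapest in each situation.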

.. The 'clearer' container is required to make the following section header come after the floated
   images above.
.. container:: clearer

    Note that throughout this section we only mention loads. Stores have exactly the same problems as their associated loads, so have been skipped for brevity.


Considerations
==============

LLVM IR Lane ordering
---------------------

LLVM IR has first class vector types. In LLVM IR, the zero'th element of a vector resides at the lowest memory address. The optimizer relies on this property in certain areas, for example when concatenating vectors together. The intention is for arrays and vectors to have identical memory layouts - ``[4 x i8]`` and ``<4 x i8>`` should be represented the same in memory. Without this property there would be many special cases that the optimizer would have to cleverly handle.

Use of ``LDR`` would break this lane ordering property. This doesn't preclude the use of ``LDR``, but we would have to do one of two things:

   1. Insert a ``REV`` instruction to reverse the lane order after every ``LDR``.
   2. Disable all optimizations that rely on lane layout, and for every access to an individual lane (``insertelement``/``extractelement``/``shufflevector``) reverse the lane index.

AAPCS
-----

The ARM procedure call standard (AAPCS) defines the ABI for passing vectors between functions in registers. It states:

    When a short vector is transferred between registers and memory it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single ``STR`` of the entire register; a short vector is loaded from memory using the corresponding ``LDR`` instruction. On a little-endian system this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector.

    -- Procedure Call Standard for the ARM 64-bit Architecture (AArch64), 4.1.2 Short Vectors

The use of ``LDR`` and ``STR`` as the ABI defines has at least one advantage over ``LD1`` and ``ST1``. ``LDR`` and ``STR`` are oblivious to the size of the individual lanes of a vector. ``LD1`` and ``ST1`` are not - the lane size is encoded within them. This is important across an ABI boundary, because it would become necessary to know the lane width the callee expects. Consider the following code:

.. code-block:: c

    /* callee.c */
    void callee(uint32x2_t v) {
      ...
    }

    /* caller.c */
    extern void callee(uint32x2_t);
    void caller() {
      callee(...);
    }

If ``callee`` changed its signature to take a ``uint16x4_t``, which is equivalent in register content, then passing the argument in ``LD1`` format would break this code until ``caller`` was updated and recompiled.
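
The lane-size dependence can be made concrete with a C model of a 64-bit register. This is a sketch only: ``reg[0]`` is assumed to be the least significant byte of the register, and ``be_ld1``, ``lane16`` and ``abi_demo`` are hypothetical names.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    enum { REG_BYTES = 8 }; /* one 64-bit D register */

    /* Big endian LD1 model: the result depends on the lane size in bytes. */
    void be_ld1(uint8_t *reg, const uint8_t *mem, int lane_bytes) {
        for (int i = 0; i < REG_BYTES; ++i) {
            int lane = i / lane_bytes, byte = i % lane_bytes;
            reg[i] = mem[lane * lane_bytes + (lane_bytes - 1 - byte)];
        }
    }

    /* Read 16-bit lane k, as a callee expecting uint16x4_t would. */
    uint16_t lane16(const uint8_t *reg, int k) {
        return (uint16_t)(reg[2 * k] | (reg[2 * k + 1] << 8));
    }

    void abi_demo(void) {
        /* Big endian image of the uint16_t array {0x1122, 0x3344, 0x5566, 0x7788}. */
        const uint8_t mem[REG_BYTES] = {0x11, 0x22, 0x33, 0x44,
                                        0x55, 0x66, 0x77, 0x88};
        uint8_t as_2s[REG_BYTES], as_4h[REG_BYTES];

        be_ld1(as_2s, mem, 4); /* a caller still built against uint32x2_t */
        be_ld1(as_4h, mem, 2); /* what a callee taking uint16x4_t expects */

        assert(memcmp(as_2s, as_4h, REG_BYTES) != 0); /* register images differ */
        assert(lane16(as_4h, 0) == 0x1122);
        assert(lane16(as_2s, 0) == 0x3344); /* callee sees the wrong element */
    }

An ``LDR``-style load has no lane size parameter at all, so the same memory always produces the same register image regardless of the type on either side of the call.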

There is an argument that if the signatures of the two functions are different then the behaviour should be undefined. But there may be functions that are agnostic to the lane layout of the vector, and treating the vector as an opaque value (just loading it and storing it) would be impossible without a common format across ABI boundaries.

So to preserve ABI compatibility, we need to use the ``LDR`` lane layout across function calls.

Alignment
---------

In strict alignment mode, ``LDR qX`` requires its address to be 128-bit aligned, whereas ``LD1`` only requires it to be as aligned as the lane size. If we canonicalised on using ``LDR``, we'd still need to use ``LD1`` in some places to avoid alignment faults (the result of the ``LD1`` would then need to be reversed with ``REV``).
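
The alignment rule stated above can be written down as a pair of predicates. This is a minimal sketch of that rule only; the helper names are hypothetical and nothing here reflects how LLVM itself tests alignment.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Nonzero if addr is a multiple of align (align must be a power of two). */
    int is_aligned(uintptr_t addr, uintptr_t align) {
        return (addr & (align - 1)) == 0;
    }

    /* Strict alignment mode: LDR qX needs a 16-byte (128-bit) aligned address. */
    int can_use_ldr_q(uintptr_t addr) {
        return is_aligned(addr, 16);
    }

    /* LD1 only needs the address aligned to the lane size in bytes. */
    int can_use_ld1(uintptr_t addr, uintptr_t lane_bytes) {
        return is_aligned(addr, lane_bytes);
    }

    void alignment_demo(void) {
        assert(can_use_ldr_q(0x1000));
        assert(!can_use_ldr_q(0x1004)); /* only 4-byte aligned: LDR q would fault */
        assert(can_use_ld1(0x1004, 4)); /* LD1 of 32-bit lanes is still allowed */
    }

Under an ``LDR`` canonical format, the second case is where an ``LD1`` followed by a ``REV`` would have to be emitted instead.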

Most operating systems, however, do not run with alignment faults enabled, so this is often not an issue.

Summary
-------

The following table summarises the instructions that must be emitted to satisfy each property mentioned above under each of the two solutions.

+-------------------------------+-------------------------------+---------------------+
|                               | ``LDR`` layout                | ``LD1`` layout      |
+===============================+===============================+=====================+
| Lane ordering                 |   ``LDR + REV``               |    ``LD1``          |
+-------------------------------+-------------------------------+---------------------+
| AAPCS                         |   ``LDR``                     |    ``LD1 + REV``    |
+-------------------------------+-------------------------------+---------------------+
| Alignment for strict mode     |   ``LDR`` / ``LD1 + REV``     |    ``LD1``          |
+-------------------------------+-------------------------------+---------------------+

Neither approach is perfect, and choosing one boils down to choosing the lesser of two evils. Handling the lane ordering issue under the ``LDR`` layout would have required changes to target-agnostic compiler passes and would have resulted in a strange IR in which lane indices were reversed. It was decided that this was worse than the changes that would have to be made to support ``LD1``, so ``LD1`` was chosen as the canonical vector load instruction (and by inference, ``ST1`` for vector stores).

Implementation
==============

There are 3 parts to the implementation:

    1. Predicate ``LDR`` and ``STR`` instructions so that they are never allowed to be selected to generate vector loads and stores. The exception is one-lane vectors [1]_ - these by definition cannot have lane ordering problems so are fine to use ``LDR``/``STR``.

    2. Create code generation patterns for bitconverts that create ``REV`` instructions.

    3. Make sure appropriate bitconverts are created so that vector values get passed over call boundaries as 1-element vectors (which is the same as if they were loaded with ``LDR``).

Bitconverts
-----------

.. image:: ARM-BE-bitcastfail.png
    :align: right

The main problem with the ``LD1`` solution is dealing with bitconverts (or bitcasts, or reinterpret casts). These are pseudo instructions that only change the compiler's interpretation of data, not the underlying data itself. A requirement is that if data is loaded and then saved again (called a "round trip"), the memory contents should be the same after the store as before the load. If a vector is loaded and is then bitconverted to a different vector type before storing, the round trip will currently be broken.

Take for example this code sequence::

    %0 = load <4 x i32>* %x
    %1 = bitcast <4 x i32> %0 to <2 x i64>
         store <2 x i64> %1, <2 x i64>* %y

This would produce a code sequence such as that in the figure on the right. The mismatched ``LD1`` and ``ST1`` cause the stored data to differ from the loaded data.
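
The broken round trip can be reproduced with a C model of a 128-bit register. This is an illustrative sketch, not LLVM code: ``reg[0]`` is assumed to be the least significant byte of the register, and ``be_ld1``, ``be_st1``, ``round_trip_ok`` and ``bitcast_demo`` are hypothetical names.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    enum { REG_BYTES = 16 }; /* one 128-bit Q register */

    /* Big endian LD1 model: swap bytes within each lane of lane_bytes bytes. */
    void be_ld1(uint8_t *reg, const uint8_t *mem, int lane_bytes) {
        for (int i = 0; i < REG_BYTES; ++i) {
            int lane = i / lane_bytes, byte = i % lane_bytes;
            reg[i] = mem[lane * lane_bytes + (lane_bytes - 1 - byte)];
        }
    }

    /* Big endian ST1 model: the exact inverse of be_ld1 for the same lane size. */
    void be_st1(uint8_t *mem, const uint8_t *reg, int lane_bytes) {
        for (int i = 0; i < REG_BYTES; ++i) {
            int lane = i / lane_bytes, byte = i % lane_bytes;
            mem[lane * lane_bytes + (lane_bytes - 1 - byte)] = reg[i];
        }
    }

    /* Returns 1 if loading with one lane size and storing with another
       leaves the memory contents unchanged. */
    int round_trip_ok(int load_lane_bytes, int store_lane_bytes) {
        uint8_t x[REG_BYTES], y[REG_BYTES], reg[REG_BYTES];
        for (int i = 0; i < REG_BYTES; ++i)
            x[i] = (uint8_t)i;
        be_ld1(reg, x, load_lane_bytes);  /* LD1 v0.4s, [x] when 4 */
        be_st1(y, reg, store_lane_bytes); /* ST1 v0.2d, [y] when 8 */
        return memcmp(x, y, REG_BYTES) == 0;
    }

    void bitcast_demo(void) {
        assert(round_trip_ok(4, 4));  /* matching LD1/ST1 round-trips */
        assert(!round_trip_ok(4, 8)); /* a no-op bitcast to <2 x i64> does not */
    }

In this model the mismatched pair swaps the two 32-bit elements within each 64-bit half of the stored data.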

.. container:: clearer

    When we see a bitcast from type ``X`` to type ``Y``, what we need to do is to change the in-register representation of the data to be *as if* it had just been loaded by an ``LD1`` of type ``Y``.

.. image:: ARM-BE-bitcastsuccess.png
    :align: right

Conceptually this is simple - we can insert a ``REV`` undoing the ``LD1`` of type ``X`` (converting the in-register representation to the same as if it had been loaded by ``LDR``) and then insert another ``REV`` to change the representation to be as if it had been loaded by an ``LD1`` of type ``Y``.

For the previous example, this would be::

    LD1   v0.4s, [x]

    REV64 v0.4s, v0.4s                  // There is no REV128 instruction, so it must be synthesized
    EXT   v0.16b, v0.16b, v0.16b, #8    // with a REV64 then an EXT to swap the two 64-bit elements.

    REV64 v0.2d, v0.2d
    EXT   v0.16b, v0.16b, v0.16b, #8

    ST1   v0.2d, [y]

It turns out that these ``REV`` pairs can, in almost all cases, be squashed together into a single ``REV``. For the example above, a ``REV128 4s`` + ``REV128 2d`` is actually a ``REV64 4s``, as shown in the figure on the right.
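
The squashing claim for this example can be checked with a small C model. This is a sketch under stated assumptions: the 128-bit register is a byte array with ``reg[0]`` as the least significant byte, and the hypothetical helper ``rev(reg, chunk, lane)`` reverses the order of ``lane``-byte lanes within each ``chunk``-byte chunk, so that ``REV128 4s`` is ``rev(reg, 16, 4)``, ``REV128 2d`` is ``rev(reg, 16, 8)`` and ``REV64 4s`` is ``rev(reg, 8, 4)``.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Reverse the order of lane-byte lanes inside every chunk-byte chunk
       of a 16-byte register model. Bytes within a lane keep their order. */
    void rev(uint8_t reg[16], int chunk, int lane) {
        uint8_t tmp[16];
        int lanes_per_chunk = chunk / lane;
        for (int i = 0; i < 16; ++i) {
            int c = i / chunk;          /* which chunk this byte is in */
            int l = (i % chunk) / lane; /* which lane within the chunk */
            int b = i % lane;           /* which byte within the lane */
            tmp[i] = reg[c * chunk + (lanes_per_chunk - 1 - l) * lane + b];
        }
        memcpy(reg, tmp, 16);
    }

    void rev_demo(void) {
        uint8_t two_step[16], one_step[16];
        for (int i = 0; i < 16; ++i)
            two_step[i] = one_step[i] = (uint8_t)i;

        rev(two_step, 16, 4); /* REV128 4s: undo the LD1 of <4 x i32> */
        rev(two_step, 16, 8); /* REV128 2d: redo as an LD1 of <2 x i64> */

        rev(one_step, 8, 4);  /* REV64 4s */

        assert(memcmp(two_step, one_step, 16) == 0);
    }

The model only covers this one pair of types; other type pairs would need the corresponding lane sizes substituted.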

.. [1] One lane vectors may seem useless as a concept but they serve to distinguish between values held in general purpose registers and values held in NEON/VFP registers. For example, an ``i64`` would live in an ``x`` register, but ``<1 x i64>`` would live in a ``d`` register.