README.md

# Efficient and performance-portable SIMD

Highway is a C++ library for SIMD (Single Instruction, Multiple Data), i.e.
applying the same operation to multiple 'lanes' of data.

## Why Highway?

- more portable (same source code) than platform-specific intrinsics,
- works on a wider range of compilers than compiler-specific vector extensions,
- more dependable than autovectorization,
- easier to write/maintain than assembly language,
- supports **runtime dispatch**,
- supports **variable-length vector** architectures.

## Current status

Supported targets: scalar, SSE4, AVX2, AVX-512, NEON (ARMv7 and v8), WASM SIMD.
Ports to RVV and SVE/SVE2 are in progress.

Version 0.11 is considered stable enough to use in other projects, and is
expected to remain backwards compatible unless serious issues are discovered
while implementing the SVE/RVV targets. After these targets are added, Highway
will reach version 1.0.

Continuous integration tests build with a recent version of Clang (running on
x86 and QEMU for ARM) and MSVC from VS2015 (running on x86).

Before releases, we also test on x86 with Clang and GCC, and on ARMv7/8 via
GCC cross-compile and QEMU. See the
[testing process](g3doc/release_testing_process.md) for details.

The `contrib` directory contains SIMD-related utilities: an image class with
aligned rows, and a math library (16 functions already implemented, mostly
trigonometry).

## Installation

This project uses CMake to generate and build. On a Debian-based system you
can install it via:

```bash
sudo apt install cmake
```

Highway's unit tests use [googletest](https://github.com/google/googletest).
By default, Highway's CMake downloads this dependency at configuration time.
You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON
and installing gtest separately:

```bash
sudo apt install libgtest-dev
```

To build and test the library, the standard CMake workflow can be used:

```bash
mkdir -p build && cd build
cmake ..
make -j && make test
```

Or you can run `run_tests.sh` (`run_tests.bat` on Windows).

To test on all of the attainable targets for your platform, use
`cmake .. -DCMAKE_CXX_FLAGS="-DHWY_COMPILE_ALL_ATTAINABLE"`. Otherwise, the
default configuration skips baseline targets (e.g. scalar) that are superseded
by another baseline target.

Bazel is also supported for building, but it is not as widely used/tested.

## Quick start

You can use the `benchmark` inside examples/ as a starting point.

A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
and their parameters, and the [instruction matrix][instmtx] indicates the
number of instructions per operation.

We recommend using full SIMD vectors whenever possible for maximum performance
portability. To obtain them, pass a `HWY_FULL(float)` tag to functions such as
`Zero/Set/Load`. There is also the option of a vector of up to `N` (a power of
two) lanes: `HWY_CAPPED(T, N)`. 128-bit vectors are guaranteed to be available
for lanes of type `T` if `HWY_TARGET != HWY_SCALAR` and `N == 16 / sizeof(T)`.

Functions using Highway must reside inside `namespace HWY_NAMESPACE {`
(possibly nested in one or more other namespaces defined by the project), and
must additionally either be prefixed with `HWY_ATTR` or reside between
`HWY_BEFORE_NAMESPACE()` and `HWY_AFTER_NAMESPACE()`.

*   For static dispatch, `HWY_TARGET` will be the best available target among
    `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see the
    [quick-reference](g3doc/quick_reference.md)). Functions inside
    `HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)`
    within the same module in which they are defined. You can call the function
    from other modules by wrapping it in a regular function and declaring the
    regular function in a header.

*   For dynamic dispatch, the `HWY_EXPORT` macro generates a table of function
    pointers, which `HWY_DYNAMIC_DISPATCH(func)(args)` uses to call the
    function pointer for the best target supported by the current CPU. A module
    is automatically compiled for each target in `HWY_TARGETS` (see the
    [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
    defined and `foreach_target.h` is included.

## Strip-mining loops

To vectorize a loop, "strip-mining" transforms it into an outer loop and an
inner loop whose number of iterations matches the preferred vector width.

In this section, let `T` denote the element type, `d = HWY_FULL(T)`, `count`
the number of elements to process, and `N = Lanes(d)` the number of lanes in a
full vector. Assume the loop body is given as a function
`template<bool partial, class D> void LoopBody(D d, size_t max_n)`.

Highway offers several ways to express loops where `N` need not divide `count`:

*   Ensure all inputs/outputs are padded. Then the loop is simply

    ```c++
    for (size_t i = 0; i < count; i += N) LoopBody<false>(d, 0);
    ```
    Here, the template parameter and the second function argument are not
    needed.

    This is the preferred option, unless `N` is in the thousands and vector
    operations are pipelined with long latencies. This was the case for
    supercomputers in the 90s, but nowadays ALUs are cheap and we see most
    implementations split vectors into 1, 2 or 4 parts, so there is little cost
    to processing entire vectors even if we do not need all their lanes. Indeed
    this avoids the (potentially large) cost of predication or partial
    loads/stores on older targets, and does not duplicate code.

*   Process whole vectors as above, followed by a scalar loop:

    ```c++
    size_t i = 0;
    for (; i + N <= count; i += N) LoopBody<false>(d, 0);
    for (; i < count; ++i) LoopBody<false>(HWY_CAPPED(T, 1)(), 0);
    ```
    The template parameter and the second function argument are again not
    needed.

    This avoids duplicating code, and is reasonable if `count` is large.
    Otherwise, multiple iterations may be slower than one `LoopBody` variant
    with masking, especially because the `HWY_SCALAR` target selected by
    `HWY_CAPPED(T, 1)` is slower for some operations due to workarounds for
    undefined behavior in C++.

*   Process whole vectors as above, followed by a single call to a modified
    `LoopBody` with masking:

    ```c++
    size_t i = 0;
    for (; i + N <= count; i += N) {
      LoopBody<false>(d, 0);
    }
    if (i < count) {
      LoopBody<true>(d, count - i);
    }
    ```
    Now the template parameter and the second function argument can be used
    inside `LoopBody` to replace `Load/Store` of full aligned vectors with
    `LoadN/StoreN(n)`, which affect no more than `1 <= n <= N` aligned elements
    (pending implementation).

    This is a good default when it is infeasible to ensure vectors are padded.
    In contrast to the scalar loop, only a single final iteration is needed.

## Design philosophy

*   Performance is important but not the sole consideration. Anyone who goes to
    the trouble of using SIMD clearly cares about speed. However, portability,
    maintainability and readability also matter, otherwise we would write in
    assembly. We aim for performance within 10-20% of a hand-written assembly
    implementation on the development platform.

*   The guiding principles of C++ are "pay only for what you use" and "leave no
    room for a lower-level language below C++". We apply these by defining a
    SIMD API that ensures operation costs are visible, predictable and minimal.

*   Performance portability is important, i.e. the API should be efficient on
    all target platforms. Unfortunately, common idioms for one platform can be
    inefficient on others. For example: summing lanes horizontally versus
    shuffling. Documenting which operations are expensive does not prevent
    their use, as evidenced by the widespread use of `HADDPS`. Performance
    acceptance tests may detect large regressions, but do not help choose the
    approach during initial development. Analysis tools can warn about some
    potential inefficiencies, but likely not all. We instead provide
    [a carefully chosen set of vector types and operations that are efficient
    on all target platforms][instmtx] (PPC8, SSE4/AVX2+, ARMv8).

*   Future SIMD hardware features are difficult to predict. For example, AVX2
    came with surprising semantics (almost no interaction between 128-bit
    blocks) and AVX-512 added two kinds of predicates (writemask and zeromask).
    To ensure the API reflects hardware realities, we take a flexible approach
    that adds new operations as they become commonly available, with scalar
    fallbacks where they are not supported.

*   Masking is not yet widely supported on current CPUs. It is difficult to
    define an interface that provides access to all platform features while
    retaining performance portability. The P0214R5 proposal lacks support for
    AVX-512/ARM SVE zeromasks. We suggest limiting the use of masks to the
    `IfThen[Zero]Else[Zero]` functions until the community has gained more
    experience with them.

*   "Width-agnostic" SIMD is more future-proof than user-specified fixed sizes.
    For example, valarray-like code can iterate over a 1D array with a
    library-specified vector width. This will result in better code when vector
    sizes increase, and matches the direction taken by
    [ARM SVE](https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf)
    and RISC-V V, as well as Agner Fog's
    [ForwardCom instruction set proposal](https://goo.gl/CFizWu). However, some
    applications may require fixed sizes, so we also guarantee support for
    128-bit vectors in each instruction set.

*   The API and its implementation should be usable and efficient with commonly
    used compilers, including MSVC. For example, we write `ShiftLeft<3>(v)`
    instead of `v << 3` because MSVC 2017 (ARM64) does not propagate the
    literal (https://godbolt.org/g/rKx5Ga). Highway requires function-specific
    target attributes, supported by GCC 4.9 / Clang 3.9 / MSVC 2015.

*   Efficient and safe runtime dispatch is important. Modules such as image or
    video codecs are typically embedded into larger applications such as
    browsers, so they cannot require separate binaries for each CPU. Libraries
    also cannot predict whether the application already uses AVX2 (and pays the
    frequency throttling cost), so this decision must be left to the
    application. Using only the lowest-common-denominator instructions
    sacrifices too much performance. Therefore, we provide code paths for
    multiple instruction sets and choose the most suitable at runtime. To
    reduce overhead, dispatch should be hoisted to higher layers instead of
    being checked inside every low-level function. Highway supports inlining
    functions in the same file or in `*-inl.h` headers. We generate all code
    paths from the same source to reduce implementation and debugging cost.

*   Not every CPU need be supported. For example, pre-SSE4.1 CPUs are
    increasingly rare and the AVX instruction set is limited to floating-point
    operations. To reduce code size and compile time, we provide
    specializations for the SSE4, AVX2 and AVX-512 instruction sets on x86,
    plus a scalar fallback.

*   Access to platform-specific intrinsics is necessary for acceptance in
    performance-critical projects. We provide conversions to and from
    intrinsics to allow utilizing specialized platform-specific functionality,
    and to simplify incremental porting of existing code.

*   The core API should be compact and easy to learn. We provide only the few
    dozen operations which are necessary and sufficient for most of the 150+
    SIMD applications we examined.

## Prior API designs

The author has been writing SIMD code since 2002: first via assembly language,
then intrinsics, later Intel's `F32vec4` wrapper, followed by three generations
of custom vector classes. The first used macros to generate the classes, which
reduces duplication but also readability. The second used templates instead.
The third (used in highwayhash and PIK) added support for AVX2 and runtime
dispatch. The current design (used in JPEG XL) enables code generation for
multiple platforms and/or instruction sets from the same source, and improves
runtime dispatch.

## Differences versus [P0214R5 proposal](https://goo.gl/zKW4SA)

1.  Adding widely used and portable operations such as `AndNot`,
    `AverageRound`, bit-shift by immediates and `IfThenElse`.

1.  Designing the API to avoid or minimize overhead on AVX2/AVX-512 caused by
    crossing 128-bit 'block' boundaries.

1.  Avoiding the need for non-native vectors. By contrast, P0214R5's
    `simd_cast` returns `fixed_size<>` vectors, which are more expensive to
    access because they reside on the stack. We can avoid this plus additional
    overhead on ARM/AVX2 by defining width-expanding operations as functions of
    a vector part, e.g. promoting half a vector of `uint8_t` lanes to one full
    vector of `uint16_t`, or demoting full vectors to half vectors with
    half-width lanes.

1.  Guaranteeing access to the underlying intrinsic vector type. This ensures
    all platform-specific capabilities can be used. P0214R5 instead only
    'encourages' implementations to provide access.

1.  Enabling safe runtime dispatch and inlining in the same binary. P0214R5 is
    based on the Vc library, which does not provide assistance for linking
    multiple instruction sets into the same binary. The Vc documentation
    suggests compiling separate executables for each instruction set or using
    GCC's ifunc (indirect functions). The latter is compiler-specific and risks
    crashes due to ODR violations when compiling the same function with
    different compiler flags. We solve this problem via target-specific
    namespaces and attributes (see the Quick start section above). We also
    permit a mix of static target selection and runtime dispatch for hotspots
    that may benefit from newer instruction sets if available.

1.  Using built-in PPC vector types without a wrapper class. This leads to much
    better code generation with GCC 6.3: https://godbolt.org/z/pd2PNP.
    By contrast, P0214R5 requires a wrapper. We avoid this by using only the
    member operators provided by the PPC vectors; all other functions and
    typedefs are non-members. 2019-04 update: Clang power64le does not have
    this issue, so we simplified `get_part(d, v)` to `GetLane(v)`.

1.  Omitting inefficient or non-performance-portable operations such as `hmax`,
    `operator[]`, and unsupported integer comparisons. Applications can often
    replace these operations at lower cost than emulating that exact behavior.

1.  Omitting `long double` types: these are not commonly available in hardware.

1.  Ensuring signed integer overflow has well-defined semantics (wraparound).

1.  Simple header-only implementation, less than a tenth of the size of the
    Vc library from which P0214 was derived (98,000 lines in
    https://github.com/VcDevel/Vc according to the gloc Chrome extension).

1.  Avoiding hidden performance costs. P0214R5 allows implicit conversions from
    integer to float, which costs 3-4 cycles on x86. We make these conversions
    explicit to ensure their cost is visible.

## Other related work

*   [Neat SIMD](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7568423)
    adopts a similar approach with interchangeable vector/scalar types and
    a compact interface. It allows access to the underlying intrinsics, but
    does not appear to be designed for platforms other than x86.

*   UME::SIMD ([code](https://goo.gl/yPeVZx), [paper](https://goo.gl/2xpZrk))
    also adopts an explicit vectorization model with vector classes.
    However, it exposes the union of all platform capabilities, which makes the
    API harder to learn (209-page spec) and implement (the estimated LOC count
    is [500K](https://goo.gl/1THFRi)). The API is less performance-portable
    because it allows applications to use operations that are inefficient on
    other platforms.

*   Inastemp ([code](https://goo.gl/hg3USM), [paper](https://goo.gl/YcTU7S))
    is a vector library for scientific computing with some innovative features:
    automatic FLOPS counting, and "if/else branches" using lambda functions.
    It supports IBM Power8, but only provides float and double types.

## Overloaded function API

Most C++ vector APIs rely on class templates. However, the ARM SVE vector
type is sizeless and cannot be wrapped in a class. We instead rely on
overloaded functions. Overloading based on vector types is also undesirable
because SVE vectors cannot be default-constructed. We instead use a dedicated
'descriptor' type `Simd` for overloading, abbreviated to `D` for template
arguments and `d` for lvalues.

Note that generic function templates are possible (see highway.h).

## Masks

AVX-512 introduced a major change to the SIMD interface: special mask registers
(one bit per lane) that serve as predicates. It would be expensive to force
AVX-512 implementations to conform to the prior model of full vectors with
lanes set to all-one or all-zero bits. We instead provide a `Mask` type that
emulates a subset of this functionality on other platforms at zero cost.

Masks are returned by comparisons and `TestBit`; they serve as the input to
`IfThen*`. We provide conversions between masks and vector lanes. For clarity
and safety, we use `FF..FF` as the definition of true. To also benefit from
x86 instructions that only require the sign bit of floating-point inputs to be
set, we provide a special `ZeroIfNegative` function.

## Additional resources

*   [Highway introduction (slides)][intro]
*   [Overview of instructions per operation on different architectures][instmtx]

[intro]: g3doc/highway_intro.pdf
[instmtx]: g3doc/instruction_matrix.pdf

This is not an officially supported Google product.
Contact: janwas@google.com
367