• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

CMakeModules/H07-May-2022-872497

Examples/H03-May-2022-2,6761,864

Extra/H03-May-2022-6550

Patterns/H03-May-2022-13879

Src/H03-May-2022-8,8006,237

UTests/H03-May-2022-1,9651,604

.clang-formatH A D02-Jun-20172.2 KiB7675

.gitlab-ci.ymlH A D02-Jun-20173.3 KiB155140

LICENSEH A D02-Jun-20171.1 KiB2117

README.mdH A D02-Jun-201722.1 KiB562441

README.md

1# Inastemp - Intrinsics as template
2
3Version : 0.1 (29/06/2016)
4
5[TOC]
6
7
8## Introduction
9
10- What is Inastemp?
11Inastemp provides a set of `C++` classes to make vectorization with intrinsics easier.
12It aims at developing numerical kernels by separating the algorithm from the hardware target.
13Inastemp comes with several examples and patterns related to widespread use-cases.
14
15- License
16The code is is published under the MIT license.
17See the `LICENSE` file or [https://opensource.org/licenses/MIT] for details.
18
19- Target audience and use cases
20The library requires basic C++ knowledge about classes, templates, etc.
21It is not mandatory to really know what is vectorization all about, but it certainly helps.
22For example, the increment of loops is usually tied to the underlying size of the vector which means that one should understand why and where the increment is not equal to 1.
23A good way to use Inastemp would be to have a software engineer managing the inclusion and hiding the small complexities of the template process,
24and to have the scientists work with basic simple functions templatized against a vector type.
25
26- Overview
27The library itself purely consists of headers. However, it comes with CMAKE files.
28CMAKE will detect the hardware capacities and turn on some intrinsics by default.
29One can turn them on/off using `ccmake` or other configuration tool if needed.
30Unit tests and examples can be compiled and the headers installed with the usual CMAKE stage.
31
32
33### Templates --- Breaking the Genericity/Optimization Opposition
34
35The main point of Inastemp is to factorize the intrinsics code to let the developers implement kernels without knowing the final destination of the code.
36It has been common practice to write a kernel for each architecture (partly because, similarly, it has been common practice to develop in C/Fortran).  At this point, Inastemp fills the gap being responsible of the architecture.
37When developing a kernel for a specific target we usually make some specific optimizations.
38But any optimization usually means being less generic: Optimization by targeting an hardware, by using special instructions, by considering that the input data respect a given pattern, etc.
39Using C++ templates, the abstraction allows to have a generic algorithm and to choose only at compile time the optimizations that should be applied.
40It offers a huge benefit in term of factorization and maintainability of the code.
41```
42                 Generic <-----------------------> Optimized
43       All architectures                           One architecture
44               All input                           Special input
45
46Template C++ source-code                           Compiled Template C++ code
47```
48
49## Features of Inastemp
50
51- The following x86 SIMD types are currently supported:
52    - SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512-KNL, AVX512-SKL
53- The following Powere PC SIMD types are currently supported:
54    - Power-8 Altivec/VMX
55- arithmetic operators `*/+-` are provided
56- CPU capacities are detected automatically during the CMake stage
57- The compiler capacities are detected automatically during the CMake stage
58- The library purely contains of headers, no linkage is necessary.
59- CPU detection may use Intel®-SDE
60- Unit-tests may use Intel®-SDE
61- Fast intrinsic `exp()` function (if not supported natively by the compiler)
62- Explicit branches vectorization
63- several patterns which represent many applications are demonstrated
64
65
66## Compilation
67
68### Basic compilation
69
70```bash
71# Create a Build directory
72mkdir Build
73# Go into the Build directory
74cd Build
75# Run cmake - considering Inastemp is the upper directory
76cmake ..
77# OR with more output messages
78VERBOSE=1 cmake ..
79# Then make will build the unit test and examples
80make
81```
82
83### CMake variables - Hardware detection (X86)
84
85There are two hardware detections:
86- the instructions supported by the compiler
87- the instructions supported by the current CPU
88
89For each existing vector type in Inastemp, the cmake stage runs both tests.
90The results are print out if `VERBOSE=1`.
91If a vector type can be compiled, it creates the cmake variable `INASTEMP_USE_X` where `X` could be `SSE3`, `AVX2`, etc.
92Then, all the vector types that are supported by the CPU are turned to `ON` by default.
93So it is possible to turn on specific options even if the CPU does not support them (to execute over intel SDE for example).
94But if an options does not exist (even if the hardware seems appropriate) it means that the compiler does not support it.
95
96For example, here is a part of the output of the `ccmake ..` command on a AVX2 CPU and using GCC 6.1:
97```
98 INASTEMP_USE_AVX                 ON
99 INASTEMP_USE_AVX2                ON
100 INASTEMP_USE_AVX512KNL           OFF
101 INASTEMP_USE_AVX512SKL           OFF
102 INASTEMP_USE_SSE3                ON
103 INASTEMP_USE_SSE41               ON
104 INASTEMP_USE_SSE42               ON
105 INASTEMP_USE_SSSE3               ON
106```
107
108`AVX512KNL` and `AVX512SKL` are supported by the compiler but not by the hardware, so they are turned `OFF` but could be turn to `ON` if needed.
109
110By turning the cmake variable `INASTEMP_ISDE_CPU` to `ON` the hardware detection is done over intel SDE.
111In this case, one can ask Inastemp to check any hardware (passing the appropriate options to isde).
112
113### CMake variables - Hardware detection (Power PC)
114
115Inastemp perform the following test to enable the VMX classes (and to disable all the X86 classes):
116```
117if(${CMAKE_SYSTEM_PROCESSOR} STREQUAL "ppc64le")
118```
119If the detection looks incorrect, please check the value returned by `${CMAKE_SYSTEM_PROCESSOR}` and open an issue on our gitlab.
120
121### Using Inastemp as a sub-project with CMake
122
123Considering that the folder `inastemp exist in the root directory of the project, one could use:
124```
125## inastemp inclusion in CMakeLists.txt of the other project
126# tell inastemp we do not want examples/tests
127# (will be detected automatically otherwise because it is included as a subproject)
128set(INASTEMP_JUST_LIB TRUE)
129# Set the C++ standard required by the master project (default inastamp is 11)
130set(CMAKE_CXX_STANDARD 11) # Could be 11 or 14 ...
131set(CMAKE_CXX_STANDARD_REQUIRED ON) # Could be OFF in specific cases
132# add the cmakelist directory
133add_subdirectory(inastemp)
134# use the filled variables from inastemp
135INCLUDE_DIRECTORIES(
136         ${INASTEMP_BINARY_DIR}/Src
137         ${INASTEMP_SOURCE_DIR}/Src
138         ${INASTEMP_INCLUDE_DIR}
139    )
140# propagate the flags to be able to compile
141set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${INASTEMP_CXX_FLAGS}")
142```
143
144The statement `set(INASTEMP_JUST_LIB TRUE)` asks Inastemp not to compile tests/examples and to limit inclusion of extra warning flags in `INASTEMP_CXX_FLAGS`.
145The inclusion of Inastemp as a subproject does not modify the variables of the upper project.
146This is why we use the variables `${INASTEMP_CXX_FLAGS}` explicitly to transfer the hardware specific flags.
147
148When Inastemp detects it is included as a subproject, it turns the variable `INASTEMP_AS_SUBPROJECT` to `ON` and minimizes the additional flags.
149As a result, the variable `INASTEMP_CXX_FLAGS` will contain a flag for `C++11` and `-funroll-loops` (from `INASTEMP_EXTRA_CXX_FLAGS`), plus the flags needed for the target instructions.
150Such that, if one wants to compile only some files with these specific flags, it must use `INASTEMP_CXX_FLAGS` carefully.
151
152### Compilers support
153
154Inastemp was developed and tested using the following compilers on the x86_64 architecture.
155- Gcc 6.1 (earlier versions if AVX512/KNL/SKL are not used, like 4.9)
156- Clang 3.5
157- Intel 16.0
158Earlier versions may work as well.
159
160It was also tested on IBM POWER 8 NVL/4023GHz (Openpower) using the following compilers:
161- Gcc 6.2.0
162
163- Intel special flags
164We pass `-diag-disable 2304 -diag-disable 10121 -diag-disable 10120` to intel compiler to remove the implicit conversion warnings between Inastemp classes.
165In fact, we voluntarily want implicit conversion because it helps us to factorize and reduce the number of lines drastically.
166
167User can add the following flags to push for more inlining `-inline-forceinline -no-inline-max-total-size` for intel compiler.
168However, it looks like such options do not work for several compiler versions, and thus we do not want to add them by default.
169
170### Multiple hardwares compilation
171
172Inastemp is not designed to compile a binary targeting multiple hardwares (like having an execution path for the different possibilities).
173If such need appears, let us know and we might enable it in a next release.
174
175## Best practices
176
177To ensure good performance, developers should obey some rules:
178- Never try to use an abstract data type above the vector-type (especially never use `InaVecInterface`).
179In fact, the Inastemp functions must be inlined and known at compile time to be efficient.
180While in some cases the compiler may know what is behind an abstract type this is not always true and may lead to very poor performance.
181- Do not use the SCALAR classes for real development. SSE is available almost everywhere so SCALAR classes should be used only when really needed, like for example at the end of the loop if it is not possible to allocate a loop with a size equal to a multiple of the length of the vector.
182- Do not silently convert scalar float/double to vector type in a loop, as shown in the following example:
183
184```cpp
185for(){
186    // ...
187    VecType a = ....
188    VecType v = 1.0 * a; // Silent conversion of 1.0
189    // Or
190    VecType v = VecType(1.0) * a; // Explicit conversion
191    // In the previous lines 1.0 is converted to VecType
192    // It might be optimized by the compiler but please manually move it before the loop
193}
194// Do
195const VecType VecOne = 1;
196for(){
197    // ...
198    VecType a = ....
199    VecType v = VecOne * a;
200```
201
202- Never update a variable in a lambda function in a if-else structure (see If-else description)
203
204```cpp
205a = 1
206b = VecType::If(cond).Then([&](){ // take 'a' by ref
207        a += 1
208        return a;
209    }).Else([&](){
210        return a; // here 'a' is 1+1
211    });
212
213```
214
215## Exp Function (SIMD)
216
217Inastemp provides an implementation of the `exponential` function using intrinsics.
218It follows the nice paper _Fast Exponential Computation on SIMD Architectures_
219available at
220[https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures].
221The Remez polynomial is used, and the Scilab file to find out the coefficient is provided.
222
223
224## Intel® Software Development Emulator (iSDE/SDE)
225
226When using SDE, Inastemp looks for `sde64`, therefore you must update your path or put a symbolic link:
227```bash
228// Most system:
229export PATH=/PATH-TO-SDE-BIN/:$PATH
230// On ubuntu if sde is not installed in the system folder, I used
231sudo update-alternatives --install /usr/bin/sde64 sde64 /PATH-TO-SDE-BIN/sde64 1
232```
233
234
235## Simple Examples
236
237### Patterns/sumArrays1L
238Taken from the pattern examples, we simply sum two arrays.
239
240```cpp
241template < class VecType >
242void SumArrays(double* __restrict__ dest, const double* __restrict__ src1,
243               const double* __restrict__ src2, const int nbToProceed) {
244
245    for (int idx = 0; idx < nbToProceed; idx += VecType::VecLength) {
246        const VecType v1(&src1[idx]);
247        const VecType v2(&src2[idx]);
248        const VecType res = v1 + v2;
249        res.storeInArray(&dest[idx]);
250        // In one line:
251        //VecType(VecType(&src1[idx]) + VecType(&src2[idx]).storeInArray(&dest[idx]);
252    }
253}
254```
255
256### Examples/ParticlesWithBranch
257
258Here we compute the interactions between particles.  The scalar sequential code reads:
259
260```cpp
261void ScalarFunction(const int nbParticles, const double* __restrict__ positionsX,
262                        const double* __restrict__ positionsY,const double* __restrict__ positionsZ,
263                        const double* __restrict__ physicalValues, double* __restrict__ potentials,
264                        const double cutDistance, const double constantIfCut){
265    for(int idxTarget = 0 ; idxTarget < nbParticles ; ++idxTarget){
266        for(int idxSource = idxTarget+1 ; idxSource < nbParticles ; ++idxSource){
267            const double dx = positionsX[idxTarget] - positionsX[idxSource];
268            const double dy = positionsY[idxTarget] - positionsY[idxSource];
269            const double dz = positionsZ[idxTarget] - positionsZ[idxSource];
270
271            const double distance = std::sqrt(dx*dx + dy*dy + dz*dz);
272            const double inv_distance = 1/distance;
273
274            if(distance < cutDistance){
275                potentials[idxTarget]  += ( inv_distance * physicalValues[idxSource] );
276                potentials[idxSource] += ( inv_distance * physicalValues[idxTarget] );
277            }
278            else{
279                potentials[idxTarget]  += ( inv_distance * (physicalValues[idxTarget]-constantIfCut) );
280                potentials[idxSource] += ( inv_distance * (physicalValues[idxTarget]-constantIfCut) );
281            }
282        }
283    }
284}
285```
286
287A part of it is rewritten :
288
289```cpp
290template <class VecType>
291void VectorizedFunction(const int nbParticles, const double* __restrict__ positionsX,
292                        const double* __restrict__ positionsY,const double* __restrict__ positionsZ,
293                        const double* __restrict__ physicalValues, double* __restrict__ potentials,
294                        const double cutDistance, const double constantIfCut){
295
296    const VecType VecOne = 1;
297    const VecType VecConstantIfCut = constantIfCut;
298    const VecType VecCutDistance = cutDistance;
299
300    for(int idxTarget = 0 ; idxTarget < nbParticles ; ++idxTarget){
301
302        const VecType targetX = positionsX[idxTarget];
303        const VecType targetY = positionsY[idxTarget];
304        const VecType targetZ = positionsZ[idxTarget];
305
306        const VecType targetPhysicalValue = physicalValues[idxTarget];
307        VecType targetPotential = VecType::GetZero();
308
309        const int lastToCompute = ((nbParticles-(idxTarget+1))/VecType::VecLength)*VecType::VecLength+(idxTarget+1);
310
311        for(int idxSource = idxTarget+1 ; idxSource < lastToCompute ; idxSource += VecType::VecLength){
312            const VecType dx = targetX - VecType(&positionsX[idxSource]);
313            const VecType dy = targetY - VecType(&positionsY[idxSource]);
314            const VecType dz = targetZ - VecType(&positionsZ[idxSource]);
315
316            const VecType distance = VecType(dx*dx + dy*dy + dz*dz).sqrt();
317            const VecType inv_distance = VecOne/distance;
318
319            const typename VecType::MaskType testRes = (distance < VecCutDistance);
320
321            const VecType sourcesPhysicalValue = VecType(&physicalValues[idxSource]);
322
323            targetPotential += inv_distance * VecType::IfElse(testRes, sourcesPhysicalValue,
324                                                                      sourcesPhysicalValue-VecConstantIfCut);
325            const VecType resSource = inv_distance * VecType::IfElse(testRes, targetPhysicalValue,
326                                                                       targetPhysicalValue-VecConstantIfCut);
327            const VecType currentSource = VecType(&potentials[idxSource]);
328            (resSource+currentSource).storeInArray(&potentials[idxSource]);
329        }
330
331        potentials[idxTarget] += targetPotential.horizontalSum();
332        // .....
333```
334
335
336## Removing Branches
337
338In many cases applications have branches inside their kernels which makes them very difficult to vectorize.
339Inastemp provides tools to explicitly manage the branches.  Note that the approach computes all branches!
340Therefore, large branches may lead to large execution time.
341
342
343### Principle
344
345```cpp
346// Scalar code
347if( a < b )
348   res += c * d // Case A()
349else
350   res -= d // Case B()
351```
352
353The code example is equivalent to a multiplication by zero for the `false` case:
354```cpp
355// scalar code
356conditionA = ( a < b ) ? 1.0 : 0
357conditionB = !( a < b ) ? 1.0 : 0
358res += conditionA * c * d // Case A()
359res -= conditionB * d // Case B()
360```
361
362It is however faster to use bit masks as follows:
363```cpp
364// scalar code
365conditionATrue = ( a < b ) ? 0xFF..FF : 0
366res += conditionA AND (c * d) // Case A()
367res -= (NOT conditionATrue) AND d // Case B()
368```
369
370Using Inastemp one may explicitely use bit-masks.  Alternatively, the Inastemp if-else function can be used.
371
372
373### Explicit use of bit masks
374
375- Getting a mask:
376```cpp
377// Get mask by calling functions
378VecType::MaskType mask = VecType().Is{Lower,Greater}[OrEqual](a,b)
379VecType::MaskType mask = VecType().Is[Not]Equal(a,b)
380// and much more: IsPositive, IsZero,....
381// Or based on C++ operator overloading
382VecType::MaskType mask = a < b; // a (<|>)[=] b
383// and much more: ==, !=, etc..
384```
385
386- Using a mask:
387The VecType classes provide methods such as AND (&), OR (|), XOR (\^), NOT_AND.
388
389
390### If-else functions
391
392The VecType classes provide three methods to deal with conditions.
393Note that Inastemp computes all sub-results and performs all computation,
394ie. even for the false condition.
395
396- IfTrue(condition, value)
397
398```cpp
399res = VecType::IfTrue(a < b , c);
400// res is set to c if a < b, else to zero
401```
402
403The IfTrue method is mainly useful when a part of a formula is conditional:
404```cpp
405res += a * b + .IfTrue(a < b , c);
406```
407
408- IfElse(condition, valueIfTrue, valueIfFalse)
409
410```cpp
411res = VecType::IfElse(a < b , c, d);
412// res is set to c if a < b, else to d
413```
414
415- If-ifelse-else
416The vectorizer classes provide an `if` method which allows to return a value or to call a lambda function.
417
418```cpp
419// Faster
420c = VecType::If(test).Then(a).Else(b);
421// Should not be slower using functions
422c = VecType::If(test).Then([&](){return a;}).Else([&](){return b;});
423```
424
425Also, the lambda/anonymous functions must pass varaibles per references using `[&](){}` and not `[=](){}`.
426
427Here are some examples.
428
429```cpp
430z = something()
431a = VecType::If(cond1).Then( z )
432    .ElseIf(cond2).Then([&](){
433        VecType x = func(z)
434        return x;
435    })
436    .Else([&](){
437        return z*4;
438    });
439```
440
441The structure can be expressed as follows:
442```cpp
443VecType::If( a condition )
444                .Then(a value or a lambda function) // Then must follow If
445          .ElseIf( a condition ) // Optional else if
446                .Then(a value or a lambda function) // Then must follow ElseIf
447          .Else(a value or a lambda function) // Optional else
448
449```
450
451
452### Branch optimization
453
454Knowing that there are branches to be computed one can think about numerical optimizations.
455In the next example we compute two times `b * c` because both the branches are executed:
456```cpp
457a = VecType::If(cond1)
458    .Then([&](){
459        return b * c + 1;
460    })
461    .Else([&](){
462        return b * c + 2;
463    });
464```
465
466The developer should simplify the formulas as best as possible to reduce the number of instructions.
467```cpp
468bc = b * c
469a = VecType::If(cond1)
470    .Then([&](){
471        return bc + 1;
472    })
473    .Else([&](){
474        return bc + 2;
475    });
476```
477
478### Avoid the computation of all branches when not necessary
479
480In some cases, some branches are much more probable than others such that it could make sense to test if
481at least one value in the vector requieres to compute the exeptional path.
482It is possible to test if a MaskType is "all true" or "all false" and to act consequently.
483But this might reduce the performance while doing less work because of CPU missprediction and un-pipelining.
484
485```cpp
486// Scalar code
487if( a < b ){ // this is true 99% of the time
488   c = exp(a);
489}
490else {
491   c = exp(b);
492}
493
494// Vectorized code with all branches computed (2 exp call per scalar value)
495c = VecType::IfElse(a < b).Then( a.Exp() ).Else( b.Exp() );
496
497// Improved one
498VecType::MaskType mask_ab = (a < b);
499if( mask_ab.isAllTrue() ){ // This is higly probable
500    c = exp(a)
501}
502else{ // isAllFalse is not probable here
503    c = VecType::IfElse(a < b).Then(exp(a)).Else(exp(b));
504}
505```
506
507
508## How to add a new vector type
509
510To add a new intrisics type the following steps need to be taken.
511- Ensure the detection of the CPU if you want to automatically enable the SIMD types supported by your hardware.
512    See `CMakeModules/GetCpuInfos.cmake` and `CMakeModules/getCpuInfos.cpp`
513- Ensure that the compiler supports the seleced SIMD types (mandatory!).
514    See `CMakeModules/GetCompilerInfos.cmake`
515    - Add a folder `TYPE` in `CMakeModules` with the same test as the others :
516    	- `compileTestTYPE.cpp` to test if the compiler can compile the code.
517    	- `checkTYPEpe.cpp` to test if the compiler provide operators for this type.
518    - Modify the `GetCompilerInfos.cmake` file where the key `(ADD-NEW-HERE)` is located
519- Add single/double classes following the same syntax as the other:
520	- `InaTYPEOperators.cpp` for the operators (a simple modification of an existing file should do the job)
521	- `InaVecTYPE<{double,float>.hpp` to put the class for the new type in double and single precision
522- Add your type in the main `CMakeLists.txt` file
523
524
525## Common Compilation Errors
526
527### `Inlining failed` compilation error using GCC
528
529In case an error message similar to
530```cpp
531inlining failed in call to always_inline A-SIMD-FUNCTION : target specific option mismatch
532 ```
533is thrown it means that the flags passed to GCC are not enough to enable the
534function `A-SIMD-FUNCTION`. One should carefuly check the version of
535SSE/AVX/AVX512 related to the given function. A `VERBOSE=1 make` should show the
536passed flags.
537
538
539### `Ignoring attributes` GCC compilation warning
540```cpp
541warning: ignoring attributes on template argument ‘__mX {aka __vector(Y) float/double}’ [-Wignored-attributes]
542```
543This warning may safely be ignored because for the Inastemp implementation this
544is not a problem. In fact, Gcc tells us that the type passed by template has its
545attributes removed, because we do not use the template type to instantiate
546variables.
547
548### `Segmentation fault (core dumped)` while using If/ElseIf/Then anonymous/lambda functions
549
550Please ensure to pass the values per references, that is in the lambda function declarations using `[&](){}` and not `[=](){}`.
551In fact, it seems that the compiler is not propagate the memory alignement correctly when copying variable for anonymous functions.
552Therefore, you can also tryied your binary with iSDE and may obtain something like:
553```
554SDE ERROR:  TID: 0 executed instruction with an unaligned memory reference to address 0x1d2e490 INSTR: 0x00042495e: IFORM: VMOVAPS_MEMqq_YMMqq :: vmovaps ymmword ptr [rbx], ymm0
555	FUNCTION: _ZNSt14_Function_base13_Base_managerIZ24VectorizedFunctionLambdaI10InaVecAVX2IfEfEviPKT0_S6_S6_S6_PS4_S4_S4_EUlvE_E15_M_init_functorERSt9_Any_dataOS8_St17integral_constantIbLb0EE
556	FUNCTION ADDR: 0x000424923
557```
558
559# Authors
560- Berenger Bramas (berenger.bramas@mpcdf.mpg.de)
561- Logo Credit to D.M.
562