GNU APL supports, to some extent, execution of APL programs on multiple
CPU cores in parallel.

1. Supported Platforms
======================

As of SVN 484 parallel execution in GNU APL requires the following:

a. POSIX thread (pthread) support

b. the functions pthread_getaffinity_np() and pthread_setaffinity_np() to
   set core affinities (i.e. which thread runs on which CPU core) of the
   threads created by GNU APL for parallel execution

c. an atomic fetch-and-add function

While a. is normally available on all platforms supported by GNU APL, b. and
c. may not be available. Recent Linux versions provide a., b., and c.; other
platforms may be added later.

2. ./configure Options
======================

Parallel execution is currently experimental and must be ./configure'd
explicitly (see also README-2-configure). There are two ./configure options
related to parallel execution:

    CORE_COUNT_WANTED and
    PERFORMANCE_COUNTERS_WANTED

The first option, CORE_COUNT_WANTED, controls how many cores shall be used
for parallel execution, while the second option, PERFORMANCE_COUNTERS_WANTED,
enables performance counters in GNU APL. Performance counters are needed for
fine-tuning the parallelism as explained below.

There are also two top-level make targets called 'parallel' and 'parallel1':
'make parallel' calls ./configure with CORE_COUNT_WANTED=syl, while
'make parallel1' calls ./configure with CORE_COUNT_WANTED=syl and
PERFORMANCE_COUNTERS_WANTED=yes.
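
For example (an illustration only; see README-2-configure for the full set
of supported option values), 'make parallel1' corresponds roughly to
configuring and building like this:

    ./configure CORE_COUNT_WANTED=syl PERFORMANCE_COUNTERS_WANTED=yes
    make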

3. Parallelized Functions
=========================

As of SVN 484 the following APL functions were parallelized:

a. all monadic and dyadic scalar functions
b. inner products of all dyadic scalar functions
c. outer products of all dyadic scalar functions

More APL functions may be parallelized later.


4. Performance Measurements and Fine-Tuning
===========================================

The first benchmarks performed with the parallelized scalar functions have
shown that 'light-weight' scalar functions such as +, -, or ⌈ may execute
faster than their sequential counterparts on one platform but slower on
another platform. The difference between platforms can be dramatic.

For that reason it is possible to fine-tune the break-even points above
which parallel execution is used instead of sequential execution. There is
a break-even point for every monadic and for every dyadic scalar function.

4.1 Theory
----------

The sequential execution time of a scalar function is:

    A0 + (B0 × N) cycles,

where A0 is the 'start-up time' for the function, B0 is the 'per-item time'
for the function, and N is the vector length. The actual shape does not
matter for scalar functions, so adding 1 2 3 4 to itself takes the same time
as adding 2 2⍴1 2 3 4 to itself.

The sequential start-up time A0 is the same for all monadic scalar functions
and the same for all dyadic scalar functions; the start-up time for dyadic
functions is a little longer than for monadic scalar functions.

Now let c be the number of cores. The parallel execution time on c cores
is then:

    Ac + (Bc × N) cycles,

at least when N is a multiple of c. The difference when N is not a multiple
of c can be neglected. Again, the parallel start-up time Ac is the same for
all monadic scalar functions and the same for all dyadic scalar functions.
Furthermore Ac ≥ A0; the difference (Ac - A0) is the additional start-up and
join time for the thread pool computing the function in parallel.

From the 4 numbers A0, B0, Ac, and Bc one can compute the break-even point
BE, which is the vector length at which parallel execution becomes faster
than sequential execution. The break-even point is:

    BE ← (Ac - A0) ÷ (B0 - Bc)

The break-even point only exists if Bc < B0, that is, if the parallel
per-item cost Bc is smaller than the sequential per-item cost B0. On some
architectures Bc = B0 ÷ c (and then (Ac - A0) is the only concern), but
unfortunately for multi-core CPUs with a shared memory this is not the case.
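
As a sanity check of this formula, plugging in the numbers measured for ⋆
in Example 1 (section 4.3 below) reproduces the break-even length reported
there:

    A0 ← 58  ◊ Ac ← 690        ⍝ start-up costs in cycles (Example 1)
    B0 ← 147 ◊ Bc ← 91         ⍝ per-item costs in cycles for ⋆ (Example 1)
    ⌈ (Ac - A0) ÷ (B0 - Bc)    ⍝ 632 ÷ 56 ≈ 11.3, rounded up: 12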

4.2 Measuring A0, Ac, B0, and Bc
--------------------------------

There is a workspace 'ScalarBenchmark.apl' shipped with GNU APL that measures
the parameters A0, Ac, B0, and Bc and writes a file 'parallel_thresholds'
that can be fed back to GNU APL in order to set the break-even points for
all scalar functions. The workspace can also plot the results using aplplot.

The workspace is used as follows:

a. 'make parallel1' as discussed above,
b. adjust some parameters at the beginning of ScalarBenchmark.apl (number
   of cores, plotting yes/no, and loop counts),
c. run, e.g.: src/apl -f workspaces/ScalarBenchmark.apl

The last step c. creates a file 'parallel_thresholds' in the current
directory. If that file is placed in one of the directories used for the
'preferences' file (/etc/gnu-apl.d/ or /usr/local/etc/gnu-apl.d/ depending
on the ./configure prefix, or $HOME/.gnu-apl.d, or $HOME/.config/gnu-apl.d)
then GNU APL reads it on start-up and sets the break-even points for the
scalar functions accordingly.

'ScalarBenchmark.apl' first determines A0 and Ac using small vector
lengths N. With today's CPUs the measurement results vary considerably;
therefore several iterations are performed for each vector length and the
minimum over all iterations is taken. From these minima the least-squares
regression line over all lengths is computed (see the sketch below).

Then 'ScalarBenchmark.apl' determines the per-item costs B0 and Bc using a
fairly large vector length N.
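
Such a least-squares fit is a one-liner in APL. The following sketch is only
an illustration (it is not the code used in ScalarBenchmark.apl, and the
vectors N and T are made-up measurements): the regression line T ≈ A + B×N
is obtained with matrix divide ⌹, which returns the least-squares solution
when the matrix has more rows than columns:

    N ← 1 2 4 8 16               ⍝ vector lengths (hypothetical)
    T ← 110 170 290 530 1010     ⍝ minimum cycle counts (hypothetical)
    T ⌹ N ∘.⋆ 0 1                ⍝ columns N⋆0 and N⋆1; yields A B, here 50 60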

4.3 Example 1: Intel 4-core i5-4570 CPU using 3 cores on 64-bit Linux
---------------------------------------------------------------------

We discuss only two dyadic functions, + and ⋆. + is 'light-weight' and ⋆
is not. The summary shown by 'ScalarBenchmark.apl' for + and ⋆ is:

-------------- + Mix_IRC --------------
average sequential startup cost: 58 cycles
average parallel startup cost: 690 cycles
per item cost sequential: 61 cycles
per item cost parallel: 95 cycles
parallel break-even length: not reached

The good news is that the extra cost for parallel start-up and join of
(690 - 58) = 632 cycles is fairly low. We had earlier made benchmarks with
other parallel libraries which were semaphore-based and showed start-up
costs in the 10000-20000 cycle ballpark.

The bad news is that the parallel per-item cost for + is higher than the
sequential per-item cost, and therefore there is no break-even point for +.
The sequential per-item cost of 61 cycles is rather close to the values of
a separate benchmark that measured the main memory access times alone (see
tools/memory_benchmark). Our best explanation for not reaching a break-even
point for + (on this machine) is that more extra cycles are caused by
main-memory access conflicts between the different cores than cycles are
saved by parallelizing +.

For ⋆ the result is different because ⋆ is not as light-weight as +:

-------------- ⋆ Mix_IRC --------------
average sequential startup cost: 58 cycles
average parallel startup cost: 690 cycles
per item cost sequential: 147 cycles
per item cost parallel: 91 cycles
parallel break-even length: 12

4.4 Example 2: Intel 2-core E6550 CPU using 2 cores on 32-bit Linux
-------------------------------------------------------------------

On this older machine the results were rather different:

-------------- Mix_IRC + Mix1_IRC --------------
average sequential startup cost: 2027 cycles
average parallel startup cost: 2340 cycles
per item cost sequential: 362 cycles
per item cost parallel: 210 cycles
parallel break-even length: 3

-------------- Mix_IRC ⋆ Mix_IRC --------------
average sequential startup cost: 2027 cycles
average parallel startup cost: 2340 cycles
per item cost sequential: 647 cycles
per item cost parallel: 340 cycles
parallel break-even length: 2

All functions were faster when executed in parallel. The per-item costs were
considerably higher (in part due to the 32-bit Linux), but the parallel
start-up penalty of (2340 - 2027) = 313 cycles was smaller.

Note that the start-up costs vary considerably between different runs of
the same benchmark on the same machine. Therefore you should run several
measurements before trusting the results.

4.5 Other observations
----------------------

Another, somewhat unexpected, observation made when looking at several
benchmarks is that the optimal number of cores seems to be N-1 (i.e. not N)
on an N-core CPU.

The reason is that regular operating system activities such as timer
interrupts occur during the APL execution. The threads used by GNU APL get
an almost even amount of work, and they are running at 100% load unless the
interpreter is blocked on user input. [This is a consequence of using a
fetch-and-add function instead of semaphores for synchronizing the threads.]

If all N cores of a CPU are used by APL, then the OS schedules its own
activities on one of the busy cores (and most likely on the same core all
the time). That core then takes more time than the others, and the entire
computation of the APL function is delayed accordingly.

If, however, only N-1 of the N cores are used, then the OS sees an idle core
and will most likely perform its activities on that core. These effects are
best visualized when using the plotting option of ScalarBenchmark.apl.

5. Summary
==========

The above discussion shows that some speedup can be achieved by parallelizing
scalar APL functions. However, the speedup is not as significant as a naive
observer might expect (i.e. B0÷Bc = c), and it varies considerably between
OS variants and hardware platforms.

While the start-up costs of the current implementation in GNU APL look good
compared to other approaches, the per-item costs are somewhat disappointing
on some machines (and this is due to effects outside GNU APL). A corollary
of this could be that APL operations with little computational complexity
(such as A⍴B) will suffer as well and will not show good speedups when
executed in parallel.

In any case it is worthwhile to fine-tune the break-even points of the
different scalar functions.
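
To put these numbers in perspective, the per-item speedup factors B0÷Bc can
be computed directly from the two examples above (an illustration; ideal
scaling would give factors of 3 and 2 respectively):

    61 ÷ 95        ⍝ + on the i5-4570 (3 cores): ≈ 0.64, slower per item
    147 ÷ 91       ⍝ ⋆ on the i5-4570 (3 cores): ≈ 1.6
    362 ÷ 210      ⍝ + on the E6550 (2 cores):   ≈ 1.7
    647 ÷ 340      ⍝ ⋆ on the E6550 (2 cores):   ≈ 1.9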