GNU APL supports, to some extent, execution of APL programs on multiple
CPU cores in parallel.


1. Supported Platforms
======================

As of SVN 484 parallel execution in GNU APL requires the following:

a.	POSIX thread (pthread) support

b.	functions pthread_getaffinity_np() and pthread_setaffinity_np() to
	set core affinities (i.e. which thread runs on which CPU core) of the
	threads created by GNU APL for parallel execution.

c.	an atomic fetch-and-add function

While a. is normally available on all platforms supported by GNU APL, b. and
c. may not be. Recent Linux versions provide a., b., and c. Support for
other platforms may be added later.
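
To illustrate b. and c., the following is a minimal C sketch (not the actual
GNU APL code; the thread count and core number are made up) of pinning a
worker thread to a core and of an atomic fetch-and-add on a shared counter:

    /* compile with: gcc -pthread sketch.c */
    #define _GNU_SOURCE             /* for pthread_setaffinity_np() */
    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;         /* shared work counter */

    static void * worker(void * arg)
    {
       /* c. atomic fetch-and-add: every call returns a distinct value */
       int mine = __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
       printf("worker got item %d\n", mine);
       return 0;
    }

    int main(void)
    {
       pthread_t thread;
       pthread_create(&thread, 0, worker, 0);

       /* b. pin the new thread to core 1 (an arbitrary example core) */
       cpu_set_t cpus;
       CPU_ZERO(&cpus);
       CPU_SET(1, &cpus);
       pthread_setaffinity_np(thread, sizeof(cpus), &cpus);

       pthread_join(thread, 0);
       return 0;
    }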

2. ./configure Options
======================

Parallel execution is currently experimental and must be ./configure'd
explicitly (see also README-2-configure). There are two ./configure options
related to parallel execution:

   CORE_COUNT_WANTED and
   PERFORMANCE_COUNTERS_WANTED

The first option, CORE_COUNT_WANTED, controls how many cores shall be used
for parallel execution, while the second option, PERFORMANCE_COUNTERS_WANTED,
enables performance counters in GNU APL. Performance counters are needed for
fine-tuning the parallelism as explained in the following.

There are also two top-level make targets, 'parallel' and 'parallel1':
'make parallel' calls ./configure with CORE_COUNT_WANTED=syl, while
'make parallel1' calls ./configure with CORE_COUNT_WANTED=syl and
PERFORMANCE_COUNTERS_WANTED=yes.
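
In other words, 'make parallel1' amounts to something like:

    ./configure CORE_COUNT_WANTED=syl PERFORMANCE_COUNTERS_WANTED=yes
    make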

3. Parallelized Functions
=========================

As of SVN 484 the following APL functions were parallelized:

a. all monadic and dyadic scalar functions
b. inner products of all dyadic scalar functions
c. outer products of all dyadic scalar functions

More APL functions may be parallelized later.


4. Performance Measurements and Fine-Tuning
===========================================

The first benchmarks performed with the parallelized scalar functions have
shown that 'light-weight' scalar functions such as +, -, or ⌈ may execute
faster than their sequential counterparts on one platform but slower on
another. The difference between platforms can be dramatic.

For that reason GNU APL provides a way to fine-tune the break-even points
at which parallel execution is chosen over sequential execution. There is
a break-even point for every monadic and for every dyadic scalar function.


4.1 Theory
----------

The sequential execution time of a scalar function is:

	A0 + (B0 × N) cycles,

where A0 is the 'start-up time' for the function, B0 is the 'per-item time'
for the function, and N is the vector length. The actual shape does not matter
for scalar functions, so adding 1 2 3 4 to itself takes the same time as
adding 2 2⍴1 2 3 4 to itself.

The sequential start-up time A0 is the same for all monadic scalar functions
and for all dyadic scalar functions; the start-up time for dyadic functions
is a little longer than for monadic scalar functions.

Now let c be the number of cores. The parallel execution time on c cores
is then:

	Ac + (Bc × N) cycles,

at least when N is a multiple of c. The difference when N is not a multiple
of c can be neglected. Again the parallel start-up time Ac is the same for
all monadic scalar functions and for all dyadic scalar functions. And Ac ≥ A0;
the difference (Ac - A0) is the additional start-up and join time for the
thread pool computing the function in parallel.

From the 4 numbers A0, B0, Ac, and Bc one can compute the break-even point BE,
which is the vector length at which parallel execution becomes faster than
sequential execution. The break-even point is:

	BE ← (Ac - A0) ÷ (B0 - Bc)

The break-even point only exists if Bc < B0, that is if the parallel per-item
cost Bc is smaller than the sequential per-item cost B0. On some
architectures Bc = B0 ÷ c (and then (Ac - A0) is the only concern), but
unfortunately for multi-core CPUs with shared memory this is not the case.
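
As a worked example, with the numbers measured for dyadic ⋆ in section 4.3
below (A0 = 58, Ac = 690, B0 = 147, and Bc = 91 cycles), the break-even point
is BE = (690 - 58) ÷ (147 - 91) = 632 ÷ 56 ≈ 11.3, so vectors of length 12 or
more execute faster in parallel; this matches the break-even length of 12
reported there.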

4.2 Measuring A0, Ac, B0, and Bc
--------------------------------

There is a workspace 'ScalarBenchmark.apl' shipped with GNU APL that measures
the parameters A0, Ac, B0, and Bc and writes a file 'parallel_thresholds' that
can be fed back to GNU APL in order to set the break-even points for all
scalar functions. The workspace can also plot the results using aplplot.

The workspace is used as follows:

a.	'make parallel1' as discussed above
b.	adjust some parameters at the beginning of ScalarBenchmark.apl (number
	of cores, plotting yes/no, and loop counts)
c.	run, e.g.: src/apl -f workspaces/ScalarBenchmark.apl

The last step c. creates a file 'parallel_thresholds' in the current directory.
If that file is placed in one of the directories used for the 'preferences'
file (/etc/gnu-apl.d/ or /usr/local/etc/gnu-apl.d/ depending on the ./configure
prefix, or $HOME/.gnu-apl.d, or $HOME/.config/gnu-apl.d) then GNU APL reads
it on start-up and sets the break-even points for the scalar functions
accordingly.
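
For a per-user installation this could look like (paths as listed above):

    src/apl -f workspaces/ScalarBenchmark.apl
    mkdir -p $HOME/.gnu-apl.d
    cp parallel_thresholds $HOME/.gnu-apl.d/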

'ScalarBenchmark.apl' first determines A0 and Ac using small vector
lengths N. With today's CPUs the measurement results vary considerably;
therefore several iterations are performed for each vector length and the
minimum over all iterations is taken. From these minima the least-squares
regression line over all lengths is computed.

Then 'ScalarBenchmark.apl' determines the per-item costs B0 and Bc using a
fairly large vector length N.
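
For illustration, the kind of fit involved could look like the following C
sketch (a hypothetical helper, not code from the workspace): given measured
pairs (N, cycles), the regression line cycles = A + B×N yields the start-up
cost A as its intercept and the per-item cost B as its slope.

    #include <stdio.h>

    /* least-squares fit of  cycles = A + B*N  over n measured pairs */
    static void fit_line(const double * N, const double * cycles, int n,
                         double * A, double * B)
    {
       double sx = 0, sy = 0, sxx = 0, sxy = 0;
       for (int i = 0; i < n; ++i)
           {
             sx  += N[i];         sy  += cycles[i];
             sxx += N[i] * N[i];  sxy += N[i] * cycles[i];
           }

       *B = (n*sxy - sx*sy) / (n*sxx - sx*sx);   /* slope: per-item cost */
       *A = (sy - *B * sx) / n;                  /* intercept: start-up  */
    }

    int main(void)
    {
       /* made-up minima for vector lengths 1..4 */
       double N[]      = { 1,   2,   3,   4   };
       double cycles[] = { 120, 180, 245, 300 };
       double A, B;
       fit_line(N, cycles, 4, &A, &B);
       printf("start-up A = %.1f cycles, per-item B = %.1f cycles\n", A, B);
       return 0;
    }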

4.3 Example 1: Intel 4-core i5-4570 CPU using 3 cores on 64-bit Linux
---------------------------------------------------------------------

We discuss only two dyadic functions, + and ⋆. + is 'light-weight' and ⋆
is not. The summary shown by 'ScalarBenchmark.apl' for + and ⋆ is:

-------------- + Mix_IRC --------------
average sequential startup cost:      58 cycles
average parallel startup cost:       690 cycles
per item cost sequential:             61 cycles
per item cost parallel:               95 cycles
parallel break-even length:          not reached

The good news is that the extra cost for parallel start-up and join of
(690 - 58) = 632 cycles is fairly low. Earlier benchmarks with other,
semaphore-based parallel libraries showed start-up costs in the
10000-20000 cycle ballpark.

The bad news is that the parallel per-item cost for + is higher than the
sequential per-item cost, and therefore there is no break-even point for +.
The per-item cost of 61 cycles is rather close to the values of a separate
benchmark that measured the main memory access times alone (see
tools/memory_benchmark). Our best explanation for not reaching a break-even
point for + (on this machine) is that more extra cycles are caused by
main-memory access conflicts between the different cores than cycles are
saved by parallelizing +.
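
The idea behind such a memory benchmark can be sketched as follows (a toy
version, not the actual tools/memory_benchmark): stream over an array that
is far too large for the caches and compute the average time per item.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ITEMS (16*1024*1024)   /* 128 MB of doubles */

    int main(void)
    {
       double * data = malloc(ITEMS * sizeof(double));
       for (int i = 0; i < ITEMS; ++i)   data[i] = 1.0;

       struct timespec t0, t1;
       clock_gettime(CLOCK_MONOTONIC, &t0);

       double sum = 0;               /* memory-bound, like parallel + */
       for (int i = 0; i < ITEMS; ++i)   sum += data[i];

       clock_gettime(CLOCK_MONOTONIC, &t1);

       double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
       printf("sum=%g: %.2f ns per item\n", sum, ns/ITEMS);
       /* multiply ns per item by the CPU clock in GHz to get cycles */
       free(data);
       return 0;
    }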

For ⋆ the result is different because ⋆ is not as light-weight as +:

-------------- ⋆ Mix_IRC --------------
average sequential startup cost:      58 cycles
average parallel startup cost:       690 cycles
per item cost sequential:            147 cycles
per item cost parallel:               91 cycles
parallel break-even length:           12


4.4 Example 2: Intel 2-core E6550 CPU using 2 cores on 32-bit Linux
-------------------------------------------------------------------

On this older machine the results were rather different:

-------------- Mix_IRC + Mix1_IRC --------------
average sequential startup cost:    2027 cycles
average parallel startup cost:      2340 cycles
per item cost sequential:            362 cycles
per item cost parallel:              210 cycles
parallel break-even length:            3

-------------- Mix_IRC ⋆ Mix_IRC --------------
average sequential startup cost:    2027 cycles
average parallel startup cost:      2340 cycles
per item cost sequential:            647 cycles
per item cost parallel:              340 cycles
parallel break-even length:            2

All functions were faster when executed in parallel. The per-item costs were
considerably higher (in part due to the 32-bit Linux), but the parallel
start-up penalty of (2340 - 2027) = 313 cycles was smaller.

Note that the start-up costs vary considerably between different runs of
the same benchmark on the same machine. Therefore you should run several
measurements before trusting the results.

4.5 Other observations
----------------------

Another, somewhat unexpected observation made when looking at several
benchmarks is that the optimal number of cores seems to be N-1 (i.e. not N)
on an N-core CPU.

The reason is that regular operating system activities such as timer
interrupts occur during the APL execution. The threads used by GNU APL get
an almost even amount of work and run at 100% load unless the interpreter
is blocked on user input. [This is a consequence of using a fetch-and-add
function instead of semaphores for synchronizing the threads.]
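
The following C sketch shows how a shared fetch-and-add counter can hand out
work to a thread pool without any semaphores (illustrative only; the actual
GNU APL scheme may differ, e.g. by handing out blocks of items rather than
single items to amortize the atomic operation):

    #include <pthread.h>
    #include <stdio.h>

    #define ITEMS   1000
    #define THREADS 3

    static int next_item = 0;           /* shared work counter */
    static double Z[ITEMS], B[ITEMS];   /* result and argument */

    static void * worker(void * arg)
    {
       for (;;)
           {
             /* each index is handed out exactly once across all threads */
             int i = __atomic_fetch_add(&next_item, 1, __ATOMIC_SEQ_CST);
             if (i >= ITEMS)   return 0;   /* all work done */
             Z[i] = B[i] + B[i];           /* one item of a scalar function */
           }
    }

    int main(void)
    {
       for (int i = 0; i < ITEMS; ++i)   B[i] = i;

       pthread_t pool[THREADS];
       for (int t = 0; t < THREADS; ++t)
           pthread_create(&pool[t], 0, worker, 0);
       for (int t = 0; t < THREADS; ++t)
           pthread_join(pool[t], 0);      /* join: all items computed */

       printf("Z[%d] = %g\n", ITEMS - 1, Z[ITEMS - 1]);
       return 0;
    }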

If all N cores of a CPU are used by APL then the OS schedules its activities
on one of the busy cores (and most likely the same core all the time).
That core then takes more time than the others and the entire computation
of the APL function is delayed accordingly.

If, however, only N-1 of N cores are used then the OS sees an idle core and
will most likely perform its activities on that core. These effects are best
visualized when using the plotting option of ScalarBenchmark.apl.

5. Summary
==========

The above discussion shows that some speedup can be achieved by parallelizing
scalar APL functions. However, the speedup is not as significant as a naive
observer might expect (i.e. B0÷Bc = c) and varies considerably between OS
variants and hardware platforms.

While the start-up costs of the current implementation in GNU APL look good
compared to other approaches, the per-item costs are somewhat disappointing
on some machines (and this is due to effects outside GNU APL). A corollary
of this could be that APL operations with little computational complexity
(such as A⍴B) will suffer as well and will not show good speedups when
executed in parallel.

In any case it is worthwhile to fine-tune the break-even points of the
different scalar functions.