nbody - Unrolling AELEM loops to AELEMFAST
------------------------------------------

In the [first
part](http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html)
I showed some problems and possibilities of the B::C compiler and the
B::CC optimizing compiler with a regexp example that was hard to
optimize.

In the [second
part](http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html)
I got run times twice as fast with the B::CC compiler on the
[nbody](http://shootout.alioth.debian.org/u32/performance.php?test=nbody) benchmark, which does a lot of arithmetic.

Two open problems were detected: slow function calls and slow array accesses.

First I inlined the most frequently called function, `sub advance`,
which is called N times, N being 5,000, 50,000 or 50,000,000.

    for (1..$n) {
        advance(0.01);
    }

The runtime with N=50,000,000 went from 22m13.754s down to 21m48.015s,
only 25s less. This is not what I wanted:
php and jruby are at 12m and 9m. So it is not slow function calls,
it is slow array access.
Inspecting the opcodes shows that a lot of AELEM ops are used for
reading and writing arrays.

AELEM checks for lvalue invocation and several more flags, all of which
are already known at compile time. A fast version, AELEMFAST, already
exists, but it only operates on literal constant indices known at
compile time. The index is then stored at compile time in the
`op_private` field of the op.

So instead of

    for (my $j = $i + 1; $j < $last + 1; $j++) {
      # inner-loop $j..4
      $dx = $xs[$i] - $xs[$j];
      $dy = $ys[$i] - $ys[$j];
      $dz = $zs[$i] - $zs[$j];
      ...

one can generate a macro-like string in which the array indices are
expanded to literals, and generate the final function from this string
with `eval`.

Array accesses: `$a[const]` is optimized to AELEMFAST, `$a[$lexical]` is not.
So unroll the loop in macro-like fashion:

    $energy = '
    sub energy
    {
      my $e = 0.0;
      my ($dx, $dy, $dz, $distance);';
    for my $i (0 .. $last) {
      $energy .= "
        # outer-loop $i..4
        \$e += 0.5 * \$mass[$i] *
              (\$vxs[$i] * \$vxs[$i] + \$vys[$i] * \$vys[$i] + \$vzs[$i] * \$vzs[$i]);
    ";
      for (my $j = $i + 1; $j < $last + 1; $j++) {
        $energy .= "
        # inner-loop $j..4
        \$dx = \$xs[$i] - \$xs[$j];
        \$dy = \$ys[$i] - \$ys[$j];
        \$dz = \$zs[$i] - \$zs[$j];
        \$distance = sqrt(\$dx * \$dx + \$dy * \$dy + \$dz * \$dz);
        \$e -= (\$mass[$i] * \$mass[$j]) / \$distance;";
      }
    }
    $energy .= '
      return $e;
    }';
    eval $energy; die if $@;

Every `$i` and `$j` got expanded into a literal, 0 .. 4.

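The same trick in miniature, as a self-contained sketch (a hypothetical dot product, not from the benchmark): build the sub as a string with the loop variable interpolated into literal indices, then `eval` it.

```perl
# Hypothetical example of string-eval loop unrolling: $i is
# interpolated into a literal constant index, so each access can
# compile to AELEMFAST instead of a generic AELEM.
our @xs = (1, 2, 3, 4, 5);
our @ys = (5, 4, 3, 2, 1);

my $last = 4;
my $code = "sub dot {\n  my \$sum = 0.0;\n";
for my $i (0 .. $last) {
    $code .= "  \$sum += \$xs[$i] * \$ys[$i];\n";
}
$code .= "  return \$sum;\n}\n";
eval $code; die $@ if $@;

print dot(), "\n";    # prints 35
```

Printing the generated `$code` shows five plain statements with the indices 0..4 already expanded.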
I did this loop unrolling for the three functions, and the results
were impressive. It is a nice little macro trick which you can also
use in normal, uncompiled perl code. With compiled code the
loop unrolling should happen automatically.

Full code here: [nbody.perl-2.perl](https://github.com/rurban/shootout/commit/62b216756320e8c224eef2c933326924ab73c18a)

Original:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl 50000
    script/perlcc: c time: 0.380353
    script/perlcc: cc time: 0.967525
    -0.169075164
    -0.169078071
    script/perlcc: r time: 2.214327

Unrolled:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl-2.perl 50000
    script/perlcc: c time: 0.448817
    script/perlcc: cc time: 2.167499
    -0.169075164
    -0.169078071
    script/perlcc: r time: 1.341283

Again almost **2x faster!**

For comparison, the same effect uncompiled:

    $ time perl ../shootout/bench/nbody/nbody.perl 50000
    -0.169075164
    -0.169078071

    real	0m3.650s
    user	0m3.644s
    sys	0m0.000s

Unrolled:

    $ time perl ../shootout/bench/nbody/nbody.perl-2.perl 50000
    -0.169075164
    -0.169078071

    real	0m2.399s
    user	0m2.388s
    sys	0m0.004s

So we went from **3.6s** down to **2.4s** uncompiled, and down to **1.3s** compiled.

With N=50,000,000 we got **14m12.653s** uncompiled and **7m11.3597s**
compiled. Close to jruby, even though every array access still goes
through the `av_fetch` function, magic is checked, and undefined
indices are autovivified.


Generalization
--------------

The macro code above looks pretty unreadable, similar to lisp macros,
with its mix of quoted and unquoted variables.  The compiler needs to
detect unrollable loop code which will lead to more constants and
AELEMFAST ops. And we had better define a helper function for easier
generation of such unrolled loops.

    # unquote local vars
    sub qv {
      my ($s, $env) = @_;
      # expand our local loop vars
      $s =~ s/(\$\w+?)\b/exists($env->{$1}) ? $env->{$1} : $1/sge;
      $s
    }

    $energy = '
    sub energy
    {
      my $e = 0.0;
      my ($dx, $dy, $dz, $distance);';
    for my $i (0 .. $last) {
      my $env = {'$i'=>$i,'$last'=>$last};
      $energy .= qv('
        # outer-loop $i..4
        $e += 0.5 * $mass[$i] *
              ($vxs[$i] * $vxs[$i] + $vys[$i] * $vys[$i] + $vzs[$i] * $vzs[$i]);', $env);
      for (my $j = $i + 1; $j < $last + 1; $j++) {
        $env->{'$j'} = $j;
        $energy .= qv('
          # inner-loop $j..4
          $dx = $xs[$i] - $xs[$j];
          $dy = $ys[$i] - $ys[$j];
          $dz = $zs[$i] - $zs[$j];
          $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
          $e -= ($mass[$i] * $mass[$j]) / $distance;', $env);
      }
    }
    $energy .= '
      return $e;
    }';
    eval $energy; die if $@;

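As a quick sanity check, the `qv()` helper can be exercised on its own (hypothetical input string; the definition is repeated so the snippet is self-contained): variables found in `$env` are substituted, everything else is left untouched.

```perl
# qv() from above: substitute loop variables found in $env,
# leave all other variables alone.
sub qv {
  my ($s, $env) = @_;
  $s =~ s/(\$\w+?)\b/exists($env->{$1}) ? $env->{$1} : $1/sge;
  $s
}

my $env = { '$i' => 0, '$j' => 1 };
print qv('$dx = $xs[$i] - $xs[$j];', $env), "\n";
# prints: $dx = $xs[0] - $xs[1];
```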
This now looks much better and, placed in a BEGIN block, incurs only a
negligible run-time penalty.
Full code here: [nbody.perl-2a.perl](https://github.com/rurban/shootout/commit/c35bb85ed84941157eb01b7ca844d3b4472e0df3)

I also tried a generic `unroll_loop()` function, but it was a bit too
unstable in finding the end of the loop blocks at the source level, and
`qv()` looked good enough. The compiler can use the optree to find the
optimization.


Types and autovivification
--------------------------

A naive optimization would check the index ranges beforehand and access
the array values directly. This is what the type optimizer for arrays
would do:

    my (num @xs[4],  num @ys[4],  num @zs[4]);
    my (num @vxs[4], num @vys[4], num @vzs[4]);
    my num @mass[4];

Instead of `$xs[0] * $xs[1]`, which compiles to AELEMFASTs, currently
inlined by B::CC as:

    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 0, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 1, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    rnv0 = POPn; lnv0 = POPn;       /* multiply */
    d30_tmp = lnv0 * rnv0;

It should compile to:

    d30_tmp = (double)AvARRAY(PL_curpad[6])[0] *
              (double)AvARRAY(PL_curpad[6])[1];

With the size declaration you can omit the `av_fetch()` call and the
undef check ("autovivification"); with the type `num` you do not need
to go through `SvNV` for the array element, since the value is stored
directly, and the type also guarantees that there is no magic to be
checked. So `AvARRAY(PL_curpad[6])[0]` would return a double directly.

The stack handling (PUSH, PUSH, POP, POP) can also be optimized away,
since the ops are already inlined. That would get us close to an
optimizing compiler as for Haskell, Lua, PyPy or LISP, though not
close to Go or Java, as those languages are stricter.

I tried a simple B::CC AELEMFAST optimization together with "no
autovivification", which does not yet eliminate superfluous PUSH/POP
pairs but could be applied to typed arrays, and it leads to another
2x win:

2.80s down to 1.67s on a slower PC with N=50,000.

Compiled to *(perlcc /2a)*:

    PUSHs(AvARRAY(PL_curpad[6])[0]);
    PUSHs(AvARRAY(PL_curpad[6])[1]);
    rnv0 = POPn; lnv0 = POPn;       /* multiply */
    d30_tmp = rnv0 * lnv0;

Without the superfluous PUSH/POP pairs I suspect another 2x win, but
this is not implemented yet. With typed arrays maybe another 50% win,
and we would not need the "no autovivification" overhead.

It should look like *(perlcc /2b)*:

    rnv0 = SvNV(AvARRAY(PL_curpad[6])[0]);
    lnv0 = SvNV(AvARRAY(PL_curpad[6])[1]);
    d30_tmp = rnv0 * lnv0;          /* multiply */

I am currently implementing the check for the 'no autovivification'
pragma and the stack optimizations.

Summary
-------

[u64q nbody](http://shootout.alioth.debian.org/u64q/performance.php?test=nbody)

Original numbers with N=50,000,000:

    * Fortran        14.09s
    * C              20.72s
    * Go             32.11s
    * SBCL           42.75s
    * Javascript V8  44.78s - 82.49s
    * JRuby           8m
    * PHP            11m
    * Python 3       16m
    * Perl           23m
    * Ruby 1.9       26m

My numbers with N=50,000,000:

    * Perl       22m14s
    * Perl 1     21m48s         (inlined sub advance, no ENTERSUB/LEAVESUB)
    * perlcc      9m52s
    * Perl 2     14m13s         (unrolled loop, AELEM => AELEMFAST)
    * perlcc 2    7m11s
    * perlcc 2a   4m52s         (no autovivification, 4.5x faster)
    * perlcc 2b  ? (~2m30s)     (no autovivification + stack opt)
    * perlcc 2c  ? (~1m25s)     (typed arrays + stack opt)
