nbody - Unrolling AELEM loops to AELEMFAST
------------------------------------------

In the [first
part](http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html)
I showed some problems and possibilities of the B::C compiler and the
B::CC optimizing compiler with a regexp example which was very hard to
optimize.

In the [second
part](http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html)
I got 2 times faster run-times with the B::CC compiler on the
[nbody](http://shootout.alioth.debian.org/u32/performance.php?test=nbody)
benchmark, which does a lot of arithmetic.

Two open problems were detected: slow function calls and slow array accesses.

First I inlined the most frequently called function, `sub advance`,
which was called N times, N being 5,000, 50,000 or 50,000,000.

    for (1..$n) {
      advance(0.01);
    }

The runtime with N=50,000,000 went from 22m13.754s down to 21m48.015s,
25s less. This is not what I wanted: PHP and JRuby are at 12m and 9m.
So it is not slow function calls, it is slow array access.
Inspecting the opcodes shows that a lot of AELEM ops are used for
reading and writing arrays.

AELEM checks for lvalue invocation and several more flags which are
already known at compile-time. A fast version, AELEMFAST, exists
already, but it only operates on literal constant indices known at
compile-time; the index is then stored at compile-time in the
op->private field.

So instead of

    for (my $j = $i + 1; $j < $last + 1; $j++) {
      # inner-loop $j..4
      $dx = $xs[$i] - $xs[$j];
      $dy = $ys[$i] - $ys[$j];
      $dz = $zs[$i] - $zs[$j];
      ...

one could generate a macro-like string which just evals the array
indices and generate the final function from this string.

Array accesses: `$a[const]` are optimized to AELEMFAST, `$a[$lexical]` not.
So unroll the loop in macro-like fashion.
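Which op perl actually emits can be checked with B::Concise; a quick
sketch (the exact op dump format varies slightly between perl versions):

```shell
# Constant index: perl upgrades the access to a single aelemfast op
perl -MO=Concise,-exec -e '$x = $a[3]' | grep aelemfast

# Lexical index: perl keeps the slower, generic aelem op
perl -MO=Concise,-exec -e 'my $i; $x = $a[$i]' | grep ' aelem '
```

If the first command prints an `aelemfast` line and the second only a
plain `aelem` line, the constant-index optimization applies on your perl.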
    $energy = '
    sub energy
    {
        my $e = 0.0;
        my ($dx, $dy, $dz, $distance);';
    for my $i (0 .. $last) {
        $energy .= "
        # outer-loop $i..4
        \$e += 0.5 * \$mass[$i] *
          (\$vxs[$i] * \$vxs[$i] + \$vys[$i] * \$vys[$i] + \$vzs[$i] * \$vzs[$i]);
        ";
        for (my $j = $i + 1; $j < $last + 1; $j++) {
            $energy .= "
        # inner-loop $j..4
        \$dx = \$xs[$i] - \$xs[$j];
        \$dy = \$ys[$i] - \$ys[$j];
        \$dz = \$zs[$i] - \$zs[$j];
        \$distance = sqrt(\$dx * \$dx + \$dy * \$dy + \$dz * \$dz);
        \$e -= (\$mass[$i] * \$mass[$j]) / \$distance;";
        }
    }
    $energy .= '
        return $e;
    }';
    eval $energy; die if $@;

Every `$i` and `$j` got expanded into a literal, 0 .. 4.

I did this loop unrolling for the three functions, and the results
were impressive. It is a nice little macro trick which you could also
use for normal, uncompiled perl code. With compiled code the loop
unrolling should happen automatically.

Full code here: [nbody.perl-2.perl](https://github.com/rurban/shootout/commit/62b216756320e8c224eef2c933326924ab73c18a)

Original:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl 50000
    script/perlcc: c time: 0.380353
    script/perlcc: cc time: 0.967525
    -0.169075164
    -0.169078071
    script/perlcc: r time: 2.214327

Unrolled:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl-2.perl 50000
    script/perlcc: c time: 0.448817
    script/perlcc: cc time: 2.167499
    -0.169075164
    -0.169078071
    script/perlcc: r time: 1.341283

Another **2x faster!**

For comparison, the same effect uncompiled:

    $ time perl ../shootout/bench/nbody/nbody.perl 50000
    -0.169075164
    -0.169078071

    real    0m3.650s
    user    0m3.644s
    sys     0m0.000s

Unrolled:

    $ time perl ../shootout/bench/nbody/nbody.perl-2.perl 50000
    -0.169075164
    -0.169078071

    real    0m2.399s
    user    0m2.388s
    sys     0m0.004s

So we went from **3.6s** down to **2.4s**, and compiled to **1.3s**.

With N=50,000,000 we got **14m12.653s** uncompiled and **7m11.3597s**
compiled. Close to JRuby, even though array access still goes through
the `av_fetch` function, magic is checked and undefined indices are
autovivified.


Generalization
--------------

The above macro code looks pretty unreadable, similar to lisp macros,
with its mix of quoted and unquoted variables. The compiler needs to
detect unrollable loop code which will lead to more constants and
AELEMFAST ops. And we had better define a helper function for easier
generation of such unrolled loops.

    # unquote local vars
    sub qv {
        my ($s, $env) = @_;
        # expand our local loop vars
        $s =~ s/(\$\w+?)\b/exists($env->{$1}) ? $env->{$1} : $1/sge;
        $s
    }

    $energy = '
    sub energy
    {
        my $e = 0.0;
        my ($dx, $dy, $dz, $distance);';
    for my $i (0 .. $last) {
        my $env = {'$i'=>$i, '$last'=>$last};
        $energy .= qv('
        # outer-loop $i..4
        $e += 0.5 * $mass[$i] *
          ($vxs[$i] * $vxs[$i] + $vys[$i] * $vys[$i] + $vzs[$i] * $vzs[$i]);', $env);
        for (my $j = $i + 1; $j < $last + 1; $j++) {
            $env->{'$j'} = $j;
            $energy .= qv('
        # inner-loop $j..4
        $dx = $xs[$i] - $xs[$j];
        $dy = $ys[$i] - $ys[$j];
        $dz = $zs[$i] - $zs[$j];
        $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
        $e -= ($mass[$i] * $mass[$j]) / $distance;', $env);
        }
    }
    $energy .= '
        return $e;
    }';
    eval $energy; die if $@;

This now looks much better, and in a BEGIN block it incurs only a
negligible run-time penalty.
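To see the `qv()` trick in isolation, here is a minimal self-contained
sketch; the dot-product toy function and its array names are made up
for this example and are not part of the benchmark:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# unquote local vars, as in the article's helper
sub qv {
    my ($s, $env) = @_;
    # expand our local loop vars
    $s =~ s/(\$\w+?)\b/exists($env->{$1}) ? $env->{$1} : $1/sge;
    $s
}

our (@xs, @ys);

# Generate an unrolled dot product: every $i becomes a literal 0..3,
# so each array access compiles to AELEMFAST instead of AELEM.
my $last = 3;
my $code = '
sub dot {
    my $sum = 0.0;';
for my $i (0 .. $last) {
    $code .= qv('
    $sum += $xs[$i] * $ys[$i];', { '$i' => $i });
}
$code .= '
    return $sum;
}';
eval $code; die $@ if $@;

@xs = (1, 2, 3, 4);
@ys = (5, 6, 7, 8);
print dot(), "\n";    # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

Only the loop variables listed in `$env` are substituted; everything
else (`$sum`, `$xs`, `$ys`) passes through unchanged into the eval'd
string.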
Full code here: [nbody.perl-2a.perl](https://github.com/rurban/shootout/commit/c35bb85ed84941157eb01b7ca844d3b4472e0df3)

I also tried a generic `unroll_loop()` function, but it was a bit too
unstable at finding the end of the loop blocks on the source level, and
`qv()` looked good enough. The compiler can use the optree to find the
optimization.


Types and autovivification
--------------------------

A naive optimization would check the index ranges beforehand and access
the array values directly. Something the type optimizer for arrays
would do.

    my (num @xs[4], num @ys[4], num @zs[4]);
    my (num @vxs[4], num @vys[4], num @vzs[4]);
    my num @mass[4];

And instead of `$xs[0] * $xs[1]`, which compiles to AELEMFASTs,
currently inlined by B::CC to:

    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 0, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 1, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    rnv0 = POPn; lnv0 = POPn; /* multiply */
    d30_tmp = lnv0 * rnv0;

it should compile to:

    d30_tmp = (double)AvARRAY(PL_curpad[6])[0] *
              (double)AvARRAY(PL_curpad[6])[1];

With the size declaration you can omit the `av_fetch()` call and the
undef check ("autovivification"); with the type `num` you do not need
to go through the `SvNV` of the array element, as the value is stored
directly, and the type also guarantees that there is no magic to be
checked. So `AvARRAY(PL_curpad[6])[0]` would return a double.

And the stack handling (PUSH, PUSH, POP, POP) can also be optimized
away, since the ops are inlined already. That would get us close to
an optimizing compiler as with Haskell, Lua, PyPy or LISP.
Not close to Go or Java, as their languages are stricter.

I tried a simple B::CC AELEMFAST optimization together with "no
autovivification" which does not yet eliminate superfluous PUSH/POP
pairs but could be applied to typed arrays, and it leads to another
2x win: 2.80s down to 1.67s on a slower PC with N=50,000.

Compiled to *(perlcc /2a)*:

    PUSHs(AvARRAY(PL_curpad[6])[0]);
    PUSHs(AvARRAY(PL_curpad[6])[1]);
    rnv0 = POPn; lnv0 = POPn; /* multiply */
    d30_tmp = rnv0 * lnv0;

Without superfluous PUSH/POP pairs I suspect another 2x win, but this
is not implemented yet. With typed arrays maybe another 50% win, and we
don't need the "no autovivification" overhead.

It should look like *(perlcc /2b)*:

    rnv0 = SvNV(AvARRAY(PL_curpad[6])[0]);
    lnv0 = SvNV(AvARRAY(PL_curpad[6])[1]);
    d30_tmp = rnv0 * lnv0; /* multiply */

I'm currently implementing the check for the 'no autovivification'
pragma and the stack optimizations.

Summary
-------

[u64q nbody](http://shootout.alioth.debian.org/u64q/performance.php?test=nbody)

Original numbers with N=50,000,000:

* Fortran 14.09s
* C 20.72s
* Go 32.11s
* SBCL 42.75s
* Javascript V8 44.78s - 82.49s
* JRuby 8m
* PHP 11m
* Python 3 16m
* Perl 23m
* Ruby 1.9 26m

My numbers with N=50,000,000:

* Perl 22m14s
* Perl 1 21m48s (inlined sub advance, no ENTERSUB/LEAVESUB)
* perlcc 9m52s
* Perl 2 14m13s (unrolled loop + AELEM => AELEMFAST)
* perlcc 2 7m11s
* perlcc 2a 4m52s (no autovivification, 4.5x faster)
* perlcc 2b ? (~2m30s) (no autovivification + stack opt)
* perlcc 2c ? (~1m25s) (typed arrays + stack opt)