1This file contains a random collection of the various hacks we did (or 2did not dare to do) in order to get performance. It is not intended 3to be complete nor up to date, but it may be instructive for people to 4read (see also my (Matteo Frigo's) slides on the web page). 5 61) Negative constants: 7 8 a = 0.5 * b; a = 0.5 * b; 9 c = 0.5 * d; c = -0.5 * d; 10 e = 1.0 + a; vs. e = 1.0 + a; 11 f = 1.0 - c; f = 1.0 + c; 12 13 The code on the left is faster. Guess why? The constants 14 0.5 and -0.5 are stored in different memory locations and loaded 15 separately. 16 172) Twiddle factors: 18 19 For some reason that escapes me, every package I have seen 20 (including FFTW 1.0) stores the twiddle factors in such 21 a way that the codelets need to access memory in a random 22 fashion. It is much faster to just store the factors in the 23 order they will be needed. This increased spatial locality 24 exploits caches and improves performance by about 30% for big 25 arrays. 26 273) Loops: 28 29 You may notice that the `twiddle' codelets contain a loop. 30 This is the most logical choice, as opposed to moving the loop 31 outside the codelet. For example, one would expect that 32 33 for (i = 0; i < n; ++i) 34 foo(); 35 36 be slower than the case where the loop is inside foo(). This is 37 usually the case, *except* for the ultrasparc. FFTW would be 10% 38 faster (more or less) if the loop were outside the codelet. 39 Unfortunately, it would be slower everywhere else. If you want to 40 play with this, the codelet generator can be easily hacked... 41 424) Array padding: 43 44 (This is a disgusting hack that my programmer's ethic 45 pervents me from releasing to the public). 46 47 On the IBM RS/6000, for big n, the following **** is *way* faster: 48 49 - Take the input array 50 - `inflate' it, that is, produce a bigger array in which every 51 k-th element is unused (say k=512). In other words, insert 52 holes in the input. 53 - execute fftw normally (properly hacked to keep the holes into 54 account) 55 - compact the array. 56 57 With this hack, FFTW is sensibly faster than IBM's ESSL library. 58 Sorry guys, I don't want to be responsible for releasing this 59 monster (this works only on the RS/6000, fortunately). 60 615) Phase of moon: 62 63 Don't take the numbers on the web page too seriously. The 64 architectural details of modern machines make performance 65 *extremely* sensitive to little details such as where your code and 66 data are placed in memory. The ultimate example of brain damage is 67 the Pentium Pro (what a surprise...), where we had a case in which 68 adding a printf() statement to FFTW slowed down another completely 69 unrelated fft program by a factor of 20. 70 71 Our strategy to generate huge pieces of straight-line code should 72 immunize FFTW sufficiently well against these problems, but you are 73 still likely to observe weird things like: FFTW_ESTIMATE can be 74 faster than FFTW_MEASURE. 75 76 FFTW-17.0 will compute the phase of the moon in the planner and take 77 it into account. 78 796) Estimator: 80 81 The estimator tries to come up with a `reasonable' plan. 82 Unfortunately, this concept is very machine-dependent. The 83 estimator in the release is tuned for the UltraSPARC. 84 85 The following two parameters can be found in fftw/planner.c : 86 87 #define NOTW_OPTIMAL_SIZE 32 88 #define TWIDDLE_OPTIMAL_SIZE 12 89 90 They say: use non-twiddle codelets of size close to 32, and twiddle 91 codelets of size close to 12. More or less, these are the most 92 efficient pieces of code on the UltraSPARC. Your mileage *will* 93 vary. Here are some suggestions for other machines. 94 95 Pentium 96 97 #define NOTW_OPTIMAL_SIZE 8 (or 16) 98 #define TWIDDLE_OPTIMAL_SIZE 8 99 100 Pentium Pro: 101 102 #define NOTW_OPTIMAL_SIZE 32 (or 16) 103 #define TWIDDLE_OPTIMAL_SIZE 2 104 105 106 RS/6000: 107 108 #define NOTW_OPTIMAL_SIZE 32 109 #define TWIDDLE_OPTIMAL_SIZE 32 110 111 The basic idea is: compute some plans for sizes you most care 112 about, print the plan and plug in the numbers that appear more 113 often (or some kind of average). Note the dramatic difference 114 between Pentium an Pentium Pro. (NO LONGER TRUE WITH --enable-i386-hacks) 115 1167) Stack alignment: 117 118 Pentium-type processors impose a huge performance penalty if double- 119 precision values are not aligned to 8-byte boundaries in memory. 120 (We have seen factors of 3 or more in tight loops.) Unfortunately, 121 the Intel ABI specifies that variables on the stack need only be aligned to 122 4-byte boundaries. Even more unfortunately, this convention is followed 123 by Linux/x86 and gcc. 124 125 To get around this, we wrote special macros (HACK_ALIGN_STACK_EVEN 126 and HACK_ALIGN_STACK_ODD) that take effect when FFTW is compiled 127 with gcc/egcs and 'configure --enable-i386-hacks' is used. These 128 macros are called before the computation-intensive "codelets" 129 of FFTW. They use the gcc __builtin_alloca function to examine the 130 stack pointer and align the stack to an even or odd 4-byte boundary, 131 depending upon how many values will be pushed on the stack to call 132 the codelet. 133