1FFT Integer Multiplication code 2=============================== 3 4License: BSD 5 6(Note the FLINT library of which this is a part is overall LGPL v2.1+ and the 7latest version of GMP/MPIR on which this currently depends is LGPL v3+. But 8the files in this implementation of the FFT are individually licensed BSD.) 9 10Introduction 11------------ 12 13Many bignum libraries and programming languages do not contain fast code for 14multiplication of huge integers. It is important to have this when computing 15millions of digits of Pi, multiplying polynomials using Kronecker segmentation 16(or the Schoenhage-Strassen technique using an FFT directly) and a variety of 17other problems. 18 19Here we introduce fast FFT code for multiplication of huge integers. 20 21Currently the code depends on GMP/MPIR, however, any sufficiently well 22developed bignum library should have equivalent primitives for bignums. (There 23is also a dependence on the flint function n_revbin, which is found in the 24ulong_extras directory of flint -- I hereby license it under the BSD 25license as per the remainder of the FFT implementation.) 26 27To use the FFT for multiplying large integers, one needs to use the function 28flint_mpn_mul_fft_main as documented in the doc directory. This relies on tuning 29values supplied in fft_tuning.h in the top level source directory. 30 31Features: 32-------- 33 34* Cache friendly up to huge transforms (integers of billions of bits) 35* Truncated -- no uglytwit performance jumps at power of 2 lengths and no problem 36 with unbalanced multiplications 37* Extremely fast 38* Easy to tune 39* Truncated FFT/IFFT functions can be used for polynomial multiplication 40 41Performance Data 42---------------- 43 44Here are timings for multiplication of two integers of the given number of 45bits, comparing MPIR 2.4.0, this code and GMP 5.0.2 respectively. The timings 46are for varying numbers of iterations as specified. The timings were done on 47a 2.2GHz AMD K10-2 Mangy Cours machine. 48 49The tuning values used are specified in the final two columns. 50 51The first part of the table uses mul_truncate_sqrt2, the second half uses 52mul_mfa_truncate_sqrt2. 53 54bits iters mpir this gmp n w 55 56195840 1000 1.149s 1.105s 0.997s 7 16 57261120 1000 1.483s 1.415s 1.396s 7 16 58391296 100 0.261s 0.248s 0.282s 8 8 59521728 100 0.344s 0.315s 0.411s 8 8 60782592 100 0.577s 0.539s 0.628s 9 4 611043456 100 0.706s 0.688s 0.848s 9 4 621569024 100 1.229s 1.153s 1.317s 9 8 63 642092032 100 1.543s 1.440s 2.765s 9 8 653127296 10 0.283s 0.266s 0.408s 11 1 664169728 10 0.357s 0.335s 0.543s 11 1 676273024 10 0.621s 0.597s 0.843s 11 2 688364032 10 0.831s 0.742s 1.156s 11 2 6912539904 10 1.441s 1.394s 1.798s 12 1 7016719872 1 0.230s 0.205s 0.288s 12 1 7125122816 1 0.379s 0.336s 0.434s 12 2 7233497088 1 0.524s 0.428s 0.646s 12 2 7350245632 1 0.833s 0.693s 1.035s 13 1 7466994176 1 1.596s 0.896s 1.358s 13 1 75100577280 1 1.906s 1.552s 2.177s 13 2 76134103040 1 2.784s 2.076s 2.984s 13 2 77201129984 1 3.971s 3.158s 4.536s 14 1 78268173312 1 5.146s 4.137s 5.781s 14 1 79402456576 1 7.548s 6.443s 9.867s 14 2 80536608768 1 9.841s 8.365s 12.71s 14 2 81804913152 1 15.48s 13.29s 20.06s 15 1 821073217536 1 21.17s 17.16s 27.19s 15 1 831610219520 1 31.64s 28.60s 43.37s 15 2 842146959360 1 43.25s 37.02s 57.66s 15 2 853220340736 1 70.14s 58.09s 92.94s 16 1 864293787648 1 96.00s 74.26s 146.1s 16 1 876441566208 1 150.2s 131.1s 217.5s 16 2 888588754944 1 208.4s 175.0s 312.8s 16 2 8912883132416 1 327.4s 278.6s 447.7s 17 1 9017177509888 1 485.0s 360.ss 614.2s 17 1 91 92 93Additional tuning 94----------------- 95 96Technically one should tune the values that appear in fft_tuning.h. The 97mulmod_2expp1 tuning array indices correspond to (n, w) pairs starting 98at n = 12, w = 1. The values in the array should be nonnegative and less 99than 6. The length of the array is given by FFT_N_NUM. The cutoff 100FFT_MULMOD_2EXPP1_CUTOFF should also be tuned. It must be bigger than 101128. The function that these values tunes is in the file mulmod_2expp1.c. 102See the corresponding test function for an example of how to call it. 103 104The fft tuning array indices correspond to (n, w) pairs starting at 105n = 6, w = 1. The values in the array should be nonnegative and less 106than 6. The function that is tuned is in the file mpn_mul_fft_main.c. 107See the corresponding test function for an example of how to call it. 108The function implementation itself is the best reference for which 109inputs will use which table entries. 110 111Strategy 112-------- 113 114Let's suppose we wish to compute a convolution of length 2n where n is a power of 2. We do this with a standard Fermat transform with coefficients mod p = 2^wn + 1. Note 2^w is a 2n-th root of unity. 115 116We assume wn is divisible by GMP_LIMB_BITS (either 32 or 64). In practice n is always divisible by this constant. 117 118Write l = wn/GMP_LIMB_BITS. Each coeff is stored in a block of l+1 limbs in twos complement form. We accumulate carries in the top limb meaning reduction mod p does not need to be done after an addition or subtraction. 119 120Coefficients are also accessed via one level of indirection so that coefficients can be swapped by swapping pointers. 121 122A couple of extra temporary coefficients are allocated for operations which cannot be done entirely in-place. 123 1241. Efficient butterflies 125 126The FFT butterfly step is: 127 128[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})] 129 130We use MPIR's sumdiff to simultaneously perform the addition and subtraction. The multiplication by z^i is a shift by iw bits which we decompose into a shift by b bits and x limbs. The output is written in a location with an offset of x limbs. To handle the wraparound we split the operation into two parts. Finally we shift by the remaining b bits. An additional negation needs to occur when i >= n as nw = -1 mod p. 131 132The IFFT butterfly is: 133 134[a{i}, b{i}] => [a{i}+z^-i*b{i}, a{i}-z^-i*b{i}] 135 136We first decompose iw into b bits and x limbs. We perform the bit shift first, in-place. Then we use sumdiff, this time reading at an offset of x limbs, splitting the operation into two parts as before. 137 1382. Cache locality 139 140We use the Matrix Fourier Algorithm. To perform an FFT of length m = RC we: 141 142 * Split the coefficients into R rows of C columns 143 * Perform a length R FFT on each column, i.e. with an input stride of C 144 * Multiply each coefficient by z^{r*c} where z = exp(2*Pi*I/m), 145note z corresponds to a shift of w bits 146 * Perform a length C FFT on each row, i.e. with an input stride of 1 147 * Transpose the matrix of coefficients 148 149To perform an IFFT we complete the steps in reverse, using IFFT's instead of FFT's. 150 151We set R, C to be both around sqrt(m) to minimise the maximum size of FFT which is in cache at any one time. When the FFT is followed by the IFFT as in the convolution we do not perform the transposes of the matrix coefficients as they cancel each other out. 152 153We do not perform the twiddles by z^{rc} in a separate pass over the data. We combine them with the length R FFT's and IFFT's. They are combined with the butterflies at the very bottom level of the FFT's and IFFT's. They essentially cost nothing as they just increase the bit shifts already being performed. 154 155The algorithm expects the FFT's to output their coefficients in reverse binary order, thus we have to revbin the coefficient order after the column FFTs and before the column IFFTs. 156 1573. Truncation 158 159When performing a convolution where we know that many of the output coefficients will be zero (this happens when multiplying integers that are not of an optimal split-into-a-nice-power-of-two length) we can use Van der Hoeven's truncated FFT. 160 161There are two cases: a) less than or equal to half of the FFT output coeffs 162are non-zero and b) more than half the coefficients are non-zero: 163 164a) A 0 0 0 165 166b) A A A 0 167 168In the first case, the first layer of the FFT would do nothing. As we 169only care about the left half, we recurse on only the left half A 0, 170ignoring the remaining zeros. 171 172In the second case we compute the first layer of the FFT. We then do 173an ordinary FFT on the left half and recurse with a truncated FFT on 174the right half. 175 176Of course we need to be careful in that the first layer of the FFT 177will have replaced our zeroes with non-zero coefficients, so we don't 178recurse to the above two cases. 179 180We start instead with an FFT with non-zero coefficients (labelled B). 181 182A B B B 183 184or 185 186A A A B 187 188But the cases can be dealt with in a similar way to the cases where 189there are zeros. The saving comes in that we repeatedly ignore 190coefficients on the right hand side when they are all past the 191truncation point. 192 193The IFFT is slightly more involved. We know that we are going to 194*end up with* zeroes on the right hand side. We start with the results 195of the pointwise mults, though we do not perform all the pointwise 196mults. If we are going to end up with c zeroes, we do not perform the 197last c pointwise mults. 198 199So we want our IFFT to get to 200 201A A A 0 202 203starting from 204 205P P P ? 206 207Again there are two cases, depending on how many zeros we will end up with: 208 209a) A 0 0 0 210 211b) A A A 0 212 213In case (a) , by working backwards from what we know we will get, the 214next to last level must be 215 216A/2 0 (A/2)~ 0 where ~ is the opposite of the twiddle that will be 217applied by the IFFT butterfly. 218 219But I can compute the LHS, A/2 0, simply by recursing on the truncated 220IFFT. Then it is just a matter of multiplying by 2 to get A 0 which is 221what I was after. 222 223In case (b) an ordinary IFFT can compute the left hand of the 224penultimate layer, as we have all the necessary pointwise mults for 225that. 226 227A A A 0 228B B ? ? 229 230The right hand side we compute by recursing on the truncated IFFT. But 231we don't know what the final question mark is. To get it we have to 232reverse the steps of the IFFT to find it. As we have the second B we 233can compute the second A simply by performing some IFFT butterflies. 234Now we can compute the second ? by reversing the IFFT butterflies. So 235we are left with: 236 237A A' A 0' 238B' B' ? C' 239 240where I wrote a dash on the coefficients we actually now know. 241 242Now we can recurse using the truncated IFFT on the right hand side. 243 244Although the coefficients C' are not zero, the principles are the same 245and we split into two cases as above. 246 247This allows us to get the question mark, yielding: 248 249A A' A 0' 250B' B' C' C' 251 252and clearly now we can compute the A's we don't know from the known 253coefficients. 254 255To combine the MFA with truncation we simply truncate at one level of the MFA, i.e. set the truncation length to be a multiple of the length of the inner FFT's. When we are at the lower levels computing row FFT's we don't compute those which lie past the truncation point. 256 257We need to take care to perform the right pointwise mults because we do not transpose the matrix or output coefficients in revbin order. 258 2594. Negacyclic convolution 260 261The pointwise multiplications mod p are somtimes large enough to make use of an FFT. For this purpose we use a negacyclic convolution which naturally performs integer multiplication mod p. 262 263If we do this naively we break up into coefficients whose sizes are multiples of half the negacyclic FFT lengths. This becomes inefficient. 264 265In order to get around this we must perform two multiplications, one via a negacyclic FFT with big coefficients and one naively with very small coefficients and CRT them together. This gives more flexibility in the size of coefficients we use in the negacyclic FFT allowing the large pointwise multication mod p to be done efficiently (not implemented yet). 266 2675. Sqrt 2 trick 268 269In the ring Z/pZ where p = 2^S + 1 the value 2^(2S/4)-2^(S/4) is a 270square root of 2. This allows us to perform a convolution of twice 271the length without twice the cost. To perform the operations we need 272to be able to perform twiddles by powers of sqrt2. These are decomposed 273and the operations are combined as much as possible with the 274multiplications by powers of 2. 275 276Acknowledgements 277---------------- 278 279"Matters Computational: ideas, algorithms and source code", by Jorg 280Arndt, see http://www.jjj.de/fxt/fxtbook.pdf 281 282"Primes numbers: a computational perspective", by Richard Crandall and 283Carl Pomerance, 2nd ed., 2005, Springer. 284 285"A GMP-based implementation of Schonhage-Strassen's Large Integer 286Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and 287Paul Zimmermann, ISAAC 2007 proceedings, pp 167-174. See 288http://www.loria.fr/~gaudry/publis/issac07.pdf 289 290"Multidigit multiplication for mathematicians" by Dan Bernstein (to 291appear). see http://cr.yp.to/papers/m3.pdf 292 293"A cache-friendly truncated FFT" by David Harvey, Theor. Comput. Sci. 410 (2009), 2649.2658. See http://web.maths.unsw.edu.au/~davidharvey/papers/cache-trunc-fft/ 294 295"The truncated Fourier transform and applications" by Joris van der Hoeven, J. Gutierrez, editor, Proc. ISSAC 2004, pages 290.296, Univ. of Cantabria, Santander, Spain, July 4.7 2004. 296 297