1FFT Integer Multiplication code
2===============================
3
4License: BSD
5
6(Note the FLINT library of which this is a part is overall LGPL v2.1+ and the
7latest version of GMP/MPIR on which this currently depends is LGPL v3+. But
8the files in this implementation of the FFT are individually licensed BSD.)
9
10Introduction
11------------
12
13Many bignum libraries and programming languages do not contain fast code for
14multiplication of huge integers. It is important to have this when computing
15millions of digits of Pi, multiplying polynomials using Kronecker segmentation
16(or the Schoenhage-Strassen technique using an FFT directly) and a variety of
17other problems.
18
19Here we introduce fast FFT code for multiplication of huge integers.
20
21Currently the code depends on GMP/MPIR, however, any sufficiently well
22developed bignum library should have equivalent primitives for bignums. (There
23is also a dependence on the flint function n_revbin, which is found in the
24ulong_extras directory of flint -- I hereby license it under the BSD
25license as per the remainder of the FFT implementation.)
26
27To use the FFT for multiplying large integers, one needs to use the function
28flint_mpn_mul_fft_main as documented in the doc directory. This relies on tuning
29values supplied in fft_tuning.h in the top level source directory.
30
31Features:
32--------
33
34* Cache friendly up to huge transforms (integers of billions of bits)
35* Truncated -- no uglytwit performance jumps at power of 2 lengths and no problem
36  with unbalanced multiplications
37* Extremely fast
38* Easy to tune
39* Truncated FFT/IFFT functions can be used for polynomial multiplication
40
41Performance Data
42----------------
43
44Here are timings for multiplication of two integers of the given number of
45bits, comparing MPIR 2.4.0, this code and GMP 5.0.2 respectively. The timings
46are for varying numbers of iterations as specified. The timings were done on
47a 2.2GHz AMD K10-2 Mangy Cours machine.
48
49The tuning values used are specified in the final two columns.
50
51The first part of the table uses mul_truncate_sqrt2, the second half uses
52mul_mfa_truncate_sqrt2.
53
54bits          iters mpir   this   gmp     n w
55
56195840        1000  1.149s 1.105s 0.997s  7 16
57261120        1000  1.483s 1.415s 1.396s  7 16
58391296        100   0.261s 0.248s 0.282s  8 8
59521728        100   0.344s 0.315s 0.411s  8 8
60782592        100   0.577s 0.539s 0.628s  9 4
611043456       100   0.706s 0.688s 0.848s  9 4
621569024       100   1.229s 1.153s 1.317s  9 8
63
642092032       100   1.543s 1.440s 2.765s  9 8
653127296       10    0.283s 0.266s 0.408s 11 1
664169728       10    0.357s 0.335s 0.543s 11 1
676273024       10    0.621s 0.597s 0.843s 11 2
688364032       10    0.831s 0.742s 1.156s 11 2
6912539904      10    1.441s 1.394s 1.798s 12 1
7016719872      1     0.230s 0.205s 0.288s 12 1
7125122816      1     0.379s 0.336s 0.434s 12 2
7233497088      1     0.524s 0.428s 0.646s 12 2
7350245632      1     0.833s 0.693s 1.035s 13 1
7466994176      1     1.596s 0.896s 1.358s 13 1
75100577280     1     1.906s 1.552s 2.177s 13 2
76134103040     1     2.784s 2.076s 2.984s 13 2
77201129984     1     3.971s 3.158s 4.536s 14 1
78268173312     1     5.146s 4.137s 5.781s 14 1
79402456576     1     7.548s 6.443s 9.867s 14 2
80536608768     1     9.841s 8.365s 12.71s 14 2
81804913152     1     15.48s 13.29s 20.06s 15 1
821073217536    1     21.17s 17.16s 27.19s 15 1
831610219520    1     31.64s 28.60s 43.37s 15 2
842146959360    1     43.25s 37.02s 57.66s 15 2
853220340736    1     70.14s 58.09s 92.94s 16 1
864293787648    1     96.00s 74.26s 146.1s 16 1
876441566208    1     150.2s 131.1s 217.5s 16 2
888588754944    1     208.4s 175.0s 312.8s 16 2
8912883132416   1     327.4s 278.6s 447.7s 17 1
9017177509888   1     485.0s 360.ss 614.2s 17 1
91
92
93Additional tuning
94-----------------
95
96Technically one should tune the values that appear in fft_tuning.h. The
97mulmod_2expp1 tuning array indices correspond to (n, w) pairs starting
98at n = 12, w = 1. The values in the array should be nonnegative and less
99than 6. The length of the array is given by FFT_N_NUM. The cutoff
100FFT_MULMOD_2EXPP1_CUTOFF should also be tuned. It must be bigger than
101128. The function that these values tunes is in the file mulmod_2expp1.c.
102See the corresponding test function for an example of how to call it.
103
104The fft tuning array indices correspond to (n, w) pairs starting at
105n = 6, w = 1. The values in the array should be nonnegative and less
106than 6. The function that is tuned is in the file mpn_mul_fft_main.c.
107See the corresponding test function for an example of how to call it.
108The function implementation itself is the best reference for which
109inputs will use which table entries.
110
111Strategy
112--------
113
114Let's suppose we wish to compute a convolution of length 2n where n is a power of 2. We do this with a standard Fermat transform with coefficients mod p = 2^wn + 1. Note 2^w is a 2n-th root of unity.
115
116We assume wn is divisible by GMP_LIMB_BITS (either 32 or 64). In practice n is always divisible by this constant.
117
118Write l = wn/GMP_LIMB_BITS. Each coeff is stored in a block of l+1 limbs in twos complement form. We accumulate carries in the top limb meaning reduction mod p does not need to be done after an addition or subtraction.
119
120Coefficients are also accessed via one level of indirection so that coefficients can be swapped by swapping pointers.
121
122A couple of extra temporary coefficients are allocated for operations which cannot be done entirely in-place.
123
1241. Efficient butterflies
125
126The FFT butterfly step is:
127
128[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})]
129
130We use MPIR's sumdiff to simultaneously perform the addition and subtraction. The multiplication by z^i is a shift by iw bits which we decompose into a shift by b bits and x limbs. The output is written in a location with an offset of x limbs. To handle the wraparound we split the operation into two parts. Finally we shift by the remaining b bits. An additional negation needs to occur when i >= n as nw = -1 mod p.
131
132The IFFT butterfly is:
133
134[a{i}, b{i}] => [a{i}+z^-i*b{i}, a{i}-z^-i*b{i}]
135
136We first decompose iw into b bits and x limbs. We perform the bit shift first, in-place. Then we use sumdiff, this time reading at an offset of x limbs, splitting the operation into two parts as before.
137
1382. Cache locality
139
140We use the Matrix Fourier Algorithm. To perform an FFT of length m = RC we:
141
142  * Split the coefficients into R rows of C columns
143  * Perform a length R FFT on each column, i.e. with an input stride of C
144  * Multiply each coefficient by z^{r*c} where z = exp(2*Pi*I/m),
145note z corresponds to a shift of w bits
146  * Perform a length C FFT on each row, i.e. with an input stride of 1
147  * Transpose the matrix of coefficients
148
149To perform an IFFT we complete the steps in reverse, using IFFT's instead of FFT's.
150
151We set R, C to be both around sqrt(m) to minimise the maximum size of FFT which is in cache at any one time. When the FFT is followed by the IFFT as in the convolution we do not perform the transposes of the matrix coefficients as they cancel each other out.
152
153We do not perform the twiddles by z^{rc} in a separate pass over the data. We combine them with the length R FFT's and IFFT's. They are combined with the butterflies at the very bottom level of the FFT's and IFFT's. They essentially cost nothing as they just increase the bit shifts already being performed.
154
155The algorithm expects the FFT's to output their coefficients in reverse binary order, thus we have to revbin the coefficient order after the column FFTs and before the column IFFTs.
156
1573. Truncation
158
159When performing a convolution where we know that many of the output coefficients will be zero (this happens when multiplying integers that are not of an optimal split-into-a-nice-power-of-two length) we can use Van der Hoeven's truncated FFT.
160
161There are two cases: a) less than or equal to half of the FFT output coeffs
162are non-zero and b) more than half the coefficients are non-zero:
163
164a) A 0 0 0
165
166b) A A A 0
167
168In the first case, the first layer of the FFT would do nothing. As we
169only care about the left half, we recurse on only the left half A 0,
170ignoring the remaining zeros.
171
172In the second case we compute the first layer of the FFT. We then do
173an ordinary FFT on the left half and recurse with a truncated FFT on
174the right half.
175
176Of course we need to be careful in that the first layer of the FFT
177will have replaced our zeroes with non-zero coefficients, so we don't
178recurse to the above two cases.
179
180We start instead with an FFT with non-zero coefficients (labelled B).
181
182A B B B
183
184or
185
186A A A B
187
188But the cases can be dealt with in a similar way to the cases where
189there are zeros. The saving comes in that we repeatedly ignore
190coefficients on the right hand side when they are all past the
191truncation point.
192
193The IFFT is slightly more involved. We know that we are going to
194*end up with* zeroes on the right hand side. We start with the results
195of the pointwise mults, though we do not perform all the pointwise
196mults. If we are going to end up with c zeroes, we do not perform the
197last c pointwise mults.
198
199So we want our IFFT to get to
200
201A A A 0
202
203starting from
204
205P P P ?
206
207Again there are two cases, depending on how many zeros we will end up with:
208
209a) A 0 0 0
210
211b) A A A 0
212
213In case (a) , by working backwards from what we know we will get, the
214next to last level must be
215
216A/2 0 (A/2)~ 0 where ~ is the opposite of the twiddle that will be
217applied by the IFFT butterfly.
218
219But I can compute the LHS, A/2 0, simply by recursing on the truncated
220IFFT. Then it is just a matter of multiplying by 2 to get A 0 which is
221what I was after.
222
223In case (b) an ordinary IFFT can compute the left hand of the
224penultimate layer, as we have all the necessary pointwise mults for
225that.
226
227A A A 0
228B B ? ?
229
230The right hand side we compute by recursing on the truncated IFFT. But
231we don't know what the final question mark is. To get it we have to
232reverse the steps of the IFFT to find it. As we have the second B we
233can compute the second A simply by performing some IFFT butterflies.
234Now we can compute the second ? by reversing the IFFT butterflies. So
235we are left with:
236
237A A' A 0'
238B' B' ? C'
239
240where I wrote a dash on the coefficients we actually now know.
241
242Now we can recurse using the truncated IFFT on the right hand side.
243
244Although the coefficients C' are not zero, the principles are the same
245and we split into two cases as above.
246
247This allows us to get the question mark, yielding:
248
249A A' A 0'
250B' B' C' C'
251
252and clearly now we can compute the A's we don't know from the known
253coefficients.
254
255To combine the MFA with truncation we simply truncate at one level of the MFA, i.e. set the truncation length to be a multiple of the length of the inner FFT's. When we are at the lower levels computing row FFT's we don't compute those which lie past the truncation point.
256
257We need to take care to perform the right pointwise mults because we do not transpose the matrix or output coefficients in revbin order.
258
2594. Negacyclic convolution
260
261The pointwise multiplications mod p are somtimes large enough to make use of an FFT. For this purpose we use a negacyclic convolution which naturally performs integer multiplication mod p.
262
263If we do this naively we break up into coefficients whose sizes are multiples of half the negacyclic FFT lengths. This becomes inefficient.
264
265In order to get around this we must perform two multiplications, one via a negacyclic FFT with big coefficients and one naively with very small coefficients and CRT them together. This gives more flexibility in the size of coefficients we use in the negacyclic FFT allowing the large pointwise multication mod p to be done efficiently (not implemented yet).
266
2675. Sqrt 2 trick
268
269In the ring Z/pZ where p = 2^S + 1 the value 2^(2S/4)-2^(S/4) is a
270square root of 2. This allows us to perform a convolution of twice
271the length without twice the cost. To perform the operations we need
272to be able to perform twiddles by powers of sqrt2. These are decomposed
273and the operations are combined as much as possible with the
274multiplications by powers of 2.
275
276Acknowledgements
277----------------
278
279"Matters Computational: ideas, algorithms and source code", by Jorg
280Arndt, see http://www.jjj.de/fxt/fxtbook.pdf
281
282"Primes numbers: a computational perspective", by Richard Crandall and
283Carl Pomerance, 2nd ed., 2005, Springer.
284
285"A GMP-based implementation of Schonhage-Strassen's Large Integer
286Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and
287Paul Zimmermann, ISAAC 2007 proceedings, pp 167-174. See
288http://www.loria.fr/~gaudry/publis/issac07.pdf
289
290"Multidigit multiplication for mathematicians" by Dan Bernstein (to
291appear). see http://cr.yp.to/papers/m3.pdf
292
293"A cache-friendly truncated FFT" by David Harvey, Theor. Comput. Sci. 410 (2009), 2649.2658. See http://web.maths.unsw.edu.au/~davidharvey/papers/cache-trunc-fft/
294
295"The truncated Fourier transform and applications" by Joris van der Hoeven, J. Gutierrez, editor, Proc. ISSAC 2004, pages 290.296, Univ. of Cantabria, Santander, Spain, July 4.7 2004.
296
297