1.. _fft:
2
3**fft.h** -- Schoenhage-Strassen FFT
4================================================================================
5
6
7Split/combine FFT coefficients
8--------------------------------------------------------------------------------
9
10
11.. function:: mp_size_t fft_split_limbs(mp_limb_t ** poly, mp_srcptr limbs, mp_size_t total_limbs, mp_size_t coeff_limbs, mp_size_t output_limbs)
12
13    Split an integer ``(limbs, total_limbs)`` into coefficients of length
14    ``coeff_limbs`` limbs and store as the coefficients of ``poly``
15    which are assumed to have space for ``output_limbs + 1`` limbs per
16    coefficient. The coefficients of the polynomial do not need to be zeroed
17    before calling this function, however the number of coefficients written
18    is returned by the function and any coefficients beyond this point are
19    not touched.
20
21.. function:: mp_size_t fft_split_bits(mp_limb_t ** poly, mp_srcptr limbs, mp_size_t total_limbs, flint_bitcnt_t bits, mp_size_t output_limbs)
22
23    Split an integer ``(limbs, total_limbs)`` into coefficients of the
24    given number of ``bits`` and store as the coefficients of ``poly``
25    which are assumed to have space for ``output_limbs + 1`` limbs per
26    coefficient. The coefficients of the polynomial do not need to be zeroed
27    before calling this function, however the number of coefficients written
28    is returned by the function and any coefficients beyond this point are
29    not touched.
30
31.. function:: void fft_combine_limbs(mp_limb_t * res, mp_limb_t ** poly, slong length, mp_size_t coeff_limbs, mp_size_t output_limbs, mp_size_t total_limbs)
32
33    Evaluate the polynomial ``poly`` of the given ``length`` at
34    ``B^coeff_limbs``, where ``B = 2^FLINT_BITS``, and add the
35    result to the integer ``(res, total_limbs)`` throwing away any bits
36    that exceed the given number of limbs. The polynomial coefficients are
37    assumed to have at least ``output_limbs`` limbs each, however any
38    additional limbs are ignored.
39
40    If the integer is initially zero the result will just be the evaluation
41    of the polynomial.
42
43.. function:: void fft_combine_bits(mp_limb_t * res, mp_limb_t ** poly, slong length, flint_bitcnt_t bits, mp_size_t output_limbs, mp_size_t total_limbs)
44
45    Evaluate the polynomial ``poly`` of the given ``length`` at
46    ``2^bits`` and add the result to the integer
47    ``(res, total_limbs)`` throwing away any bits that exceed the given
48    number of limbs. The polynomial coefficients are assumed to have at least
49    ``output_limbs`` limbs each, however any additional limbs are ignored.
50    If the integer is initially zero the result will just be the evaluation
51    of the polynomial.
52
53
54Test helper functions
55--------------------------------------------------------------------------------
56
57
58.. function:: void fermat_to_mpz(mpz_t m, mp_limb_t * i, mp_size_t limbs)
59
60    Convert the Fermat number ``(i, limbs)`` modulo ``B^limbs + 1`` to
61    an ``mpz_t m``. Assumes ``m`` has been initialised. This function
62    is used only in test code.
63
64
65Arithmetic modulo a generalised Fermat number
66--------------------------------------------------------------------------------
67
68
69.. function:: void mpn_addmod_2expp1_1(mp_limb_t * r, mp_size_t limbs, mp_limb_signed_t c)
70
71    Adds the signed limb ``c`` to the generalised fermat number ``r``
72    modulo ``B^limbs + 1``. The compiler should be able to inline
73    this for the case that there is no overflow from the first limb.
74
75.. function:: void mpn_normmod_2expp1(mp_limb_t * t, mp_size_t limbs)
76
77    Given ``t`` a signed integer of ``limbs + 1`` limbs in twos
78    complement format, reduce ``t`` to the corresponding value modulo the
79    generalised Fermat number ``B^limbs + 1``, where
80    ``B = 2^FLINT_BITS``.
81
82.. function:: void mpn_mul_2expmod_2expp1(mp_limb_t * t, mp_limb_t * i1, mp_size_t limbs, flint_bitcnt_t d)
83
84    Given ``i1`` a signed integer of ``limbs + 1`` limbs in twos
85    complement format reduced modulo ``B^limbs + 1`` up to some
86    overflow, compute ``t = i1*2^d`` modulo `p`. The result will not
87    necessarily be fully reduced. The number of bits ``d`` must be
88    nonnegative and less than ``FLINT_BITS``. Aliasing is permitted.
89
90.. function:: void mpn_div_2expmod_2expp1(mp_limb_t * t, mp_limb_t * i1, mp_size_t limbs, flint_bitcnt_t d)
91
92    Given ``i1`` a signed integer of ``limbs + 1`` limbs in twos
93    complement format reduced modulo ``B^limbs + 1`` up to some
94    overflow, compute ``t = i1/2^d`` modulo `p`. The result will not
95    necessarily be fully reduced. The number of bits ``d`` must be
96    nonnegative and less than ``FLINT_BITS``. Aliasing is permitted.
97
98
99
100Generic butterflies
101--------------------------------------------------------------------------------
102
103
104.. function:: void fft_adjust(mp_limb_t * r, mp_limb_t * i1, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w)
105
106    Set ``r`` to ``i1`` times `z^i` modulo ``B^limbs + 1`` where
107    `z` corresponds to multiplication by `2^w`. This can be thought of as part
108    of a butterfly operation. We require `0 \leq i < n` where `nw =`
109    ``limbs*FLINT_BITS``. Aliasing is not supported.
110
111.. function:: void fft_adjust_sqrt2(mp_limb_t * r, mp_limb_t * i1, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w, mp_limb_t * temp)
112
113    Set ``r`` to ``i1`` times `z^i` modulo ``B^limbs + 1`` where
114    `z` corresponds to multiplication by `\sqrt{2}^w`. This can be thought of
115    as part of a butterfly operation. We require `0 \leq i < 2*n` and odd
116    where `nw =` ``limbs*FLINT_BITS``.
117
118.. function:: void butterfly_lshB(mp_limb_t * t, mp_limb_t * u, mp_limb_t * i1, mp_limb_t * i2, mp_size_t limbs, mp_size_t x, mp_size_t y)
119
120    We are given two integers ``i1`` and ``i2`` modulo
121    ``B^limbs + 1`` which are not necessarily normalised. We compute
122    ``t = (i1 + i2)*B^x`` and ``u = (i1 - i2)*B^y`` modulo `p`. Aliasing
123    between inputs and outputs is not permitted. We require ``x`` and
124    ``y`` to be less than ``limbs`` and nonnegative.
125
126.. function:: void butterfly_rshB(mp_limb_t * t, mp_limb_t * u, mp_limb_t * i1, mp_limb_t * i2, mp_size_t limbs, mp_size_t x, mp_size_t y)
127
128    We are given two integers ``i1`` and ``i2`` modulo
129    ``B^limbs + 1`` which are not necessarily normalised. We compute
130    ``t = (i1 + i2)/B^x`` and ``u = (i1 - i2)/B^y`` modulo `p`. Aliasing
131    between inputs and outputs is not permitted. We require ``x`` and
132    ``y`` to be less than ``limbs`` and nonnegative.
133
134
135Radix 2 transforms
136--------------------------------------------------------------------------------
137
138
139.. function:: void fft_butterfly(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w)
140
141    Set ``s = i1 + i2``, ``t = z1^i*(i1 - i2)`` modulo
142    ``B^limbs + 1`` where ``z1 = exp(Pi*I/n)`` corresponds to
143    multiplication by `2^w`. Requires `0 \leq i < n` where `nw =`
144    ``limbs*FLINT_BITS``.
145
146.. function:: void ifft_butterfly(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w)
147
148    Set ``s = i1 + z1^i*i2``, ``t = i1 -  z1^i*i2`` modulo
149    ``B^limbs + 1`` where\\ ``z1 = exp(-Pi*I/n)`` corresponds to
150    division by `2^w`. Requires `0 \leq i < 2n` where `nw =`
151    ``limbs*FLINT_BITS``.
152
153.. function:: void fft_radix2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2)
154
155    The radix 2 DIF FFT works as follows:
156
157    Input: ``[i0, i1, ..., i(m-1)]``, for `m = 2n` a power of `2`.
158
159    Output: ``[r0, r1, ..., r(m-1)]`` `` = FFT[i0, i1, ..., i(m-1)]``.
160
161    Algorithm:
162
163    | `\bullet` Recursively compute ``[r0, r2, r4, ...., r(m-2)]``
164    |     ``= FFT[i0+i(m/2), i1+i(m/2+1), ..., i(m/2-1)+i(m-1)]``
165    |
166    | `\bullet` Let ``[t0, t1, ..., t(m/2-1)]``
167    |     ``= [i0-i(m/2), i1-i(m/2+1), ..., i(m/2-1)-i(m-1)]``
168    |
169    | `\bullet` Let ``[u0, u1, ..., u(m/2-1)]``
170    |     ``= [z1^0*t0, z1^1*t1, ..., z1^(m/2-1)*t(m/2-1)]``
171    |     where ``z1 = exp(2*Pi*I/m)`` corresponds to multiplication by `2^w`.
172    |
173    | `\bullet` Recursively compute ``[r1, r3, ..., r(m-1)]``
174    |     ``= FFT[u0, u1, ..., u(m/2-1)]``
175
176    The parameters are as follows:
177
178    `\bullet` ``2*n`` is the length of the input and output arrays
179
180    `\bullet` `w` is such that `2^w` is an `2n`-th root of unity in the ring `\mathbf{Z}/p\mathbf{Z}` that we are working in, i.e. `p = 2^{wn} + 1` (here `n` is divisible by
181         ``GMP_LIMB_BITS``)
182
183    `\bullet` ``ii`` is the array of inputs (each input is an
184         array of limbs of length ``wn/GMP_LIMB_BITS + 1`` (the
185         extra limbs being a "carry limb"). Outputs are written
186         in-place.
187
188    We require `nw` to be at least 64 and the two temporary space pointers to
189    point to blocks of size ``n*w + FLINT_BITS`` bits.
190
191.. function:: void fft_truncate(mp_limb_t ** ii,  mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc)
192
193    As for ``fft_radix2`` except that only the first ``trunc``
194    coefficients of the output are computed and the input is regarded as
195    having (implied) zero coefficients from coefficient ``trunc`` onwards.
196    The coefficients must exist as the algorithm needs to use this extra
197    space, but their value is irrelevant. The value of ``trunc`` must be
198    divisible by 2.
199
200.. function:: void fft_truncate1(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc)
201
202    As for ``fft_radix2`` except that only the first ``trunc``
203    coefficients of the output are computed. The transform still needs all
204    `2n` input coefficients to be specified.
205
206.. function:: void ifft_radix2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2)
207
208    The radix 2 DIF IFFT works as follows:
209
210    Input: ``[i0, i1, ..., i(m-1)]``, for `m = 2n` a power of `2`.
211
212    Output: ``[r0, r1, ..., r(m-1)]``\\
213            `` = IFFT[i0, i1, ..., i(m-1)]``.
214
215    Algorithm:
216
217    `\bullet` Recursively compute ``[s0, s1, ...., s(m/2-1)]``
218         ``= IFFT[i0, i2, ..., i(m-2)]``
219
220    `\bullet` Recursively compute ``[t(m/2), t(m/2+1), ..., t(m-1)]``
221         ``= IFFT[i1, i3, ..., i(m-1)]``
222
223    `\bullet` Let ``[r0, r1, ..., r(m/2-1)]``
224         ``= [s0+z1^0*t0, s1+z1^1*t1, ..., s(m/2-1)+z1^(m/2-1)*t(m/2-1)]`` where ``z1 = exp(-2*Pi*I/m)`` corresponds to division by `2^w`.
225
226    `\bullet` Let ``[r(m/2), r(m/2+1), ..., r(m-1)]``
227        ``= [s0-z1^0*t0, s1-z1^1*t1, ..., s(m/2-1)-z1^(m/2-1)*t(m/2-1)]``
228
229    The parameters are as follows:
230
231    `\bullet` ``2*n`` is the length of the input and output
232        arrays
233
234    `\bullet` `w` is such that `2^w` is an `2n`-th root of unity in the ring `\mathbf{Z}/p\mathbf{Z}` that we are working in, i.e. `p = 2^{wn} + 1` (here `n` is divisible by
235        ``GMP_LIMB_BITS``)
236
237    `\bullet` ``ii`` is the array of inputs (each input is an array of limbs of length ``wn/GMP_LIMB_BITS + 1`` (the extra limbs being a "carry limb"). Outputs are written in-place.
238
239    We require `nw` to be at least 64 and the two temporary space pointers
240    to point to blocks of size ``n*w + FLINT_BITS`` bits.
241
242.. function:: void ifft_truncate(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc)
243
244    As for ``ifft_radix2`` except that the output is assumed to have
245    zeros from coefficient trunc onwards and only the first trunc
246    coefficients of the input are specified. The remaining coefficients need
247    to exist as the extra space is needed, but their value is irrelevant.
248    The value of ``trunc`` must be divisible by 2.
249
250    Although the implementation does not require it, we assume for simplicity
251    that ``trunc`` is greater than `n`. The algorithm begins by computing
252    the inverse transform of the first `n` coefficients of the input array.
253    The unspecified coefficients of the second half of the array are then
254    written: coefficient ``trunc + i`` is computed as a twist of
255    coefficient ``i`` by a root of unity. The values of these coefficients
256    are then equal to what they would have been if the inverse transform of
257    the right hand side of the input array had been computed with full data
258    from the start. The function ``ifft_truncate1`` is then called on the
259    entire right half of the input array with this auxilliary data filled in.
260    Finally a single layer of the IFFT is completed on all the coefficients
261    up to ``trunc`` being careful to note that this involves doubling the
262    coefficients from ``trunc - n`` up to ``n``.
263
264.. function:: void ifft_truncate1(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc)
265
266    Computes the first ``trunc`` coefficients of the radix 2 inverse
267    transform assuming the first ``trunc`` coefficients are given and that
268    the remaining coefficients have been set to the value they would have if
269    an inverse transform had already been applied with full data.
270
271    The algorithm is the same as for ``ifft_truncate`` except that the
272    coefficients from ``trunc`` onwards after the inverse transform are
273    not inferred to be zero but the supplied values.
274
275.. function:: void fft_butterfly_sqrt2(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w, mp_limb_t * temp)
276
277    Let `w = 2k + 1`, `i = 2j + 1`. Set ``s = i1 + i2``,
278    ``t = z1^i*(i1 - i2)`` modulo ``B^limbs + 1`` where
279    ``z1^2 = exp(Pi*I/n)`` corresponds to multiplication by `2^w`. Requires
280    `0 \leq i < 2n` where `nw =` ``limbs*FLINT_BITS``.
281
282    Here ``z1`` corresponds to multiplication by `2^k` then multiplication
283    by\\ ``(2^(3nw/4) - 2^(nw/4))``. We see ``z1^i`` corresponds to
284    multiplication by ``(2^(3nw/4) - 2^(nw/4))*2^(j+ik)``.
285
286    We first multiply by ``2^(j + ik + wn/4)`` then multiply by an
287    additional ``2^(nw/2)`` and subtract.
288
289.. function:: void ifft_butterfly_sqrt2(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w, mp_limb_t * temp)
290
291    Let `w = 2k + 1`, `i = 2j + 1`. Set ``s = i1 + z1^i*i2``,
292    ``t = i1 - z1^i*i2`` modulo ``B^limbs + 1`` where
293    ``z1^2 = exp(-Pi*I/n)`` corresponds to division by `2^w`. Requires
294    `0 \leq i < 2n` where `nw =` ``limbs*FLINT_BITS``.
295
296    Here ``z1`` corresponds to division by `2^k` then division by
297    ``(2^(3nw/4) - 2^(nw/4))``. We see ``z1^i`` corresponds to division
298    by ``(2^(3nw/4) - 2^(nw/4))*2^(j+ik)`` which is the same as division
299    by ``2^(j+ik + 1)`` then multiplication by
300    ``(2^(3nw/4) - 2^(nw/4))``.
301
302    Of course, division by ``2^(j+ik + 1)`` is the same as multiplication
303    by ``2^(2*wn - j - ik - 1)``. The exponent is positive as
304    `i \leq 2*n`, `j < n`, `k < w/2`.
305
306    We first multiply by ``2^(2*wn - j - ik - 1 + wn/4)`` then multiply by
307    an additional ``2^(nw/2)`` and subtract.
308
309.. function:: void fft_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t trunc)
310
311    As per ``fft_truncate`` except that the transform is twice the usual
312    length, i.e. length `4n` rather than `2n`. This is achieved by making use
313    of twiddles by powers of a square root of 2, not powers of 2 in the first
314    layer of the transform.
315
316    We require `nw` to be at least 64 and the three temporary space pointers
317    to point to blocks of size ``n*w + FLINT_BITS`` bits.
318
319.. function:: void ifft_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t trunc)
320
321    As per ``ifft_truncate`` except that the transform is twice the usual
322    length, i.e. length `4n` instead of `2n`. This is achieved by making use
323    of twiddles by powers of a square root of 2, not powers of 2 in the final
324    layer of the transform.
325
326    We require `nw` to be at least 64 and the three temporary space pointers
327    to point to blocks of size ``n*w + FLINT_BITS`` bits.
328
329
330Matrix Fourier Transforms
331--------------------------------------------------------------------------------
332
333
334.. function:: void fft_butterfly_twiddle(mp_limb_t * u, mp_limb_t * v, mp_limb_t * s, mp_limb_t * t, mp_size_t limbs, flint_bitcnt_t b1, flint_bitcnt_t b2)
335
336    Set ``u = 2^b1*(s + t)``, ``v = 2^b2*(s - t)`` modulo
337    ``B^limbs + 1``. This is used to compute
338    ``u = 2^(ws*tw1)*(s + t)``,\\ ``v = 2^(w+ws*tw2)*(s - t)`` in the
339    matrix fourier algorithm, i.e. effectively computing an ordinary butterfly
340    with additional twiddles by ``z1^rc`` for row `r` and column `c` of the
341    matrix of coefficients. Aliasing is not allowed.
342
343.. function:: void ifft_butterfly_twiddle(mp_limb_t * u, mp_limb_t * v, mp_limb_t * s, mp_limb_t * t, mp_size_t limbs, flint_bitcnt_t b1, flint_bitcnt_t b2)
344
345    Set ``u = s/2^b1 + t/2^b1)``, ``v = s/2^b1 - t/2^b1`` modulo
346    ``B^limbs + 1``. This is used to compute
347    ``u = 2^(-ws*tw1)*s + 2^(-ws*tw2)*t)``,\\
348    ``v = 2^(-ws*tw1)*s + 2^(-ws*tw2)*t)`` in the matrix fourier algorithm,
349    i.e. effectively computing an ordinary butterfly with additional twiddles
350    by ``z1^(-rc)`` for row `r` and column `c` of the matrix of
351    coefficients. Aliasing is not allowed.
352
353.. function:: void fft_radix2_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs)
354
355    As for ``fft_radix2`` except that the coefficients are spaced by
356    ``is`` in the array ``ii`` and an additional twist by ``z^c*i``
357    is applied to each coefficient where `i` starts at `r` and increases by
358    ``rs`` as one moves from one coefficient to the next. Here ``z``
359    corresponds to multiplication by ``2^ws``.
360
361.. function:: void ifft_radix2_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs)
362
363    As for ``ifft_radix2`` except that the coefficients are spaced by
364    ``is`` in the array ``ii`` and an additional twist by
365    ``z^(-c*i)`` is applied to each coefficient where `i` starts at `r`
366    and increases by ``rs`` as one moves from one coefficient to the next.
367    Here ``z`` corresponds to multiplication by ``2^ws``.
368
369.. function:: void fft_truncate1_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs, mp_size_t trunc)
370
371    As per ``fft_radix2_twiddle`` except that the transform is truncated
372    as per\\ ``fft_truncate1``.
373
374.. function:: void ifft_truncate1_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs, mp_size_t trunc)
375
376    As per ``ifft_radix2_twiddle`` except that the transform is truncated
377    as per\\ ``ifft_truncate1``.
378
379.. function:: void fft_mfa_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc)
380
381    This is as per the ``fft_truncate_sqrt2`` function except that the
382    matrix fourier algorithm is used for the left and right FFTs. The total
383    transform length is `4n` where ``n = 2^depth`` so that the left and
384    right transforms are both length `2n`. We require ``trunc > 2*n`` and
385    that ``trunc`` is divisible by ``2*n1`` (explained below).
386
387    The matrix fourier algorithm, which is applied to each transform of length
388    `2n`, works as follows. We set ``n1`` to a power of 2 about the square
389    root of `n`. The data is then thought of as a set of ``n2`` rows each
390    with ``n1`` columns (so that ``n1*n2 = 2n``).
391
392    The length `2n` transform is then computed using a whole pile of short
393    transforms. These comprise ``n1`` column transforms of length ``n2``
394    followed by some twiddles by roots of unity (namely ``z^rc`` where `r`
395    is the row and `c` the column within the data) followed by ``n2``
396    row transforms of length ``n1``. Along the way the data needs to be
397    rearranged due to the fact that the short transforms output the data in
398    binary reversed order compared with what is needed.
399
400    The matrix fourier algorithm provides better cache locality by decomposing
401    the long length `2n` transforms into many transforms of about the square
402    root of the original length.
403
404    For better cache locality the sqrt2 layer of the full length `4n`
405    transform is folded in with the column FFTs performed as part of the first
406    matrix fourier algorithm on the left half of the data.
407
408    The second half of the data requires a truncated version of the matrix
409    fourier algorithm. This is achieved by truncating to an exact multiple of
410    the row length so that the row transforms are full length. Moreover, the
411    column transforms will then be truncated transforms and their truncated
412    length needs to be a multiple of 2. This explains the condition on
413    ``trunc`` given above.
414
415    To improve performance, the extra twiddles by roots of unity are combined
416    with the butterflies performed at the last layer of the column transforms.
417
418    We require `nw` to be at least 64 and the three temporary space pointers
419    to point to blocks of size ``n*w + FLINT_BITS`` bits.
420
421.. function:: void ifft_mfa_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc)
422
423    This is as per the ``ifft_truncate_sqrt2`` function except that the
424    matrix fourier algorithm is used for the left and right IFFTs. The total
425    transform length is `4n` where ``n = 2^depth`` so that the left and
426    right transforms are both length `2n`. We require ``trunc > 2*n`` and
427    that ``trunc`` is divisible by ``2*n1``.
428
429    We set ``n1`` to a power of 2 about the square root of `n`.
430
431    As per the matrix fourier FFT the sqrt2 layer is folded into the the
432    final column IFFTs for better cache locality and the extra twiddles that
433    occur in the matrix fourier algorithm are combined with the butterflied
434    performed at the first layer of the final column transforms.
435
436    We require `nw` to be at least 64 and the three temporary space pointers
437    to point to blocks of size ``n*w + FLINT_BITS`` bits.
438
439.. function:: void fft_mfa_truncate_sqrt2_outer(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc)
440
441    Just the outer layers of ``fft_mfa_truncate_sqrt2``.
442
443.. function:: void fft_mfa_truncate_sqrt2_inner(mp_limb_t ** ii, mp_limb_t ** jj, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc, mp_limb_t * tt)
444
445    The inner layers of ``fft_mfa_truncate_sqrt2`` and
446    ``ifft_mfa_truncate_sqrt2`` combined with pointwise mults.
447
448.. function:: void ifft_mfa_truncate_sqrt2_outer(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc)
449
450    The outer layers of ``ifft_mfa_truncate_sqrt2`` combined with
451    normalisation.
452
453
454Negacyclic multiplication
455--------------------------------------------------------------------------------
456
457
458.. function:: void fft_negacyclic(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp)
459
460    As per ``fft_radix2`` except that it performs a sqrt2 negacyclic
461    transform of length `2n`. This is the same as the radix 2 transform
462    except that the `i`-th coefficient of the input is first multiplied by
463    `\sqrt{2}^{iw}`.
464
465    We require `nw` to be at least 64 and the two temporary space pointers to
466    point to blocks of size ``n*w + FLINT_BITS`` bits.
467
468.. function:: void ifft_negacyclic(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp)
469
470    As per ``ifft_radix2`` except that it performs a sqrt2 negacyclic
471    inverse transform of length `2n`. This is the same as the radix 2 inverse
472    transform except that the `i`-th coefficient of the output is finally
473    divided by `\sqrt{2}^{iw}`.
474
475    We require `nw` to be at least 64 and the two temporary space pointers to
476    point to blocks of size ``n*w + FLINT_BITS`` bits.
477
478.. function:: void fft_naive_convolution_1(mp_limb_t * r, mp_limb_t * ii, mp_limb_t * jj, mp_size_t m)
479
480    Performs a naive negacyclic convolution of ``ii`` with ``jj``,
481    both of length `m` and sets `r` to the result. This is essentially
482    multiplication of polynomials modulo `x^m + 1`.
483
484.. function:: void _fft_mulmod_2expp1(mp_limb_t * r1, mp_limb_t * i1, mp_limb_t * i2, mp_size_t r_limbs, flint_bitcnt_t depth, flint_bitcnt_t w)
485
486    Multiply ``i1`` by ``i2`` modulo ``B^r_limbs + 1`` where
487    ``r_limbs = nw/FLINT_BITS`` with ``n = 2^depth``. Uses the
488    negacyclic FFT convolution CRT'd with a 1 limb naive convolution. We
489    require that ``depth`` and ``w`` have been selected as per the
490    wrapper ``fft_mulmod_2expp1`` below.
491
492.. function:: slong fft_adjust_limbs(mp_size_t limbs)
493
494    Given a number of limbs, returns a new number of limbs (no more than
495    the next power of 2) which will work with the Nussbaumer code. It is only
496    necessary to make this adjustment if
497    ``limbs > FFT_MULMOD_2EXPP1_CUTOFF``.
498
499.. function:: void fft_mulmod_2expp1(mp_limb_t * r, mp_limb_t * i1, mp_limb_t * i2, mp_size_t n, mp_size_t w, mp_limb_t * tt)
500
501    As per ``_fft_mulmod_2expp1`` but with a tuned cutoff below which more
502    classical methods are used for the convolution. The temporary space is
503    required to fit ``n*w + FLINT_BITS`` bits. There are no restrictions
504    on `n`, but if ``limbs = n*w/FLINT_BITS`` then if ``limbs`` exceeds
505    ``FFT_MULMOD_2EXPP1_CUTOFF`` the function ``fft_adjust_limbs`` must
506    be called to increase the number of limbs to an appropriate value.
507
508
509Integer multiplication
510--------------------------------------------------------------------------------
511
512
513.. function:: void mul_truncate_sqrt2(mp_ptr r1, mp_srcptr i1, mp_size_t n1, mp_srcptr i2, mp_size_t n2, flint_bitcnt_t depth, flint_bitcnt_t w)
514
515    Integer multiplication using the radix 2 truncated sqrt2 transforms.
516
517    Set ``(r1, n1 + n2)`` to the product of ``(i1, n1)`` by
518    ``(i2, n2)``. This is achieved through an FFT convolution of length at
519    most ``2^(depth + 2)`` with coefficients of size `nw` bits where
520    ``n = 2^depth``. We require ``depth >= 6``. The input data is
521    broken into chunks of data not exceeding ``(nw - (depth + 1))/2``
522    bits. If breaking the first integer into chunks of this size results in
523    ``j1`` coefficients and breaking the second integer results in
524    ``j2`` chunks then ``j1 + j2 - 1 <= 2^(depth + 2)``.
525
526    If ``n = 2^depth`` then we require `nw` to be at least 64.
527
528.. function:: void mul_mfa_truncate_sqrt2(mp_ptr r1, mp_srcptr i1, mp_size_t n1, mp_srcptr i2, mp_size_t n2, flint_bitcnt_t depth, flint_bitcnt_t w)
529
530    As for ``mul_truncate_sqrt2`` except that the cache friendly matrix
531    fourier algorithm is used.
532
533    If ``n = 2^depth`` then we require `nw` to be at least 64. Here we
534    also require `w` to be `2^i` for some `i \geq 0`.
535
536.. function:: void flint_mpn_mul_fft_main(mp_ptr r1, mp_srcptr i1, mp_size_t n1, mp_srcptr i2, mp_size_t n2)
537
538    The main integer multiplication routine. Sets ``(r1, n1 + n2)`` to
539    ``(i1, n1)`` times ``(i2, n2)``. We require ``n1 >= n2 > 0``.
540
541
542Convolution
543--------------------------------------------------------------------------------
544
545
546.. function:: void fft_convolution(mp_limb_t ** ii, mp_limb_t ** jj, slong depth, slong limbs, slong trunc, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** s1, mp_limb_t * tt)
547
548    Perform an FFT convolution of ``ii`` with ``jj``, both of length
549    ``4*n`` where ``n = 2^depth``. Assume that all but the first
550    ``trunc`` coefficients of the output (placed in ``ii``) are zero.
551    Each coefficient is taken modulo ``B^limbs + 1``. The temporary
552    spaces ``t1``, ``t2`` and ``s1`` must have ``limbs + 1``
553    limbs of space and ``tt`` must have ``2*(limbs + 1)`` of free
554    space.
555
556FFT Precaching
557-------------------------------------------------------------------------------
558
559
560.. function:: void fft_precache(mp_limb_t ** jj, slong depth, slong limbs, slong trunc, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** s1)
561
562    Precompute the FFT of ``jj`` for use with precache functions. The
563    parameters are as for ``fft_convolution``.
564
565.. function:: void fft_convolution_precache(mp_limb_t ** ii, mp_limb_t ** jj, slong depth, slong limbs, slong trunc, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** s1, mp_limb_t ** tt)
566
567    As per ``fft_convolution`` except that it is assumed ``fft_precache`` has
568    been called on ``jj`` with the same parameters. This will then run faster
569    than if ``fft_convolution`` had been run with the original ``jj``.
570