1.. _fft: 2 3**fft.h** -- Schoenhage-Strassen FFT 4================================================================================ 5 6 7Split/combine FFT coefficients 8-------------------------------------------------------------------------------- 9 10 11.. function:: mp_size_t fft_split_limbs(mp_limb_t ** poly, mp_srcptr limbs, mp_size_t total_limbs, mp_size_t coeff_limbs, mp_size_t output_limbs) 12 13 Split an integer ``(limbs, total_limbs)`` into coefficients of length 14 ``coeff_limbs`` limbs and store as the coefficients of ``poly`` 15 which are assumed to have space for ``output_limbs + 1`` limbs per 16 coefficient. The coefficients of the polynomial do not need to be zeroed 17 before calling this function, however the number of coefficients written 18 is returned by the function and any coefficients beyond this point are 19 not touched. 20 21.. function:: mp_size_t fft_split_bits(mp_limb_t ** poly, mp_srcptr limbs, mp_size_t total_limbs, flint_bitcnt_t bits, mp_size_t output_limbs) 22 23 Split an integer ``(limbs, total_limbs)`` into coefficients of the 24 given number of ``bits`` and store as the coefficients of ``poly`` 25 which are assumed to have space for ``output_limbs + 1`` limbs per 26 coefficient. The coefficients of the polynomial do not need to be zeroed 27 before calling this function, however the number of coefficients written 28 is returned by the function and any coefficients beyond this point are 29 not touched. 30 31.. function:: void fft_combine_limbs(mp_limb_t * res, mp_limb_t ** poly, slong length, mp_size_t coeff_limbs, mp_size_t output_limbs, mp_size_t total_limbs) 32 33 Evaluate the polynomial ``poly`` of the given ``length`` at 34 ``B^coeff_limbs``, where ``B = 2^FLINT_BITS``, and add the 35 result to the integer ``(res, total_limbs)`` throwing away any bits 36 that exceed the given number of limbs. The polynomial coefficients are 37 assumed to have at least ``output_limbs`` limbs each, however any 38 additional limbs are ignored. 39 40 If the integer is initially zero the result will just be the evaluation 41 of the polynomial. 42 43.. function:: void fft_combine_bits(mp_limb_t * res, mp_limb_t ** poly, slong length, flint_bitcnt_t bits, mp_size_t output_limbs, mp_size_t total_limbs) 44 45 Evaluate the polynomial ``poly`` of the given ``length`` at 46 ``2^bits`` and add the result to the integer 47 ``(res, total_limbs)`` throwing away any bits that exceed the given 48 number of limbs. The polynomial coefficients are assumed to have at least 49 ``output_limbs`` limbs each, however any additional limbs are ignored. 50 If the integer is initially zero the result will just be the evaluation 51 of the polynomial. 52 53 54Test helper functions 55-------------------------------------------------------------------------------- 56 57 58.. function:: void fermat_to_mpz(mpz_t m, mp_limb_t * i, mp_size_t limbs) 59 60 Convert the Fermat number ``(i, limbs)`` modulo ``B^limbs + 1`` to 61 an ``mpz_t m``. Assumes ``m`` has been initialised. This function 62 is used only in test code. 63 64 65Arithmetic modulo a generalised Fermat number 66-------------------------------------------------------------------------------- 67 68 69.. function:: void mpn_addmod_2expp1_1(mp_limb_t * r, mp_size_t limbs, mp_limb_signed_t c) 70 71 Adds the signed limb ``c`` to the generalised fermat number ``r`` 72 modulo ``B^limbs + 1``. The compiler should be able to inline 73 this for the case that there is no overflow from the first limb. 74 75.. function:: void mpn_normmod_2expp1(mp_limb_t * t, mp_size_t limbs) 76 77 Given ``t`` a signed integer of ``limbs + 1`` limbs in twos 78 complement format, reduce ``t`` to the corresponding value modulo the 79 generalised Fermat number ``B^limbs + 1``, where 80 ``B = 2^FLINT_BITS``. 81 82.. function:: void mpn_mul_2expmod_2expp1(mp_limb_t * t, mp_limb_t * i1, mp_size_t limbs, flint_bitcnt_t d) 83 84 Given ``i1`` a signed integer of ``limbs + 1`` limbs in twos 85 complement format reduced modulo ``B^limbs + 1`` up to some 86 overflow, compute ``t = i1*2^d`` modulo `p`. The result will not 87 necessarily be fully reduced. The number of bits ``d`` must be 88 nonnegative and less than ``FLINT_BITS``. Aliasing is permitted. 89 90.. function:: void mpn_div_2expmod_2expp1(mp_limb_t * t, mp_limb_t * i1, mp_size_t limbs, flint_bitcnt_t d) 91 92 Given ``i1`` a signed integer of ``limbs + 1`` limbs in twos 93 complement format reduced modulo ``B^limbs + 1`` up to some 94 overflow, compute ``t = i1/2^d`` modulo `p`. The result will not 95 necessarily be fully reduced. The number of bits ``d`` must be 96 nonnegative and less than ``FLINT_BITS``. Aliasing is permitted. 97 98 99 100Generic butterflies 101-------------------------------------------------------------------------------- 102 103 104.. function:: void fft_adjust(mp_limb_t * r, mp_limb_t * i1, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w) 105 106 Set ``r`` to ``i1`` times `z^i` modulo ``B^limbs + 1`` where 107 `z` corresponds to multiplication by `2^w`. This can be thought of as part 108 of a butterfly operation. We require `0 \leq i < n` where `nw =` 109 ``limbs*FLINT_BITS``. Aliasing is not supported. 110 111.. function:: void fft_adjust_sqrt2(mp_limb_t * r, mp_limb_t * i1, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w, mp_limb_t * temp) 112 113 Set ``r`` to ``i1`` times `z^i` modulo ``B^limbs + 1`` where 114 `z` corresponds to multiplication by `\sqrt{2}^w`. This can be thought of 115 as part of a butterfly operation. We require `0 \leq i < 2*n` and odd 116 where `nw =` ``limbs*FLINT_BITS``. 117 118.. function:: void butterfly_lshB(mp_limb_t * t, mp_limb_t * u, mp_limb_t * i1, mp_limb_t * i2, mp_size_t limbs, mp_size_t x, mp_size_t y) 119 120 We are given two integers ``i1`` and ``i2`` modulo 121 ``B^limbs + 1`` which are not necessarily normalised. We compute 122 ``t = (i1 + i2)*B^x`` and ``u = (i1 - i2)*B^y`` modulo `p`. Aliasing 123 between inputs and outputs is not permitted. We require ``x`` and 124 ``y`` to be less than ``limbs`` and nonnegative. 125 126.. function:: void butterfly_rshB(mp_limb_t * t, mp_limb_t * u, mp_limb_t * i1, mp_limb_t * i2, mp_size_t limbs, mp_size_t x, mp_size_t y) 127 128 We are given two integers ``i1`` and ``i2`` modulo 129 ``B^limbs + 1`` which are not necessarily normalised. We compute 130 ``t = (i1 + i2)/B^x`` and ``u = (i1 - i2)/B^y`` modulo `p`. Aliasing 131 between inputs and outputs is not permitted. We require ``x`` and 132 ``y`` to be less than ``limbs`` and nonnegative. 133 134 135Radix 2 transforms 136-------------------------------------------------------------------------------- 137 138 139.. function:: void fft_butterfly(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w) 140 141 Set ``s = i1 + i2``, ``t = z1^i*(i1 - i2)`` modulo 142 ``B^limbs + 1`` where ``z1 = exp(Pi*I/n)`` corresponds to 143 multiplication by `2^w`. Requires `0 \leq i < n` where `nw =` 144 ``limbs*FLINT_BITS``. 145 146.. function:: void ifft_butterfly(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w) 147 148 Set ``s = i1 + z1^i*i2``, ``t = i1 - z1^i*i2`` modulo 149 ``B^limbs + 1`` where\\ ``z1 = exp(-Pi*I/n)`` corresponds to 150 division by `2^w`. Requires `0 \leq i < 2n` where `nw =` 151 ``limbs*FLINT_BITS``. 152 153.. function:: void fft_radix2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2) 154 155 The radix 2 DIF FFT works as follows: 156 157 Input: ``[i0, i1, ..., i(m-1)]``, for `m = 2n` a power of `2`. 158 159 Output: ``[r0, r1, ..., r(m-1)]`` `` = FFT[i0, i1, ..., i(m-1)]``. 160 161 Algorithm: 162 163 | `\bullet` Recursively compute ``[r0, r2, r4, ...., r(m-2)]`` 164 | ``= FFT[i0+i(m/2), i1+i(m/2+1), ..., i(m/2-1)+i(m-1)]`` 165 | 166 | `\bullet` Let ``[t0, t1, ..., t(m/2-1)]`` 167 | ``= [i0-i(m/2), i1-i(m/2+1), ..., i(m/2-1)-i(m-1)]`` 168 | 169 | `\bullet` Let ``[u0, u1, ..., u(m/2-1)]`` 170 | ``= [z1^0*t0, z1^1*t1, ..., z1^(m/2-1)*t(m/2-1)]`` 171 | where ``z1 = exp(2*Pi*I/m)`` corresponds to multiplication by `2^w`. 172 | 173 | `\bullet` Recursively compute ``[r1, r3, ..., r(m-1)]`` 174 | ``= FFT[u0, u1, ..., u(m/2-1)]`` 175 176 The parameters are as follows: 177 178 `\bullet` ``2*n`` is the length of the input and output arrays 179 180 `\bullet` `w` is such that `2^w` is an `2n`-th root of unity in the ring `\mathbf{Z}/p\mathbf{Z}` that we are working in, i.e. `p = 2^{wn} + 1` (here `n` is divisible by 181 ``GMP_LIMB_BITS``) 182 183 `\bullet` ``ii`` is the array of inputs (each input is an 184 array of limbs of length ``wn/GMP_LIMB_BITS + 1`` (the 185 extra limbs being a "carry limb"). Outputs are written 186 in-place. 187 188 We require `nw` to be at least 64 and the two temporary space pointers to 189 point to blocks of size ``n*w + FLINT_BITS`` bits. 190 191.. function:: void fft_truncate(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc) 192 193 As for ``fft_radix2`` except that only the first ``trunc`` 194 coefficients of the output are computed and the input is regarded as 195 having (implied) zero coefficients from coefficient ``trunc`` onwards. 196 The coefficients must exist as the algorithm needs to use this extra 197 space, but their value is irrelevant. The value of ``trunc`` must be 198 divisible by 2. 199 200.. function:: void fft_truncate1(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc) 201 202 As for ``fft_radix2`` except that only the first ``trunc`` 203 coefficients of the output are computed. The transform still needs all 204 `2n` input coefficients to be specified. 205 206.. function:: void ifft_radix2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2) 207 208 The radix 2 DIF IFFT works as follows: 209 210 Input: ``[i0, i1, ..., i(m-1)]``, for `m = 2n` a power of `2`. 211 212 Output: ``[r0, r1, ..., r(m-1)]``\\ 213 `` = IFFT[i0, i1, ..., i(m-1)]``. 214 215 Algorithm: 216 217 `\bullet` Recursively compute ``[s0, s1, ...., s(m/2-1)]`` 218 ``= IFFT[i0, i2, ..., i(m-2)]`` 219 220 `\bullet` Recursively compute ``[t(m/2), t(m/2+1), ..., t(m-1)]`` 221 ``= IFFT[i1, i3, ..., i(m-1)]`` 222 223 `\bullet` Let ``[r0, r1, ..., r(m/2-1)]`` 224 ``= [s0+z1^0*t0, s1+z1^1*t1, ..., s(m/2-1)+z1^(m/2-1)*t(m/2-1)]`` where ``z1 = exp(-2*Pi*I/m)`` corresponds to division by `2^w`. 225 226 `\bullet` Let ``[r(m/2), r(m/2+1), ..., r(m-1)]`` 227 ``= [s0-z1^0*t0, s1-z1^1*t1, ..., s(m/2-1)-z1^(m/2-1)*t(m/2-1)]`` 228 229 The parameters are as follows: 230 231 `\bullet` ``2*n`` is the length of the input and output 232 arrays 233 234 `\bullet` `w` is such that `2^w` is an `2n`-th root of unity in the ring `\mathbf{Z}/p\mathbf{Z}` that we are working in, i.e. `p = 2^{wn} + 1` (here `n` is divisible by 235 ``GMP_LIMB_BITS``) 236 237 `\bullet` ``ii`` is the array of inputs (each input is an array of limbs of length ``wn/GMP_LIMB_BITS + 1`` (the extra limbs being a "carry limb"). Outputs are written in-place. 238 239 We require `nw` to be at least 64 and the two temporary space pointers 240 to point to blocks of size ``n*w + FLINT_BITS`` bits. 241 242.. function:: void ifft_truncate(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc) 243 244 As for ``ifft_radix2`` except that the output is assumed to have 245 zeros from coefficient trunc onwards and only the first trunc 246 coefficients of the input are specified. The remaining coefficients need 247 to exist as the extra space is needed, but their value is irrelevant. 248 The value of ``trunc`` must be divisible by 2. 249 250 Although the implementation does not require it, we assume for simplicity 251 that ``trunc`` is greater than `n`. The algorithm begins by computing 252 the inverse transform of the first `n` coefficients of the input array. 253 The unspecified coefficients of the second half of the array are then 254 written: coefficient ``trunc + i`` is computed as a twist of 255 coefficient ``i`` by a root of unity. The values of these coefficients 256 are then equal to what they would have been if the inverse transform of 257 the right hand side of the input array had been computed with full data 258 from the start. The function ``ifft_truncate1`` is then called on the 259 entire right half of the input array with this auxilliary data filled in. 260 Finally a single layer of the IFFT is completed on all the coefficients 261 up to ``trunc`` being careful to note that this involves doubling the 262 coefficients from ``trunc - n`` up to ``n``. 263 264.. function:: void ifft_truncate1(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t trunc) 265 266 Computes the first ``trunc`` coefficients of the radix 2 inverse 267 transform assuming the first ``trunc`` coefficients are given and that 268 the remaining coefficients have been set to the value they would have if 269 an inverse transform had already been applied with full data. 270 271 The algorithm is the same as for ``ifft_truncate`` except that the 272 coefficients from ``trunc`` onwards after the inverse transform are 273 not inferred to be zero but the supplied values. 274 275.. function:: void fft_butterfly_sqrt2(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w, mp_limb_t * temp) 276 277 Let `w = 2k + 1`, `i = 2j + 1`. Set ``s = i1 + i2``, 278 ``t = z1^i*(i1 - i2)`` modulo ``B^limbs + 1`` where 279 ``z1^2 = exp(Pi*I/n)`` corresponds to multiplication by `2^w`. Requires 280 `0 \leq i < 2n` where `nw =` ``limbs*FLINT_BITS``. 281 282 Here ``z1`` corresponds to multiplication by `2^k` then multiplication 283 by\\ ``(2^(3nw/4) - 2^(nw/4))``. We see ``z1^i`` corresponds to 284 multiplication by ``(2^(3nw/4) - 2^(nw/4))*2^(j+ik)``. 285 286 We first multiply by ``2^(j + ik + wn/4)`` then multiply by an 287 additional ``2^(nw/2)`` and subtract. 288 289.. function:: void ifft_butterfly_sqrt2(mp_limb_t * s, mp_limb_t * t, mp_limb_t * i1, mp_limb_t * i2, mp_size_t i, mp_size_t limbs, flint_bitcnt_t w, mp_limb_t * temp) 290 291 Let `w = 2k + 1`, `i = 2j + 1`. Set ``s = i1 + z1^i*i2``, 292 ``t = i1 - z1^i*i2`` modulo ``B^limbs + 1`` where 293 ``z1^2 = exp(-Pi*I/n)`` corresponds to division by `2^w`. Requires 294 `0 \leq i < 2n` where `nw =` ``limbs*FLINT_BITS``. 295 296 Here ``z1`` corresponds to division by `2^k` then division by 297 ``(2^(3nw/4) - 2^(nw/4))``. We see ``z1^i`` corresponds to division 298 by ``(2^(3nw/4) - 2^(nw/4))*2^(j+ik)`` which is the same as division 299 by ``2^(j+ik + 1)`` then multiplication by 300 ``(2^(3nw/4) - 2^(nw/4))``. 301 302 Of course, division by ``2^(j+ik + 1)`` is the same as multiplication 303 by ``2^(2*wn - j - ik - 1)``. The exponent is positive as 304 `i \leq 2*n`, `j < n`, `k < w/2`. 305 306 We first multiply by ``2^(2*wn - j - ik - 1 + wn/4)`` then multiply by 307 an additional ``2^(nw/2)`` and subtract. 308 309.. function:: void fft_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t trunc) 310 311 As per ``fft_truncate`` except that the transform is twice the usual 312 length, i.e. length `4n` rather than `2n`. This is achieved by making use 313 of twiddles by powers of a square root of 2, not powers of 2 in the first 314 layer of the transform. 315 316 We require `nw` to be at least 64 and the three temporary space pointers 317 to point to blocks of size ``n*w + FLINT_BITS`` bits. 318 319.. function:: void ifft_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t trunc) 320 321 As per ``ifft_truncate`` except that the transform is twice the usual 322 length, i.e. length `4n` instead of `2n`. This is achieved by making use 323 of twiddles by powers of a square root of 2, not powers of 2 in the final 324 layer of the transform. 325 326 We require `nw` to be at least 64 and the three temporary space pointers 327 to point to blocks of size ``n*w + FLINT_BITS`` bits. 328 329 330Matrix Fourier Transforms 331-------------------------------------------------------------------------------- 332 333 334.. function:: void fft_butterfly_twiddle(mp_limb_t * u, mp_limb_t * v, mp_limb_t * s, mp_limb_t * t, mp_size_t limbs, flint_bitcnt_t b1, flint_bitcnt_t b2) 335 336 Set ``u = 2^b1*(s + t)``, ``v = 2^b2*(s - t)`` modulo 337 ``B^limbs + 1``. This is used to compute 338 ``u = 2^(ws*tw1)*(s + t)``,\\ ``v = 2^(w+ws*tw2)*(s - t)`` in the 339 matrix fourier algorithm, i.e. effectively computing an ordinary butterfly 340 with additional twiddles by ``z1^rc`` for row `r` and column `c` of the 341 matrix of coefficients. Aliasing is not allowed. 342 343.. function:: void ifft_butterfly_twiddle(mp_limb_t * u, mp_limb_t * v, mp_limb_t * s, mp_limb_t * t, mp_size_t limbs, flint_bitcnt_t b1, flint_bitcnt_t b2) 344 345 Set ``u = s/2^b1 + t/2^b1)``, ``v = s/2^b1 - t/2^b1`` modulo 346 ``B^limbs + 1``. This is used to compute 347 ``u = 2^(-ws*tw1)*s + 2^(-ws*tw2)*t)``,\\ 348 ``v = 2^(-ws*tw1)*s + 2^(-ws*tw2)*t)`` in the matrix fourier algorithm, 349 i.e. effectively computing an ordinary butterfly with additional twiddles 350 by ``z1^(-rc)`` for row `r` and column `c` of the matrix of 351 coefficients. Aliasing is not allowed. 352 353.. function:: void fft_radix2_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs) 354 355 As for ``fft_radix2`` except that the coefficients are spaced by 356 ``is`` in the array ``ii`` and an additional twist by ``z^c*i`` 357 is applied to each coefficient where `i` starts at `r` and increases by 358 ``rs`` as one moves from one coefficient to the next. Here ``z`` 359 corresponds to multiplication by ``2^ws``. 360 361.. function:: void ifft_radix2_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs) 362 363 As for ``ifft_radix2`` except that the coefficients are spaced by 364 ``is`` in the array ``ii`` and an additional twist by 365 ``z^(-c*i)`` is applied to each coefficient where `i` starts at `r` 366 and increases by ``rs`` as one moves from one coefficient to the next. 367 Here ``z`` corresponds to multiplication by ``2^ws``. 368 369.. function:: void fft_truncate1_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs, mp_size_t trunc) 370 371 As per ``fft_radix2_twiddle`` except that the transform is truncated 372 as per\\ ``fft_truncate1``. 373 374.. function:: void ifft_truncate1_twiddle(mp_limb_t ** ii, mp_size_t is, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs, mp_size_t trunc) 375 376 As per ``ifft_radix2_twiddle`` except that the transform is truncated 377 as per\\ ``ifft_truncate1``. 378 379.. function:: void fft_mfa_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc) 380 381 This is as per the ``fft_truncate_sqrt2`` function except that the 382 matrix fourier algorithm is used for the left and right FFTs. The total 383 transform length is `4n` where ``n = 2^depth`` so that the left and 384 right transforms are both length `2n`. We require ``trunc > 2*n`` and 385 that ``trunc`` is divisible by ``2*n1`` (explained below). 386 387 The matrix fourier algorithm, which is applied to each transform of length 388 `2n`, works as follows. We set ``n1`` to a power of 2 about the square 389 root of `n`. The data is then thought of as a set of ``n2`` rows each 390 with ``n1`` columns (so that ``n1*n2 = 2n``). 391 392 The length `2n` transform is then computed using a whole pile of short 393 transforms. These comprise ``n1`` column transforms of length ``n2`` 394 followed by some twiddles by roots of unity (namely ``z^rc`` where `r` 395 is the row and `c` the column within the data) followed by ``n2`` 396 row transforms of length ``n1``. Along the way the data needs to be 397 rearranged due to the fact that the short transforms output the data in 398 binary reversed order compared with what is needed. 399 400 The matrix fourier algorithm provides better cache locality by decomposing 401 the long length `2n` transforms into many transforms of about the square 402 root of the original length. 403 404 For better cache locality the sqrt2 layer of the full length `4n` 405 transform is folded in with the column FFTs performed as part of the first 406 matrix fourier algorithm on the left half of the data. 407 408 The second half of the data requires a truncated version of the matrix 409 fourier algorithm. This is achieved by truncating to an exact multiple of 410 the row length so that the row transforms are full length. Moreover, the 411 column transforms will then be truncated transforms and their truncated 412 length needs to be a multiple of 2. This explains the condition on 413 ``trunc`` given above. 414 415 To improve performance, the extra twiddles by roots of unity are combined 416 with the butterflies performed at the last layer of the column transforms. 417 418 We require `nw` to be at least 64 and the three temporary space pointers 419 to point to blocks of size ``n*w + FLINT_BITS`` bits. 420 421.. function:: void ifft_mfa_truncate_sqrt2(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc) 422 423 This is as per the ``ifft_truncate_sqrt2`` function except that the 424 matrix fourier algorithm is used for the left and right IFFTs. The total 425 transform length is `4n` where ``n = 2^depth`` so that the left and 426 right transforms are both length `2n`. We require ``trunc > 2*n`` and 427 that ``trunc`` is divisible by ``2*n1``. 428 429 We set ``n1`` to a power of 2 about the square root of `n`. 430 431 As per the matrix fourier FFT the sqrt2 layer is folded into the the 432 final column IFFTs for better cache locality and the extra twiddles that 433 occur in the matrix fourier algorithm are combined with the butterflied 434 performed at the first layer of the final column transforms. 435 436 We require `nw` to be at least 64 and the three temporary space pointers 437 to point to blocks of size ``n*w + FLINT_BITS`` bits. 438 439.. function:: void fft_mfa_truncate_sqrt2_outer(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc) 440 441 Just the outer layers of ``fft_mfa_truncate_sqrt2``. 442 443.. function:: void fft_mfa_truncate_sqrt2_inner(mp_limb_t ** ii, mp_limb_t ** jj, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc, mp_limb_t * tt) 444 445 The inner layers of ``fft_mfa_truncate_sqrt2`` and 446 ``ifft_mfa_truncate_sqrt2`` combined with pointwise mults. 447 448.. function:: void ifft_mfa_truncate_sqrt2_outer(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp, mp_size_t n1, mp_size_t trunc) 449 450 The outer layers of ``ifft_mfa_truncate_sqrt2`` combined with 451 normalisation. 452 453 454Negacyclic multiplication 455-------------------------------------------------------------------------------- 456 457 458.. function:: void fft_negacyclic(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp) 459 460 As per ``fft_radix2`` except that it performs a sqrt2 negacyclic 461 transform of length `2n`. This is the same as the radix 2 transform 462 except that the `i`-th coefficient of the input is first multiplied by 463 `\sqrt{2}^{iw}`. 464 465 We require `nw` to be at least 64 and the two temporary space pointers to 466 point to blocks of size ``n*w + FLINT_BITS`` bits. 467 468.. function:: void ifft_negacyclic(mp_limb_t ** ii, mp_size_t n, flint_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** temp) 469 470 As per ``ifft_radix2`` except that it performs a sqrt2 negacyclic 471 inverse transform of length `2n`. This is the same as the radix 2 inverse 472 transform except that the `i`-th coefficient of the output is finally 473 divided by `\sqrt{2}^{iw}`. 474 475 We require `nw` to be at least 64 and the two temporary space pointers to 476 point to blocks of size ``n*w + FLINT_BITS`` bits. 477 478.. function:: void fft_naive_convolution_1(mp_limb_t * r, mp_limb_t * ii, mp_limb_t * jj, mp_size_t m) 479 480 Performs a naive negacyclic convolution of ``ii`` with ``jj``, 481 both of length `m` and sets `r` to the result. This is essentially 482 multiplication of polynomials modulo `x^m + 1`. 483 484.. function:: void _fft_mulmod_2expp1(mp_limb_t * r1, mp_limb_t * i1, mp_limb_t * i2, mp_size_t r_limbs, flint_bitcnt_t depth, flint_bitcnt_t w) 485 486 Multiply ``i1`` by ``i2`` modulo ``B^r_limbs + 1`` where 487 ``r_limbs = nw/FLINT_BITS`` with ``n = 2^depth``. Uses the 488 negacyclic FFT convolution CRT'd with a 1 limb naive convolution. We 489 require that ``depth`` and ``w`` have been selected as per the 490 wrapper ``fft_mulmod_2expp1`` below. 491 492.. function:: slong fft_adjust_limbs(mp_size_t limbs) 493 494 Given a number of limbs, returns a new number of limbs (no more than 495 the next power of 2) which will work with the Nussbaumer code. It is only 496 necessary to make this adjustment if 497 ``limbs > FFT_MULMOD_2EXPP1_CUTOFF``. 498 499.. function:: void fft_mulmod_2expp1(mp_limb_t * r, mp_limb_t * i1, mp_limb_t * i2, mp_size_t n, mp_size_t w, mp_limb_t * tt) 500 501 As per ``_fft_mulmod_2expp1`` but with a tuned cutoff below which more 502 classical methods are used for the convolution. The temporary space is 503 required to fit ``n*w + FLINT_BITS`` bits. There are no restrictions 504 on `n`, but if ``limbs = n*w/FLINT_BITS`` then if ``limbs`` exceeds 505 ``FFT_MULMOD_2EXPP1_CUTOFF`` the function ``fft_adjust_limbs`` must 506 be called to increase the number of limbs to an appropriate value. 507 508 509Integer multiplication 510-------------------------------------------------------------------------------- 511 512 513.. function:: void mul_truncate_sqrt2(mp_ptr r1, mp_srcptr i1, mp_size_t n1, mp_srcptr i2, mp_size_t n2, flint_bitcnt_t depth, flint_bitcnt_t w) 514 515 Integer multiplication using the radix 2 truncated sqrt2 transforms. 516 517 Set ``(r1, n1 + n2)`` to the product of ``(i1, n1)`` by 518 ``(i2, n2)``. This is achieved through an FFT convolution of length at 519 most ``2^(depth + 2)`` with coefficients of size `nw` bits where 520 ``n = 2^depth``. We require ``depth >= 6``. The input data is 521 broken into chunks of data not exceeding ``(nw - (depth + 1))/2`` 522 bits. If breaking the first integer into chunks of this size results in 523 ``j1`` coefficients and breaking the second integer results in 524 ``j2`` chunks then ``j1 + j2 - 1 <= 2^(depth + 2)``. 525 526 If ``n = 2^depth`` then we require `nw` to be at least 64. 527 528.. function:: void mul_mfa_truncate_sqrt2(mp_ptr r1, mp_srcptr i1, mp_size_t n1, mp_srcptr i2, mp_size_t n2, flint_bitcnt_t depth, flint_bitcnt_t w) 529 530 As for ``mul_truncate_sqrt2`` except that the cache friendly matrix 531 fourier algorithm is used. 532 533 If ``n = 2^depth`` then we require `nw` to be at least 64. Here we 534 also require `w` to be `2^i` for some `i \geq 0`. 535 536.. function:: void flint_mpn_mul_fft_main(mp_ptr r1, mp_srcptr i1, mp_size_t n1, mp_srcptr i2, mp_size_t n2) 537 538 The main integer multiplication routine. Sets ``(r1, n1 + n2)`` to 539 ``(i1, n1)`` times ``(i2, n2)``. We require ``n1 >= n2 > 0``. 540 541 542Convolution 543-------------------------------------------------------------------------------- 544 545 546.. function:: void fft_convolution(mp_limb_t ** ii, mp_limb_t ** jj, slong depth, slong limbs, slong trunc, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** s1, mp_limb_t * tt) 547 548 Perform an FFT convolution of ``ii`` with ``jj``, both of length 549 ``4*n`` where ``n = 2^depth``. Assume that all but the first 550 ``trunc`` coefficients of the output (placed in ``ii``) are zero. 551 Each coefficient is taken modulo ``B^limbs + 1``. The temporary 552 spaces ``t1``, ``t2`` and ``s1`` must have ``limbs + 1`` 553 limbs of space and ``tt`` must have ``2*(limbs + 1)`` of free 554 space. 555 556FFT Precaching 557------------------------------------------------------------------------------- 558 559 560.. function:: void fft_precache(mp_limb_t ** jj, slong depth, slong limbs, slong trunc, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** s1) 561 562 Precompute the FFT of ``jj`` for use with precache functions. The 563 parameters are as for ``fft_convolution``. 564 565.. function:: void fft_convolution_precache(mp_limb_t ** ii, mp_limb_t ** jj, slong depth, slong limbs, slong trunc, mp_limb_t ** t1, mp_limb_t ** t2, mp_limb_t ** s1, mp_limb_t ** tt) 566 567 As per ``fft_convolution`` except that it is assumed ``fft_precache`` has 568 been called on ``jj`` with the same parameters. This will then run faster 569 than if ``fft_convolution`` had been run with the original ``jj``. 570