@node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
@chapter Distributed-memory FFTW with MPI
@cindex MPI

@cindex parallel transform
In this chapter we document the parallel FFTW routines for parallel
systems supporting the MPI message-passing interface.  Unlike the
shared-memory threads described in the previous chapter, MPI allows
you to use @emph{distributed-memory} parallelism, where each CPU has
its own separate memory, and which can scale up to clusters of many
thousands of processors.  This capability comes at a price, however:
each process only stores a @emph{portion} of the data to be
transformed, which means that the data structures and programming
interface are quite different from the serial or threads versions of
FFTW.
@cindex data distribution


Distributed-memory parallelism is especially useful when you are
transforming arrays so large that they do not fit into the memory of a
single processor.  The storage per process required by FFTW's MPI
routines is proportional to the total array size divided by the number
of processes.  Conversely, distributed-memory parallelism can easily
pose an unacceptably high communications overhead for small problems;
the threshold problem size above which parallelism becomes
advantageous will depend on the precise problem you are interested in,
your hardware, and your MPI implementation.

A note on terminology: in MPI, you divide the data among a set of
``processes'' which each run in their own memory address space.
Generally, each process runs on a different physical processor, but
this is not required.  A set of processes in MPI is described by an
opaque data structure called a ``communicator,'' the most common of
which is the predefined communicator @code{MPI_COMM_WORLD} which
refers to @emph{all} processes.  For more information on these and
other concepts common to all MPI programs, we refer the reader to the
documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home
page}.
@cindex MPI communicator
@ctindex MPI_COMM_WORLD


We assume in this chapter that the reader is familiar with the usage
of the serial (uniprocessor) FFTW, and focus only on the concepts new
to the MPI interface.

@menu
* FFTW MPI Installation::
* Linking and Initializing MPI FFTW::
* 2d MPI example::
* MPI Data Distribution::
* Multi-dimensional MPI DFTs of Real Data::
* Other Multi-dimensional Real-data MPI Transforms::
* FFTW MPI Transposes::
* FFTW MPI Wisdom::
* Avoiding MPI Deadlocks::
* FFTW MPI Performance Tips::
* Combining MPI and Threads::
* FFTW MPI Reference::
* FFTW MPI Fortran Interface::
@end menu

@c ------------------------------------------------------------
@node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
@section FFTW MPI Installation

All of the FFTW MPI code is located in the @code{mpi} subdirectory of
the FFTW package.  On Unix systems, the FFTW MPI libraries and header
files are automatically configured, compiled, and installed along with
the uniprocessor FFTW libraries simply by including
@code{--enable-mpi} in the flags to the @code{configure} script
(@pxref{Installation on Unix}).
@fpindex configure


Any implementation of the MPI standard, version 1 or later, should
work with FFTW.
The @code{configure} script will attempt to
automatically detect how to compile and link code using your MPI
implementation.  In some cases, especially if you have multiple
different MPI implementations installed or have an unusual MPI
software package, you may need to provide this information explicitly.

Most commonly, one compiles MPI code by invoking a special compiler
command, typically @code{mpicc} for C code.  The @code{configure}
script knows the most common names for this command, but you can
specify the MPI compilation command explicitly by setting the
@code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
@fpindex mpicc


If, instead of a special compiler command, you need to link a certain
library, you can specify the link command via the @code{MPILIBS}
variable, as in @samp{./configure MPILIBS=-lmpi ...}.  Note that if
your MPI library is installed in a non-standard location (one the
compiler does not know about by default), you may also have to specify
the location of the library and header files via the @code{LDFLAGS}
and @code{CPPFLAGS} variables, respectively, as in @samp{./configure
LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.

@c ------------------------------------------------------------
@node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
@section Linking and Initializing MPI FFTW

Programs using the MPI FFTW routines should be linked with
@code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
@code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
(@pxref{Precision}).  You will also need to link with whatever library
is responsible for MPI on your system; in most MPI implementations,
there is a special compiler alias named @code{mpicc} to compile and
link MPI code.
@fpindex mpicc
@cindex linking on Unix
@cindex precision


@findex fftw_init_threads
Before calling any FFTW routines except possibly
@code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
@code{MPI_Init}, you should call the function:

@example
void fftw_mpi_init(void);
@end example
@findex fftw_mpi_init

If, at the end of your program, you want to get rid of all memory and
other resources allocated internally by FFTW, for both the serial and
MPI routines, you can call:

@example
void fftw_mpi_cleanup(void);
@end example
@findex fftw_mpi_cleanup

which is much like the @code{fftw_cleanup()} function except that it
also gets rid of FFTW's MPI-related data.  You must @emph{not} execute
any previously created plans after calling this function.

@c ------------------------------------------------------------
@node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
@section 2d MPI example

Before we document the FFTW MPI interface in detail, we begin with a
simple example outlining how one would perform a two-dimensional
@code{N0} by @code{N1} complex DFT.

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t N0 = ..., N1 = ...;
    fftw_plan plan;
    fftw_complex *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local);

    /* @r{create plan for in-place forward DFT} */
    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);

    /* @r{initialize data to some function} my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
       data[i*N1 + j] = my_function(local_0_start + i, j);

    /* @r{compute transforms, in-place, as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example

As can be seen above, the MPI interface follows the same basic style
of allocate/plan/execute/destroy as the serial FFTW routines.  All of
the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
of @samp{fftw_}.  There are a few important differences, however:

First, we must call @code{fftw_mpi_init()} after calling
@code{MPI_Init} (required in all MPI programs) and before calling any
other @samp{fftw_mpi_} routine.
@findex MPI_Init
@findex fftw_mpi_init


Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
the communicator, indicating which processes will participate in the
transform (here @code{MPI_COMM_WORLD}, indicating all processes).
Whenever you create, execute, or destroy a plan for an MPI transform,
you must call the corresponding FFTW routine on @emph{all} processes
in the communicator for that transform.  (That is, these are
@emph{collective} calls.)  Note that the plan for the MPI transform
uses the standard @code{fftw_execute} and @code{fftw_destroy_plan}
routines (on the other hand, there are MPI-specific new-array execute
functions documented below).
@cindex collective function
@findex fftw_mpi_plan_dft_2d
@ctindex MPI_COMM_WORLD


Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
instead of @code{int} as for the serial FFTW.  @code{ptrdiff_t} is a
standard C integer type which is (at least) 32 bits wide on a 32-bit
machine and 64 bits wide on a 64-bit machine.  This is to make it easy
to specify very large parallel transforms on a 64-bit machine.  (You
can specify 64-bit transform sizes in the serial FFTW, too, but only
by using the @samp{guru64} planner interface.  @xref{64-bit Guru
Interface}.)
@tindex ptrdiff_t
@cindex 64-bit architecture


Fourth, and most importantly, you don't allocate the entire
two-dimensional array on each process.  Instead, you call
@code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
array resides on each processor, and how much space to allocate.
Here, the portion of the array on each process is a @code{local_n0} by
@code{N1} slice of the total array, starting at index
@code{local_0_start}.
The total number of @code{fftw_complex} numbers
to allocate is given by the @code{alloc_local} return value, which
@emph{may} be greater than @code{local_n0 * N1} (in case some
intermediate calculations require additional storage).  The data
distribution in FFTW's MPI interface is described in more detail in
the next section.
@findex fftw_mpi_local_size_2d
@cindex data distribution


Given the portion of the array that resides on the local process, it
is straightforward to initialize the data (here to a function
@code{my_function}) and otherwise manipulate it.  Of course, at the
end of the program you may want to output the data somehow, but
synchronizing this output is up to you and is beyond the scope of this
manual.  (One good way to output a large multi-dimensional distributed
array in MPI to a portable binary file is to use the free HDF5
library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
@cindex HDF5
@cindex MPI I/O

@c ------------------------------------------------------------
@node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
@section MPI Data Distribution
@cindex data distribution

The most important concept to understand in using FFTW's MPI interface
is the data distribution.  With a serial or multithreaded FFT, all of
the inputs and outputs are stored as a single contiguous chunk of
memory.  With a distributed-memory FFT, the inputs and outputs are
broken into disjoint blocks, one per process.

In particular, FFTW uses a @emph{1d block distribution} of the data,
distributed along the @emph{first dimension}.  For example, if you
want to perform a @twodims{100,200} complex DFT, distributed over 4
processes, each process will get a @twodims{25,200} slice of the data.
That is, process 0 will get rows 0 through 24, process 1 will get rows
25 through 49, process 2 will get rows 50 through 74, and process 3
will get rows 75 through 99.  If you take the same array but
distribute it over 3 processes, then it is not evenly divisible, so the
different processes will have unequal chunks.  FFTW's default choice
in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
process 2.
@cindex block distribution


FFTW provides several @samp{fftw_mpi_local_size} routines that you can
call to find out what portion of an array is stored on the current
process.  In most cases, you should use the default block sizes picked
by FFTW, but it is also possible to specify your own block size.  For
example, with a @twodims{100,200} array on three processes, you can
tell FFTW to use a block size of 40, which would assign 40 rows to
processes 0 and 1, and 20 rows to process 2.  FFTW's default is to
divide the data equally among the processes if possible, and as best
it can otherwise.  The rows are always assigned in ``rank order,''
i.e. process 0 gets the first block of rows, then process 1, and so
on.  (You can change this by using @code{MPI_Comm_split} to create a
new communicator with re-ordered processes.)  However, you should
always call the @samp{fftw_mpi_local_size} routines, if possible,
rather than trying to predict FFTW's distribution choices.

In particular, it is critical that you allocate the storage size that
is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
necessarily the size of the local slice of the array.
The reason is
that intermediate steps of FFTW's algorithms involve transposing the
array and redistributing the data, so at these intermediate steps FFTW
may require more local storage space (albeit always proportional to
the total size divided by the number of processes).  The
@samp{fftw_mpi_local_size} functions know how much storage is required
for these intermediate steps and tell you the correct amount to
allocate.

@menu
* Basic and advanced distribution interfaces::
* Load balancing::
* Transposed distributions::
* One-dimensional distributions::
@end menu

@node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
@subsection Basic and advanced distribution interfaces

As with the planner interface, the @samp{fftw_mpi_local_size}
distribution interface is broken into basic and advanced
(@samp{_many}) interfaces, where the latter allows you to specify the
block size manually and also to request block sizes when computing
multiple transforms simultaneously.  These functions are documented
more exhaustively by the FFTW MPI Reference, but we summarize the
basic ideas here using a couple of two-dimensional examples.

For the @twodims{100,200} complex-DFT example, above, we would find
the distribution by calling the following function in the basic
interface:

@example
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_2d

Given the total size of the data to be transformed (here, @code{n0 =
100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
function provides three numbers.

First, it describes the shape of the local data: the current process
should store a @code{local_n0} by @code{n1} slice of the overall
dataset, in row-major order (@code{n1} dimension contiguous), starting
at index @code{local_0_start}.  That is, if the total dataset is
viewed as a @code{n0} by @code{n1} matrix, the current process should
store the rows @code{local_0_start} to
@code{local_0_start+local_n0-1}.  Obviously, if you are running with
only a single MPI process, that process will store the entire array:
@code{local_0_start} will be zero and @code{local_n0} will be
@code{n0}.  @xref{Row-major Format}.
@cindex row-major


Second, the return value is the total number of data elements (e.g.,
complex numbers for a complex DFT) that should be allocated for the
input and output arrays on the current process (ideally with
@code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
alignment).  It might seem that this should always be equal to
@code{local_n0 * n1}, but this is @emph{not} the case.  FFTW's
distributed FFT algorithms require data redistributions at
intermediate stages of the transform, and in some circumstances this
may require slightly larger local storage.  This is discussed in more
detail below, under @ref{Load balancing}.
@findex fftw_malloc
@findex fftw_alloc_complex


@cindex advanced interface
The advanced-interface @samp{local_size} function for multidimensional
transforms returns the same three things (@code{local_n0},
@code{local_0_start}, and the total number of elements to allocate),
but takes more inputs:

@example
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
                                   ptrdiff_t howmany,
                                   ptrdiff_t block0,
                                   MPI_Comm comm,
                                   ptrdiff_t *local_n0,
                                   ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_many

The two-dimensional case above corresponds to @code{rnk = 2} and an
array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
This routine is for any @code{rnk > 1}; one-dimensional transforms
have their own interface because they work slightly differently, as
discussed below.

First, the advanced interface allows you to perform multiple
transforms at once, of interleaved data, as specified by the
@code{howmany} parameter.  (@code{howmany} is 1 for a single
transform.)

Second, here you can specify your desired block size in the @code{n0}
dimension, @code{block0}.  To use FFTW's default block size, pass
@code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}.  Otherwise, on
@code{P} processes, FFTW will return @code{local_n0} equal to
@code{block0} on the first @code{n0 / block0} processes (rounded
down), return @code{local_n0} equal to @code{n0 - block0 * (n0 /
block0)} on the next process, and @code{local_n0} equal to zero on any
remaining processes.  In general, we recommend using the default block
size (which corresponds to @code{n0 / P}, rounded up).
@ctindex FFTW_MPI_DEFAULT_BLOCK
@cindex block distribution


For example, suppose you have @code{P = 4} processes and @code{n0 =
21}.  The default will be a block size of @code{6}, which will give
@code{local_n0 = 6} on the first three processes and @code{local_n0 =
3} on the last process.  Instead, however, you could specify
@code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
on processes 0 to 2, @code{local_n0 = 6} on process 3.  (This choice,
while it may look superficially more ``balanced,'' has the same
critical path as FFTW's default but requires more communications.)
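
To make this concrete, here is a minimal sketch of how one might query
the advanced interface for the @code{n0 = 21} example above, using the
explicit block size of 5 (the second dimension @code{n1 = 32} is a
hypothetical value, and error checking is omitted):

@example
const ptrdiff_t n[2] = @{21, 32@};
ptrdiff_t local_n0, local_0_start, alloc_local;
fftw_complex *data;

/* @r{one transform (howmany = 1) with an explicit block size of 5} */
alloc_local = fftw_mpi_local_size_many(2, n, 1, 5, MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);
@end example

Passing @code{FFTW_MPI_DEFAULT_BLOCK} instead of @code{5} would
recover the default distribution.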

@node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
@subsection Load balancing
@cindex load balancing

Ideally, when you parallelize a transform over some @math{P}
processes, each process should end up with work that takes equal time.
Otherwise, all of the processes end up waiting on whichever process is
slowest.  This goal is known as ``load balancing.''  In this section,
we describe the circumstances under which FFTW is able to load-balance
well, and in particular how you should choose your transform size in
order to load balance.

Load balancing is especially difficult when you are parallelizing over
heterogeneous machines; for example, if one of your processors is an
old 486 and another is a Pentium IV, obviously you should give the
Pentium more work to do than the 486 since the latter is much slower.
FFTW does not deal with this problem, however---it assumes that your
processes run on hardware of comparable speed, and that the goal is
therefore to divide the problem as equally as possible.

For a multi-dimensional complex DFT, FFTW can divide the problem
equally among the processes if: (i) the @emph{first} dimension
@code{n0} is divisible by @math{P}; and (ii), the @emph{product} of
the subsequent dimensions is divisible by @math{P}.  (For the advanced
interface, where you can specify multiple simultaneous transforms via
some ``vector'' length @code{howmany}, a factor of @code{howmany} is
included in the product of the subsequent dimensions.)

For a one-dimensional complex DFT, the length @code{N} of the data
should be divisible by @math{P} @emph{squared} to be able to divide
the problem equally among the processes.

@node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
@subsection Transposed distributions

Internally, FFTW's MPI transform algorithms work by first computing
transforms of the data local to each process, then by globally
@emph{transposing} the data in some fashion to redistribute the data
among the processes, transforming the new data local to each process,
and transposing back.  For example, a two-dimensional @code{n0} by
@code{n1} array, distributed across the @code{n0} dimension, is
transformed by: (i) transforming the @code{n1} dimension, which is
local to each process; (ii) transposing to an @code{n1} by @code{n0}
array, distributed across the @code{n1} dimension; (iii) transforming
the @code{n0} dimension, which is now local to each process; (iv)
transposing back.
@cindex transpose


However, in many applications it is acceptable to compute a
multidimensional DFT whose results are produced in transposed order
(e.g., @code{n1} by @code{n0} in two dimensions).  This provides a
significant performance advantage, because it means that the final
transposition step can be omitted.  FFTW supports this optimization,
which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
to the planner routines.  To compute the inverse transform of
transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
it that the input is transposed.  In this section, we explain how to
interpret the output format of such a transform.
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN


Suppose you are transforming multi-dimensional data with (at least
two) dimensions @ndims{}.  As always, it is distributed along the
first dimension @dimk{0}.  Now, if we compute its DFT with the
@code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
with the first @emph{two} dimensions transposed: @ndimstrans{},
distributed along the @dimk{1} dimension.  Conversely, if we take the
@ndimstrans{} data and transform it with the
@code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
original @ndims{} array.

There are two ways to find the portion of the transposed array that
resides on the current process.  First, you can simply call the
appropriate @samp{local_size} function, passing @ndimstrans{} (the
transposed dimensions).  This would mean calling the @samp{local_size}
function twice, once for the transposed and once for the
non-transposed dimensions.  Alternatively, you can call one of the
@samp{local_size_transposed} functions, which returns both the
non-transposed and transposed data distribution from a single call.
For example, for a 3d transform with transposed output (or input), you
might call:

@example
ptrdiff_t fftw_mpi_local_size_3d_transposed(
                ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
                ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_3d_transposed

Here, @code{local_n0} and @code{local_0_start} give the size and
starting index of the @code{n0} dimension for the
@emph{non}-transposed data, as in the previous sections.  For
@emph{transposed} data (e.g. the output for
@code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
@code{local_1_start} give the size and starting index of the @code{n1}
dimension, which is the first dimension of the transposed data
(@code{n1} by @code{n0} by @code{n2}).

(Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
dimensions to the planner in reverse order, or vice versa.  If you
pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
@code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
first two dimensions passed to the planner and passing @emph{neither}
flag.)

@node One-dimensional distributions, , Transposed distributions, MPI Data Distribution
@subsection One-dimensional distributions

For one-dimensional distributed DFTs using FFTW, matters are slightly
more complicated because the data distribution is more closely tied to
how the algorithm works.  In particular, you can no longer pass an
arbitrary block size and must accept FFTW's default; also, the block
sizes may be different for input and output.  Also, the data
distribution depends on the flags and transform direction, in order
for forward and backward transforms to work correctly.

@example
ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
                int sign, unsigned flags,
                ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
                ptrdiff_t *local_no, ptrdiff_t *local_o_start);
@end example
@findex fftw_mpi_local_size_1d

This function computes the data distribution for a 1d transform of
size @code{n0} with the given transform @code{sign} and @code{flags}.
Both input and output data use block distributions.  The input on the
current process will consist of @code{local_ni} numbers starting at
index @code{local_i_start}; e.g. if only a single process is used,
then @code{local_ni} will be @code{n0} and @code{local_i_start} will
be @code{0}.  Similarly for the output, with @code{local_no} numbers
starting at index @code{local_o_start}.  The return value of
@code{fftw_mpi_local_size_1d} will be the total number of elements to
allocate on the current process (which might be slightly larger than
the local size due to intermediate steps in the algorithm).

As mentioned above (@pxref{Load balancing}), the data will be divided
equally among the processes if @code{n0} is divisible by the
@emph{square} of the number of processes.  In this case,
@code{local_ni} will equal @code{local_no}.  Otherwise, they may be
different.

For some applications, such as convolutions, the order of the output
data is irrelevant.  In this case, performance can be improved by
specifying that the output data be stored in an FFTW-defined
``scrambled'' format.  (In particular, this is the analogue of
transposed output in the multidimensional case: scrambled output saves
a communications step.)  If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
the flags, then the output is stored in this (undocumented) scrambled
order.  Conversely, to perform the inverse transform of data in
scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
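
As an illustration, a convolution-style sketch might create the
forward and backward plans as follows (assuming @code{in} and
@code{out} were allocated according to @code{fftw_mpi_local_size_1d}
with matching sign and flags; the names here are hypothetical):

@example
/* @r{forward transform, leaving the output in scrambled order} */
fplan = fftw_mpi_plan_dft_1d(N, in, out, MPI_COMM_WORLD, FFTW_FORWARD,
                             FFTW_MEASURE | FFTW_MPI_SCRAMBLED_OUT);
/* @r{inverse transform, consuming data in scrambled order} */
bplan = fftw_mpi_plan_dft_1d(N, out, in, MPI_COMM_WORLD, FFTW_BACKWARD,
                             FFTW_MEASURE | FFTW_MPI_SCRAMBLED_IN);
@end example

Because the pointwise multiplication at the heart of a convolution
does not care about the order of the elements, the scrambled
intermediate order never needs to be undone.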

In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
have not yet implemented a parallel algorithm for large prime sizes.

@c ------------------------------------------------------------
@node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
@section Multi-dimensional MPI DFTs of Real Data

FFTW's MPI interface also supports multi-dimensional DFTs of real
data, similar to the serial r2c and c2r interfaces.  (Parallel
one-dimensional real-data DFTs are not currently supported; you must
use a complex transform and set the imaginary parts of the inputs to
zero.)

The key points to understand for r2c and c2r MPI transforms (compared
to the MPI complex DFTs or the serial r2c/c2r transforms) are:

@itemize @bullet

@item
Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
data to/from @ndimshalf{} complex data: the last dimension of the
complex data is cut in half (rounded down), plus one.  As for the
serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
@samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.

@item
@cindex padding
Although the real data is @emph{conceptually} @ndims{}, it is
@emph{physically} stored as an @ndimspad{} array, where the last
dimension has been @emph{padded} to make it the same size as the
complex output.  This is much like the in-place serial r2c/c2r
interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
in MPI the padding is required even for out-of-place data.  The extra
padding numbers are ignored by FFTW (they are @emph{not} like
zero-padding the transform to a larger size); they are only used to
determine the data layout.

@item
@cindex data distribution
The data distribution in MPI for @emph{both} the real and complex data
is determined by the shape of the @emph{complex} data.  That is, you
call the appropriate @samp{local size} function for the @ndimshalf{}
complex data, and then use the @emph{same} distribution for the real
data except that the last complex dimension is replaced by a (padded)
real dimension of twice the length.

@end itemize

For example, suppose we are performing an out-of-place r2c transform of
@threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
resulting in @threedims{L,M,N/2+1} complex data.
Similar to the
example in @ref{2d MPI example}, we might do something like:

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t L = ..., M = ..., N = ...;
    fftw_plan plan;
    double *rin;
    fftw_complex *cout;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    rin = fftw_alloc_real(2 * alloc_local);
    cout = fftw_alloc_complex(alloc_local);

    /* @r{create plan for out-of-place r2c DFT} */
    plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                    FFTW_MEASURE);

    /* @r{initialize rin to some function} my_func(x,y,z) */
    for (i = 0; i < local_n0; ++i)
       for (j = 0; j < M; ++j)
         for (k = 0; k < N; ++k)
       rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);

    /* @r{compute transforms as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example

@findex fftw_alloc_real
@cindex row-major
Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
argument of @code{2 * alloc_local}: since @code{alloc_local} is the
number of @emph{complex} values to allocate, the number of @emph{real}
values is twice as many.  The @code{rin} array is then
@threedims{local_n0,M,2(N/2+1)} in row-major order, so its
@code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
k} (@pxref{Multi-dimensional Array Format}).

@cindex transpose
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
As for the complex transforms, improved performance can be obtained by
specifying that the output is the transpose of the input or vice versa
(@pxref{Transposed distributions}).  In our @threedims{L,M,N} r2c
example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
the input would be a padded @threedims{L,M,2(N/2+1)} real array
distributed over the @code{L} dimension, while the output would be an
@threedims{M,L,N/2+1} complex array distributed over the @code{M}
dimension.  To perform the inverse c2r transform with the same data
distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.

@c ------------------------------------------------------------
@node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
@section Other multi-dimensional Real-Data MPI Transforms

@cindex r2r
FFTW's MPI interface also supports multi-dimensional @samp{r2r}
transforms of all kinds supported by the serial interface
(e.g. discrete cosine and sine transforms, discrete Hartley
transforms, etc.).  Only multi-dimensional @samp{r2r} transforms, not
one-dimensional transforms, are currently parallelized.

@tindex fftw_r2r_kind
These are used much like the multidimensional complex DFTs discussed
above, except that the data is real rather than complex, and one needs
to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).

For example, one might perform a two-dimensional @twodims{L,M} transform
that is an REDFT10 (DCT-II) in the first dimension and an RODFT10
(DST-II) in the second dimension with code like:

@example
    const ptrdiff_t L = ..., M = ...;
    fftw_plan plan;
    double *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_real(alloc_local);

    /* @r{create plan for in-place REDFT10 x RODFT10} */
    plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
                                FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);

    /* @r{initialize data to some function} my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
       data[i*M + j] = my_function(local_0_start + i, j);

    /* @r{compute transforms, in-place, as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);
@end example

@findex fftw_alloc_real
Notice that we use the same @samp{local_size} functions as we did for
complex data, only now we interpret the sizes in terms of real rather
than complex values, and correspondingly use @code{fftw_alloc_real}.

@c ------------------------------------------------------------
@node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
@section FFTW MPI Transposes
@cindex transpose

FFTW's MPI Fourier transforms rely on one or more @emph{global
transposition} steps for their communications.  For example, the
multidimensional transforms work by transforming along some
dimensions, then transposing to make the first dimension local and
transforming that, then transposing back.  Because global
transposition of a block-distributed matrix has many other potential
uses besides FFTs, FFTW's transpose routines can be called directly,
as documented in this section.

@menu
* Basic distributed-transpose interface::
* Advanced distributed-transpose interface::
* An improved replacement for MPI_Alltoall::
@end menu

@node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
@subsection Basic distributed-transpose interface

In particular, suppose that we have an @code{n0} by @code{n1} array in
row-major order, block-distributed across the @code{n0} dimension.  To
transpose this into an @code{n1} by @code{n0} array block-distributed
across the @code{n1} dimension, we would create a plan by calling the
following function:

@example
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
@end example
@findex fftw_mpi_plan_transpose

The input and output arrays (@code{in} and @code{out}) can be the
same.  The transpose is actually executed by calling
@code{fftw_execute} on the plan, as usual.
@findex fftw_execute


The @code{flags} are the usual FFTW planner flags, but support
two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
@code{FFTW_MPI_TRANSPOSED_IN}.  What these flags indicate, for
transpose plans, is that the output and/or input, respectively, are
@emph{locally} transposed.  That is, on each process the input data is
normally stored as a @code{local_n0} by @code{n1} array in row-major
order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
stored as @code{n1} by @code{local_n0} in row-major order.  Similarly,
@code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
@code{local_n1} instead of @code{local_n1} by @code{n0}.
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
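
For instance, a minimal sketch of an in-place distributed transpose
(without locally-transposed input or output; @code{n0} and @code{n1}
are assumed to be defined, and the allocation uses the
@samp{local_size} variant described below) might look like:

@example
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start, alloc_local;
double *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);
/* @r{... fill the local_n0 x n1 block of data ...} */

plan = fftw_mpi_plan_transpose(n0, n1, data, data, MPI_COMM_WORLD,
                               FFTW_MEASURE);
fftw_execute(plan);  /* @r{data now holds the local_n1 x n0 block} */
@end example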

To determine the local size of the array on each process before and
after the transpose, as well as the amount of storage that must be
allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
just as for a 2d DFT as described in the previous section:
@cindex data distribution

@example
ptrdiff_t fftw_mpi_local_size_2d_transposed
                (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_2d_transposed

Again, the return value is the local storage to allocate, which in
this case is the number of @emph{real} (@code{double}) values rather
than complex numbers as in the previous examples.

@node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
@subsection Advanced distributed-transpose interface

The above routines are for a transpose of a matrix of numbers (of type
@code{double}), using FFTW's default block sizes.  More generally, one
can perform transposes of @emph{tuples} of numbers, with
user-specified block sizes for the input and output:

@example
fftw_plan fftw_mpi_plan_many_transpose
                (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                 ptrdiff_t block0, ptrdiff_t block1,
                 double *in, double *out, MPI_Comm comm, unsigned flags);
@end example
@findex fftw_mpi_plan_many_transpose

In this case, one is transposing an @code{n0} by @code{n1} matrix of
@code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
The input is distributed along the @code{n0} dimension with block size
@code{block0}, and the @code{n1} by @code{n0} output is distributed
along the @code{n1} dimension with block size @code{block1}.  If
@code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
uses its default block size.  To get the local size of the data on
each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
@ctindex FFTW_MPI_DEFAULT_BLOCK
@findex fftw_mpi_local_size_many_transposed

@node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes
@subsection An improved replacement for MPI_Alltoall

We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the @code{MPI_Alltoall} function
(albeit only for floating-point types), and in some circumstances can
function as an improved replacement.
@findex MPI_Alltoall


@code{MPI_Alltoall} is defined by the MPI standard as:

@example
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm);
@end example

In particular, for @code{double*} arrays @code{in} and @code{out},
consider the call:

@example
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
@end example

This is completely equivalent to:

@example
MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example

That is, computing a @twodims{P,P} transpose on @code{P} processes,
with a block size of 1, is just a standard all-to-all communication.

However, using the FFTW routine instead of @code{MPI_Alltoall} may
have certain advantages.  First of all, FFTW's routine can operate
in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
operate out-of-place.
@cindex in-place


Second, even for out-of-place plans, FFTW's routine may be faster,
especially if you need to perform the all-to-all communication many
times and can afford to use @code{FFTW_MEASURE} or
@code{FFTW_PATIENT}.  It should certainly be no slower, not including
the time to create the plan, since one of the possible algorithms that
FFTW uses for an out-of-place transpose @emph{is} simply to call
@code{MPI_Alltoall}.  However, FFTW also considers several other
possible algorithms that, depending on your MPI implementation and
your hardware, may be faster.
@ctindex FFTW_MEASURE
@ctindex FFTW_PATIENT

@c ------------------------------------------------------------
@node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
@section FFTW MPI Wisdom
@cindex wisdom
@cindex saving plans to disk

FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
be used to save MPI plans as well as to save uniprocessor plans.
However, for MPI there are several unavoidable complications.

@cindex MPI I/O
First, the MPI standard does not guarantee that every process can
perform file I/O (at least, not using C stdio routines)---in general,
we may only assume that process 0 is capable of I/O.@footnote{In fact,
even this assumption is not technically guaranteed by the standard,
although it seems to be universal in actual MPI implementations and is
widely assumed by MPI-using software.  Technically, you need to query
the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
@code{MPI_Attr_get}.  If this attribute is @code{MPI_PROC_NULL}, no
I/O is possible.  If it is @code{MPI_ANY_SOURCE}, any process can
perform I/O.  Otherwise, it is the rank of a process that can perform
I/O ... but since it is not guaranteed to yield the @emph{same} rank
on all processes, you have to do an @code{MPI_Allreduce} of some kind
if you want all processes to agree about which is going to do I/O.
And even then, the standard only guarantees that this process can
perform output, but not input.  See e.g. @cite{Parallel Programming
with MPI} by P. S. Pacheco, section 8.1.3.
Needless to say, in our
experience virtually no MPI programmers worry about this.}  So, if we
want to export the wisdom from a single process to a file, we must
first export the wisdom to a string, then send it to process 0, then
write it to a file.

Second, in principle we may want to have separate wisdom for every
process, since in general the processes may run on different hardware
even for a single MPI program.  However, in practice FFTW's MPI code
is designed for the case of homogeneous hardware (@pxref{Load
balancing}), and in this case it is convenient to use the same wisdom
for every process.  Thus, we need a mechanism to synchronize the wisdom.

To address both of these problems, FFTW provides the following two
functions:

@example
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
void fftw_mpi_gather_wisdom(MPI_Comm comm);
@end example
@findex fftw_mpi_gather_wisdom
@findex fftw_mpi_broadcast_wisdom

Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
will broadcast the wisdom from process 0 to all other processes.
Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
processes onto process 0.  (If the plans created for the same problem
by different processes are not the same, @code{fftw_mpi_gather_wisdom}
will arbitrarily choose one of the plans.)  Both of these functions
may result in suboptimal plans for different processes if the
processes are running on non-identical hardware.  Both of these
functions are @emph{collective} calls, which means that they must be
executed by all processes in the communicator.
@cindex collective function


So, for example, a typical code snippet to import wisdom from a file
and use it on all processes would be:

@example
@{
    int rank;

    fftw_mpi_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
    fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
@}
@end example

(Note that we must call @code{fftw_mpi_init} before importing any
wisdom that might contain MPI plans.)  Similarly, a typical code
snippet to export wisdom from all processes to a file is:
@findex fftw_mpi_init

@example
@{
    int rank;

    fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
@}
@end example

@c ------------------------------------------------------------
@node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
@section Avoiding MPI Deadlocks
@cindex deadlock

An MPI program can @emph{deadlock} if one process is waiting for a
message from another process that never gets sent.  To avoid deadlocks
when using FFTW's MPI routines, it is important to know which
functions are @emph{collective}: that is, which functions must
@emph{always} be called in the @emph{same order} from @emph{every}
process in a given communicator.  (For example, @code{MPI_Barrier} is
the canonical example of a collective function in the MPI standard.)
@cindex collective function
@findex MPI_Barrier


The functions in FFTW that are @emph{always} collective are: every
function beginning with @samp{fftw_mpi_plan}, as well as
@code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
Also, the following functions from the ordinary FFTW interface are
collective when they are applied to a plan created by an
@samp{fftw_mpi_plan} function: @code{fftw_execute},
@code{fftw_destroy_plan}, and @code{fftw_flops}.
@findex fftw_execute
@findex fftw_destroy_plan
@findex fftw_flops
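
To illustrate the pitfall, the following sketch (with a hypothetical
rank-dependent branch) would deadlock or fail, because the planner is
a collective call that only some of the processes reach:

@example
/* @r{WRONG: collective calls must be made by every process} */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) /* @r{the other processes never call the planner} */
     plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                 FFTW_FORWARD, FFTW_MEASURE);
@end example

Every process in the communicator must instead execute the same
sequence of collective FFTW calls, in the same order.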

@c ------------------------------------------------------------
@node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI
@section FFTW MPI Performance Tips

In this section, we collect a few tips on getting the best performance
out of FFTW's MPI transforms.

First, because of the 1d block distribution, FFTW's parallelization is
currently limited by the size of the first dimension.
(Multidimensional block distributions may be supported by a future
version.)  More generally, you should ideally arrange the dimensions so
that FFTW can divide them equally among the processes.  @xref{Load
balancing}.
@cindex block distribution
@cindex load balancing


Second, if it is not too inconvenient, you should consider working
with transposed output for multidimensional plans, as this saves a
considerable amount of communications.  @xref{Transposed distributions}.
@cindex transpose


Third, the fastest choices are generally either an in-place transform
or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag
(which allows the input array to be used as scratch space).  In-place
is especially beneficial if the amount of data per process is large.
@ctindex FFTW_DESTROY_INPUT


Fourth, if you have multiple arrays to transform at once, rather than
calling FFTW's MPI transforms several times it usually seems to be
faster to interleave the data and use the advanced interface.  (This
groups the communications together instead of requiring separate
messages for each transform.)

@c ------------------------------------------------------------
@node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI
@section Combining MPI and Threads
@cindex threads

In certain cases, it may be advantageous to combine MPI
(distributed-memory) and threads (shared-memory) parallelization.
FFTW supports this, with certain caveats.  For example, if you have a
cluster of 4-processor shared-memory nodes, you may want to use
threads within the nodes and MPI between the nodes, instead of MPI for
all parallelization.

In particular, it is possible to seamlessly combine the MPI FFTW
routines with the multi-threaded FFTW routines (@pxref{Multi-threaded
FFTW}).  However, some care must be taken in the initialization code,
which should look something like this:

@example
int threads_ok;

int main(int argc, char **argv)
@{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    threads_ok = provided >= MPI_THREAD_FUNNELED;

    if (threads_ok) threads_ok = fftw_init_threads();
    fftw_mpi_init();

    ...
    if (threads_ok) fftw_plan_with_nthreads(...);
    ...

    MPI_Finalize();
@}
@end example
@findex fftw_mpi_init
@findex fftw_init_threads
@findex fftw_plan_with_nthreads

First, note that instead of calling @code{MPI_Init}, you should call
@code{MPI_Init_thread}, which is the initialization routine defined
by the MPI-2 standard to indicate to MPI that your program will be
multithreaded.  We pass @code{MPI_THREAD_FUNNELED}, which indicates
that we will only call MPI routines from the main thread.  (FFTW will
launch additional threads internally, but the extra threads will not
call MPI code.)  (You may also pass @code{MPI_THREAD_SERIALIZED} or
@code{MPI_THREAD_MULTIPLE}, which requests additional multithreading
support from the MPI implementation, but this is not required by
FFTW.)  The @code{provided} parameter returns what level of threads
support is actually supported by your MPI implementation; this
@emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call
the FFTW threads routines, so we define a global variable
@code{threads_ok} to record this.  You should only call
@code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if
@code{threads_ok} is true.  For more information on thread safety in
MPI, see the
@uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and
Threads} section of the MPI-2 standard.
@cindex thread safety


Second, we must call @code{fftw_init_threads} @emph{before}
@code{fftw_mpi_init}.  This is critical for technical reasons having
to do with how FFTW initializes its list of algorithms.

Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI
process will launch (up to) @code{N} threads to parallelize its transforms.

For example, in the hypothetical cluster of 4-processor nodes, you
might wish to launch only a single MPI process per node, and then call
@code{fftw_plan_with_nthreads(4)} on each process to use all
processors in the nodes.

This may or may not be faster than simply using as many MPI processes
as you have processors, however.  On the one hand, using threads
within a node eliminates the need for explicit message passing within
the node.  On the other hand, FFTW's transpose routines are not
multi-threaded, and this means that the communications that do take
place will not benefit from parallelization within the node.
Moreover, many MPI implementations already have optimizations to
exploit shared memory when it is available, so adding the
multithreaded FFTW on top of this may be superfluous.
@cindex transpose

@c ------------------------------------------------------------
@node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI
@section FFTW MPI Reference

This section provides a complete reference to all FFTW MPI functions,
datatypes, and constants.  See also @ref{FFTW Reference} for information
on functions and types in common with the serial interface.

@menu
* MPI Files and Data Types::
* MPI Initialization::
* Using MPI Plans::
* MPI Data Distribution Functions::
* MPI Plan Creation::
* MPI Wisdom Communication::
@end menu

@node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference
@subsection MPI Files and Data Types

All programs using FFTW's MPI support should include its header file:

@example
#include <fftw3-mpi.h>
@end example

Note that this header file includes the serial-FFTW @code{fftw3.h}
header file, and also the @code{mpi.h} header file for MPI, so you
need not include those files separately.

You must also link to @emph{both} the FFTW MPI library and to the
serial FFTW library.  On Unix, this means adding @code{-lfftw3_mpi
-lfftw3 -lm} at the end of the link command.

@cindex precision
Different precisions are handled as in the serial interface:
@xref{Precision}.  That is, @samp{fftw_} functions become
@code{fftwf_} (in single precision) etcetera, and the libraries become
@code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix.  Long-double
precision is supported in MPI, but quad precision (@samp{fftwq_}) is
not due to the lack of MPI support for this type.

@node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference
@subsection MPI Initialization

Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and
before importing any wisdom for MPI problems, you must call:

@findex fftw_mpi_init
@example
void fftw_mpi_init(void);
@end example

@findex fftw_init_threads
If FFTW threads support is used, however, @code{fftw_mpi_init} should
be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI
and Threads}).  Calling @code{fftw_mpi_init} additional times (before
@code{fftw_mpi_cleanup}) has no effect.


If you want to deallocate all persistent data and reset FFTW to the
pristine state it was in when you started your program, you can call:

@findex fftw_mpi_cleanup
@example
void fftw_mpi_cleanup(void);
@end example

@findex fftw_cleanup
(This calls @code{fftw_cleanup}, so you need not call the serial
cleanup routine too, although it is safe to do so.)  After calling
@code{fftw_mpi_cleanup}, all existing plans become undefined, and you
should not attempt to execute or destroy them.  You must call
@code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want
to resume using the MPI FFTW routines.

@node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference
@subsection Using MPI Plans

Once an MPI plan is created, you can execute and destroy it using
@code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions
in the serial interface that operate on generic plans (@pxref{Using
Plans}).

@cindex collective function
@cindex MPI communicator
The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to
MPI plans, are @emph{collective} calls: they must be called for all processes
in the communicator that was used to create the plan.

@cindex new-array execution
You must @emph{not} use the serial new-array plan-execution functions
@code{fftw_execute_dft} and so on (@pxref{New-array Execute
Functions}) with MPI plans.  Such functions are specialized to the
problem type, and there are specific new-array execute functions for
MPI plans:

@findex fftw_mpi_execute_dft
@findex fftw_mpi_execute_dft_r2c
@findex fftw_mpi_execute_dft_c2r
@findex fftw_mpi_execute_r2r
@example
void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
@end example

@cindex alignment
@findex fftw_malloc
These functions have the same restrictions as those of the serial
new-array execute functions.  They are @emph{always} safe to apply to
the @emph{same} @code{in} and @code{out} arrays that were used to
create the plan.  They can only be applied to new arrays if those
arrays have the same types, dimensions, in-placeness, and alignment as
the original arrays, where the best way to ensure the same alignment
is to use FFTW's @code{fftw_malloc} and related allocation functions
for all arrays (@pxref{Memory Allocation}).  Note that distributed
transposes (@pxref{FFTW MPI Transposes}) use
@code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans
from FFTW's perspective.
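
For example, assuming a complex-DFT plan that was created out-of-place,
one might apply it to a second, independently allocated pair of arrays
(a sketch; @code{alloc_local} and @code{plan} are assumed from earlier):

@example
fftw_complex *in2 = fftw_alloc_complex(alloc_local);
fftw_complex *out2 = fftw_alloc_complex(alloc_local);
/* @r{... initialize the local block of in2 ...} */
fftw_mpi_execute_dft(plan, in2, out2);
@end example

Allocating the new arrays with @code{fftw_alloc_complex} keeps their
alignment consistent with the arrays used at planning time.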

@node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
@subsection MPI Data Distribution Functions

@cindex data distribution
As described above (@pxref{MPI Data Distribution}), in order to
allocate your arrays, @emph{before} creating a plan, you must first
call one of the following routines to determine the required
allocation size and the portion of the array locally stored on a given
process.  The @code{MPI_Comm} communicator passed here must be
equivalent to the communicator used below for plan creation.
@node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
@subsection MPI Data Distribution Functions

@cindex data distribution
As described above (@pxref{MPI Data Distribution}), in order to
allocate your arrays, @emph{before} creating a plan, you must first
call one of the following routines to determine the required
allocation size and the portion of the array locally stored on a given
process.  The @code{MPI_Comm} communicator passed here must be
equivalent to the communicator used below for plan creation.

The basic interface for multidimensional transforms consists of the
functions:

@findex fftw_mpi_local_size_2d
@findex fftw_mpi_local_size_3d
@findex fftw_mpi_local_size
@findex fftw_mpi_local_size_2d_transposed
@findex fftw_mpi_local_size_3d_transposed
@findex fftw_mpi_local_size_transposed
@example
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                 MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start);

ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                            MPI_Comm comm,
                                            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
                                         ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                         ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example

These functions return the number of elements to allocate (complex
numbers for DFT/r2c/c2r plans, real numbers for r2r plans), while
@code{local_n0} and @code{local_0_start} return the portion
(@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the
first dimension of an @ndims{} array that is stored on the local
process.  @xref{Basic and advanced distribution interfaces}.  For
@code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants
are useful in order to also return the local portion of the first
dimension in the @ndimstrans{} transposed output.
@xref{Transposed distributions}.

The advanced interface for multidimensional transforms is:

@cindex advanced interface
@findex fftw_mpi_local_size_many
@findex fftw_mpi_local_size_many_transposed
@example
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                                   ptrdiff_t block0, MPI_Comm comm,
                                   ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                                              ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
                                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                              ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example

These differ from the basic interface in only two ways.  First, they
allow you to specify block sizes @code{block0} and @code{block1} (the
latter for the transposed output); you can pass
@code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in
the basic interface.  Second, you can pass a @code{howmany} parameter,
corresponding to the advanced planning interface below: this is for
transforms of contiguous @code{howmany}-tuples of numbers
(@code{howmany = 1} in the basic interface).
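
For example, a typical allocation for a three-dimensional complex DFT
might look like the following sketch, assuming @code{n0}, @code{n1},
and @code{n2} hold the transform dimensions and FFTW's MPI support has
been initialized:

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;

alloc_local = fftw_mpi_local_size_3d(n0, n1, n2, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);

/* @r{this process stores the local_n0 slabs starting at local_0_start} */
@end example

Note that @code{alloc_local} may exceed @code{local_n0 * n1 * n2},
since FFTW may require extra intermediate storage; this is why you
should allocate the return value rather than computing the size
yourself.
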
The corresponding basic and advanced routines for one-dimensional
transforms (currently only complex DFTs) are:

@findex fftw_mpi_local_size_1d
@findex fftw_mpi_local_size_many_1d
@example
ptrdiff_t fftw_mpi_local_size_1d(
             ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t fftw_mpi_local_size_many_1d(
             ptrdiff_t n0, ptrdiff_t howmany,
             MPI_Comm comm, int sign, unsigned flags,
             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
@end example

@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
As above, the return value is the number of elements to allocate
(complex numbers, for complex DFTs).  The @code{local_ni} and
@code{local_i_start} arguments return the portion
(@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the
1d array that is stored on this process for the transform
@emph{input}, and @code{local_no} and @code{local_o_start} are the
corresponding quantities for the transform @emph{output}.  The
@code{sign} (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and
@code{flags} must match the arguments passed when creating a plan.
Although the inputs and outputs have different data distributions in
general, it is guaranteed that the @emph{output} data distribution of
an @code{FFTW_FORWARD} plan will match the @emph{input} data
distribution of an @code{FFTW_BACKWARD} plan and vice versa; similarly
for the @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN}
flags.  @xref{One-dimensional distributions}.
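
For instance, a one-dimensional forward DFT might be set up as in the
following sketch (using the plan-creation function documented in the
next subsection); note that the same @code{sign} and @code{flags} are
passed to both calls, and that @code{n0} must be composite:

@example
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *in, *out;
fftw_plan p;

alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MEASURE,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
in = fftw_alloc_complex(alloc_local);
out = fftw_alloc_complex(alloc_local);

p = fftw_mpi_plan_dft_1d(n0, in, out, MPI_COMM_WORLD,
                         FFTW_FORWARD, FFTW_MEASURE);
@end example
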
@node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
@subsection MPI Plan Creation

@subsubheading Complex-data MPI DFTs

Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:

@findex fftw_mpi_plan_dft_1d
@findex fftw_mpi_plan_dft_2d
@findex fftw_mpi_plan_dft_3d
@findex fftw_mpi_plan_dft
@findex fftw_mpi_plan_many_dft
@example
fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
                            fftw_complex *in, fftw_complex *out,
                            MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
                                 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
                                 fftw_complex *in, fftw_complex *out,
                                 MPI_Comm comm, int sign, unsigned flags);
@end example

@cindex MPI communicator
@cindex collective function
These are similar to their serial counterparts (@pxref{Complex DFTs})
in specifying the dimensions, sign, and flags of the transform.  The
@code{comm} argument gives an MPI communicator that specifies the set
of processes to participate in the transform; plan creation is a
collective function that must be called for all processes in the
communicator.  The @code{in} and @code{out} pointers refer only to a
portion of the overall transform data (@pxref{MPI Data Distribution})
as specified by the @samp{local_size} functions in the previous
section.  Unless @code{flags} contains @code{FFTW_ESTIMATE}, these
arrays are overwritten during plan creation as for the serial
interface.  For multi-dimensional transforms, any dimensions @code{>
1} are supported; for one-dimensional transforms, only composite
(non-prime) @code{n0} are currently supported (unlike the serial
FFTW).  Requesting an unsupported transform size will yield a
@code{NULL} plan.  (As in the serial interface, highly composite sizes
generally yield the best performance.)

@cindex advanced interface
@ctindex FFTW_MPI_DEFAULT_BLOCK
@cindex stride
The advanced-interface @code{fftw_mpi_plan_many_dft} additionally
allows you to specify the block sizes for the first dimension
(@code{block}) of the @ndims{} input data and the first dimension
(@code{tblock}) of the @ndimstrans{} transposed data (at intermediate
steps of the transform, and for the output if
@code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}).  These
must be the same block sizes as were passed to the corresponding
@samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK}
to use FFTW's default block size as in the basic interface.  Also, the
@code{howmany} parameter specifies that the transform is of contiguous
@code{howmany}-tuples rather than individual complex numbers; this
corresponds to the same parameter in the serial advanced interface
(@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and
@code{dist = 1}.

@subsubheading MPI flags

The @code{flags} can be any of those for the serial FFTW
(@pxref{Planner Flags}), and in addition may include one or more of
the following MPI-specific flags, which improve performance at the
cost of changing the output or input data formats.

@itemize @bullet

@item
@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
@code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for
1d transforms only, these flags indicate that the output/input of the
transform are in an undocumented ``scrambled'' order.  A forward
@code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward
@code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).
@xref{One-dimensional distributions}.

@item
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
@code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid
for multidimensional (@code{rnk > 1}) transforms only, these flags
specify that the output or input of an @ndims{} transform is
transposed to @ndimstrans{}.  @xref{Transposed distributions}, and see
the sketch after this list.

@end itemize
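
As an illustration of the transposed flags, the following sketch plans
an in-place two-dimensional DFT whose output is left in transposed
@twodims{n1, n0} order (assuming @code{n0} and @code{n1} have been
set); note the matching @samp{_transposed} @samp{local_size} call:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
fftw_complex *data;
fftw_plan p;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

p = fftw_mpi_plan_dft_2d(n0, n1, data, data, MPI_COMM_WORLD,
                         FFTW_FORWARD,
                         FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);

/* @r{input: local_n0 rows of the n0 x n1 array;}
   @r{output: local_n1 rows of the transposed n1 x n0 array} */
@end example
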
@subsubheading Real-data MPI DFTs

@cindex r2c
Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional
MPI DFTs of Real Data}) are created by:

@findex fftw_mpi_plan_dft_r2c_2d
@findex fftw_mpi_plan_dft_r2c_3d
@findex fftw_mpi_plan_dft_r2c
@findex fftw_mpi_plan_dft_c2r_2d
@findex fftw_mpi_plan_dft_c2r_3d
@findex fftw_mpi_plan_dft_c2r
@example
fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
                                   double *in, fftw_complex *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                   double *in, fftw_complex *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
                                double *in, fftw_complex *out,
                                MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
                                   fftw_complex *in, double *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                   fftw_complex *in, double *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
                                fftw_complex *in, double *out,
                                MPI_Comm comm, unsigned flags);
@end example

Similar to the serial interface (@pxref{Real-data DFTs}), these
transform logically @ndims{} real data to/from @ndimshalf{} complex
data, representing the non-redundant half of the conjugate-symmetry
output of a real-input DFT (@pxref{Multi-dimensional Transforms}).
However, the real array must be stored within a padded @ndimspad{}
array (much like the in-place serial r2c transforms, but here for
out-of-place transforms as well).  Currently, only multi-dimensional
(@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan
for @code{rnk = 1} will yield @code{NULL}).  As explained above
(@pxref{Multi-dimensional MPI DFTs of Real Data}), the data
distribution of both the real and complex arrays is given by the
@samp{local_size} function called for the dimensions of the
@emph{complex} array.  Similar to the other planning functions, the
input and output arrays are overwritten when the plan is created
except in @code{FFTW_ESTIMATE} mode.
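
For example, an out-of-place two-dimensional r2c transform might be
set up as in the following sketch (assuming @code{n0} and @code{n1}
have been set); note that the @samp{local_size} call uses the
dimensions of the @emph{complex} array, and that the real array is
allocated with padding:

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
double *rin;
fftw_complex *cout;
fftw_plan p;

/* @r{distribution of the complex n0 x (n1/2+1) output array} */
alloc_local = fftw_mpi_local_size_2d(n0, n1/2 + 1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
rin = fftw_alloc_real(2 * alloc_local);  /* @r{padded: n0 x 2*(n1/2+1)} */
cout = fftw_alloc_complex(alloc_local);

p = fftw_mpi_plan_dft_r2c_2d(n0, n1, rin, cout,
                             MPI_COMM_WORLD, FFTW_MEASURE);
@end example
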
As for the complex DFTs above, there is an advanced interface that
allows you to manually specify block sizes and to transform contiguous
@code{howmany}-tuples of real/complex numbers:

@findex fftw_mpi_plan_many_dft_r2c
@findex fftw_mpi_plan_many_dft_c2r
@example
fftw_plan fftw_mpi_plan_many_dft_r2c
              (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
               ptrdiff_t iblock, ptrdiff_t oblock,
               double *in, fftw_complex *out,
               MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft_c2r
              (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
               ptrdiff_t iblock, ptrdiff_t oblock,
               fftw_complex *in, double *out,
               MPI_Comm comm, unsigned flags);
@end example

@subsubheading MPI r2r transforms

@cindex r2r
There are corresponding plan-creation routines for r2r
transforms (@pxref{More DFTs of Real Data}), currently supporting
multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will
yield a @code{NULL} plan):

@findex fftw_mpi_plan_r2r_2d
@findex fftw_mpi_plan_r2r_3d
@findex fftw_mpi_plan_r2r
@findex fftw_mpi_plan_many_r2r
@example
fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
                               double *in, double *out,
                               MPI_Comm comm,
                               fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                               unsigned flags);
fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                               double *in, double *out,
                               MPI_Comm comm,
                               fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                               fftw_r2r_kind kind2, unsigned flags);
fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
                            double *in, double *out,
                            MPI_Comm comm, const fftw_r2r_kind *kind,
                            unsigned flags);
fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
                                 ptrdiff_t howmany,
                                 ptrdiff_t iblock, ptrdiff_t oblock,
                                 double *in, double *out,
                                 MPI_Comm comm, const fftw_r2r_kind *kind,
                                 unsigned flags);
@end example

The parameters are much the same as for the complex DFTs above, except
that the arrays are of real numbers (and hence the outputs of the
@samp{local_size} data-distribution functions should be interpreted as
counts of real rather than complex numbers).  Also, the @code{kind}
parameters specify the r2r kinds along each dimension as for the
serial interface (@pxref{Real-to-Real Transform Kinds}).  @xref{Other
Multi-dimensional Real-data MPI Transforms}.

@subsubheading MPI transposition
@cindex transpose

FFTW also provides routines to plan a transpose of a distributed
@code{n0} by @code{n1} array of real numbers, or an array of
@code{howmany}-tuples of real numbers with specified block sizes
(@pxref{FFTW MPI Transposes}):

@findex fftw_mpi_plan_transpose
@findex fftw_mpi_plan_many_transpose
@example
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_transpose
              (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
               ptrdiff_t block0, ptrdiff_t block1,
               double *in, double *out, MPI_Comm comm, unsigned flags);
@end example

@cindex new-array execution
@findex fftw_mpi_execute_r2r
These plans are used with the @code{fftw_mpi_execute_r2r} new-array
execute function (@pxref{Using MPI Plans}), since they count as
rank-zero r2r plans from FFTW's perspective.
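
For example, an in-place distributed transpose of an @code{n0} by
@code{n1} real matrix might be set up as in the following sketch; as
for transposed-output DFTs, the @samp{_transposed} @samp{local_size}
variant gives the sizes of both the input and output slabs:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan p;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);  /* @r{real count for r2r-type plans} */

p = fftw_mpi_plan_transpose(n0, n1, data, data,
                            MPI_COMM_WORLD, FFTW_MEASURE);

fftw_execute(p);  /* @r{or use fftw_mpi_execute_r2r for new arrays} */
@end example
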
@node MPI Wisdom Communication,  , MPI Plan Creation, FFTW MPI Reference
@subsection MPI Wisdom Communication

To facilitate synchronizing wisdom among the different MPI processes,
we provide two functions:

@findex fftw_mpi_gather_wisdom
@findex fftw_mpi_broadcast_wisdom
@example
void fftw_mpi_gather_wisdom(MPI_Comm comm);
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
@end example

The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the
given communicator @code{comm} to the process of rank 0 in the
communicator: that process obtains the union of all wisdom on all the
processes.  As a side effect, some other processes will gain
additional wisdom from other processes, but only process 0 will gain
the complete union.

The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it
exports wisdom from process 0 in @code{comm} to all other processes in
the communicator, replacing any wisdom they currently have.

@xref{FFTW MPI Wisdom}.
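
A typical pattern, sketched below, imports wisdom from a file on
process 0 and broadcasts it before planning, then gathers the
accumulated wisdom back to process 0 afterwards for export (the file
name here is arbitrary):

@example
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* @r{read wisdom on one process and share it with the others} */
if (rank == 0) fftw_import_wisdom_from_filename("wisdom.dat");
fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);

/* @r{... create and use plans ...} */

/* @r{collect everything that was learned and save it} */
fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("wisdom.dat");
@end example
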
@c ------------------------------------------------------------
@node FFTW MPI Fortran Interface,  , FFTW MPI Reference, Distributed-memory FFTW with MPI
@section FFTW MPI Fortran Interface
@cindex Fortran interface

@cindex iso_c_binding
The FFTW MPI interface is callable from modern Fortran compilers
supporting the Fortran 2003 @code{iso_c_binding} standard for calling
C functions.  As described in @ref{Calling FFTW from Modern Fortran},
this means that you can directly call FFTW's C interface from Fortran
with only minor changes in syntax.  There are, however, a few things
specific to the MPI interface to keep in mind:

@itemize @bullet

@item
Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran
interface}, you should @code{include 'fftw3-mpi.f03'} (after
@code{use, intrinsic :: iso_c_binding} as before).  The
@code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should
@emph{not} @code{include} them both yourself.  (You will also want to
include the MPI header file, usually via @code{include 'mpif.h'} or
similar, although this is not needed by @code{fftw3-mpi.f03} @i{per
se}.)  (To use the @samp{fftwl_} @code{long double} extended-precision
routines in supporting compilers, you should include
@code{fftw3l-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}.
@xref{Extended and quadruple precision in Fortran}.)

@item
Because of the different storage conventions between C and Fortran,
you reverse the order of your array dimensions when passing them to
FFTW (@pxref{Reversing array dimensions}).  This is merely a
difference in notation and incurs no performance overhead.  However,
it means that, whereas in C the @emph{first} dimension is distributed,
in Fortran the @emph{last} dimension of your array is distributed.

@item
@cindex MPI communicator
In Fortran, communicators are stored as @code{integer} types; there is
no @code{MPI_Comm} type, nor is there any way to access a C
@code{MPI_Comm}.  Fortunately, this is taken care of for you by the
FFTW Fortran interface: whenever the C interface expects an
@code{MPI_Comm} type, you should pass the Fortran communicator as an
@code{integer}.@footnote{Technically, this is because you aren't
actually calling the C functions directly.  You are calling wrapper
functions that translate the communicator with @code{MPI_Comm_f2c}
before calling the ordinary C interface.  This is all done
transparently, however, since the @code{fftw3-mpi.f03} interface file
renames the wrappers so that they are called in Fortran with the same
names as the C interface functions.}

@item
Because you need to call the @samp{local_size} function to find out
how much space to allocate, and this may be @emph{larger} than the
local portion of the array (@pxref{MPI Data Distribution}), you should
@emph{always} allocate your arrays dynamically using FFTW's allocation
routines as described in @ref{Allocating aligned memory in Fortran}.
(Coincidentally, this also provides the best performance by
guaranteeing proper data alignment.)

@item
Because all sizes in the MPI FFTW interface are declared as
@code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in
Fortran (@pxref{FFTW Fortran type reference}).

@item
@findex fftw_execute_dft
@findex fftw_mpi_execute_dft
@cindex new-array execution
In Fortran, because of the language semantics, we generally recommend
using the new-array execute functions for all plans, even in the
common case where you are executing the plan on the same arrays for
which the plan was created (@pxref{Plan execution in Fortran}).
However, note that in the MPI interface these functions are changed:
@code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft},
etcetera.  @xref{Using MPI Plans}.

@end itemize

For example, here is a Fortran code snippet to perform a distributed
@twodims{L,M} complex DFT in-place.  (This assumes you have already
initialized MPI with @code{MPI_init} and have also performed
@code{call fftw_mpi_init}.)

@example
  use, intrinsic :: iso_c_binding
  include 'fftw3-mpi.f03'
  integer(C_INTPTR_T), parameter :: L = ...
  integer(C_INTPTR_T), parameter :: M = ...
  type(C_PTR) :: plan, cdata
  complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
  integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset

! @r{get local data size and allocate (note dimension reversal)}
  alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
                                       local_M, local_j_offset)
  cdata = fftw_alloc_complex(alloc_local)
  call c_f_pointer(cdata, data, [L,local_M])

! @r{create MPI plan for in-place forward DFT (note dimension reversal)}
  plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
                              FFTW_FORWARD, FFTW_MEASURE)

! @r{initialize data to some function} my_function(i,j)
  do j = 1, local_M
    do i = 1, L
      data(i, j) = my_function(i, j + local_j_offset)
    end do
  end do

! @r{compute transform (as many times as desired)}
  call fftw_mpi_execute_dft(plan, data, data)

  call fftw_destroy_plan(plan)
  call fftw_free(cdata)
@end example

Note that we called @code{fftw_mpi_local_size_2d} and
@code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order,
since a @twodims{L,M} Fortran array is viewed by FFTW in C as a
@twodims{M,L} array.  This means that the array was distributed over
the @code{M} dimension, the local portion of which is a
@twodims{L,local_M} array in Fortran.
(You must @emph{not} use an @code{allocate} statement to allocate an
@twodims{L,local_M} array, however; you must allocate
@code{alloc_local} complex numbers, which may be greater than
@code{L * local_M}, in order to reserve space for intermediate steps
of the transform.)  Finally, we mention that because C's array indices
are zero-based, the @code{local_j_offset} argument can conveniently be
interpreted as an offset in the 1-based @code{j} index (rather than as
a starting index as in C).

If instead you had used the @code{ior(FFTW_MEASURE,
FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a
transposed @twodims{M,local_L} array, associated with the @emph{same}
@code{cdata} allocation (since the transform is in-place), and which
you could declare with:

@example
  complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
  ...
  call c_f_pointer(cdata, tdata, [M,local_L])
@end example

where @code{local_L} would have been obtained by changing the
@code{fftw_mpi_local_size_2d} call to:

@example
  alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
                local_M, local_j_offset, local_L, local_i_offset)
@end example