1@node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
2@chapter Distributed-memory FFTW with MPI
3@cindex MPI
4
5@cindex parallel transform
6In this chapter we document the parallel FFTW routines for parallel
7systems supporting the MPI message-passing interface.  Unlike the
8shared-memory threads described in the previous chapter, MPI allows
9you to use @emph{distributed-memory} parallelism, where each CPU has
10its own separate memory, and which can scale up to clusters of many
11thousands of processors.  This capability comes at a price, however:
12each process only stores a @emph{portion} of the data to be
13transformed, which means that the data structures and
programming interface are quite different from the serial or threads
15versions of FFTW.
16@cindex data distribution
17
18
19Distributed-memory parallelism is especially useful when you are
20transforming arrays so large that they do not fit into the memory of a
21single processor.  The storage per-process required by FFTW's MPI
22routines is proportional to the total array size divided by the number
23of processes.  Conversely, distributed-memory parallelism can easily
24pose an unacceptably high communications overhead for small problems;
25the threshold problem size for which parallelism becomes advantageous
26will depend on the precise problem you are interested in, your
27hardware, and your MPI implementation.
28
29A note on terminology: in MPI, you divide the data among a set of
30``processes'' which each run in their own memory address space.
31Generally, each process runs on a different physical processor, but
32this is not required.  A set of processes in MPI is described by an
33opaque data structure called a ``communicator,'' the most common of
34which is the predefined communicator @code{MPI_COMM_WORLD} which
35refers to @emph{all} processes.  For more information on these and
36other concepts common to all MPI programs, we refer the reader to the
37documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home
38page}.
39@cindex MPI communicator
40@ctindex MPI_COMM_WORLD
41
42
43We assume in this chapter that the reader is familiar with the usage
44of the serial (uniprocessor) FFTW, and focus only on the concepts new
45to the MPI interface.
46
47@menu
48* FFTW MPI Installation::
49* Linking and Initializing MPI FFTW::
50* 2d MPI example::
51* MPI Data Distribution::
52* Multi-dimensional MPI DFTs of Real Data::
53* Other Multi-dimensional Real-data MPI Transforms::
54* FFTW MPI Transposes::
55* FFTW MPI Wisdom::
56* Avoiding MPI Deadlocks::
57* FFTW MPI Performance Tips::
58* Combining MPI and Threads::
59* FFTW MPI Reference::
60* FFTW MPI Fortran Interface::
61@end menu
62
63@c ------------------------------------------------------------
64@node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
65@section FFTW MPI Installation
66
67All of the FFTW MPI code is located in the @code{mpi} subdirectory of
68the FFTW package.  On Unix systems, the FFTW MPI libraries and header
69files are automatically configured, compiled, and installed along with
70the uniprocessor FFTW libraries simply by including
71@code{--enable-mpi} in the flags to the @code{configure} script
72(@pxref{Installation on Unix}).
73@fpindex configure
74
75
76Any implementation of the MPI standard, version 1 or later, should
77work with FFTW.  The @code{configure} script will attempt to
78automatically detect how to compile and link code using your MPI
79implementation.  In some cases, especially if you have multiple
80different MPI implementations installed or have an unusual MPI
81software package, you may need to provide this information explicitly.
82
83Most commonly, one compiles MPI code by invoking a special compiler
84command, typically @code{mpicc} for C code.  The @code{configure}
85script knows the most common names for this command, but you can
86specify the MPI compilation command explicitly by setting the
87@code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
88@fpindex mpicc
89
90
91If, instead of a special compiler command, you need to link a certain
92library, you can specify the link command via the @code{MPILIBS}
93variable, as in @samp{./configure MPILIBS=-lmpi ...}.  Note that if
94your MPI library is installed in a non-standard location (one the
95compiler does not know about by default), you may also have to specify
96the location of the library and header files via @code{LDFLAGS} and
97@code{CPPFLAGS} variables, respectively, as in @samp{./configure
98LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.
99
100@c ------------------------------------------------------------
101@node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
102@section Linking and Initializing MPI FFTW
103
104Programs using the MPI FFTW routines should be linked with
105@code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
106@code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
107(@pxref{Precision}). You will also need to link with whatever library
108is responsible for MPI on your system; in most MPI implementations,
109there is a special compiler alias named @code{mpicc} to compile and
110link MPI code.
111@fpindex mpicc
112@cindex linking on Unix
113@cindex precision
114
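For example, a double-precision program might be compiled and linked
with a command along the following lines (the wrapper name and the file
name @code{my_mpi_fft.c} are, of course, specific to your system and
program):

@example
mpicc my_mpi_fft.c -o my_mpi_fft -lfftw3_mpi -lfftw3 -lm
@end example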
115
116@findex fftw_init_threads
117Before calling any FFTW routines except possibly
118@code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
119@code{MPI_Init}, you should call the function:
120
121@example
122void fftw_mpi_init(void);
123@end example
124@findex fftw_mpi_init
125
126If, at the end of your program, you want to get rid of all memory and
127other resources allocated internally by FFTW, for both the serial and
128MPI routines, you can call:
129
130@example
131void fftw_mpi_cleanup(void);
132@end example
133@findex fftw_mpi_cleanup
134
135which is much like the @code{fftw_cleanup()} function except that it
136also gets rid of FFTW's MPI-related data.  You must @emph{not} execute
137any previously created plans after calling this function.
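
A minimal sketch of the overall initialization and cleanup sequence,
with the planning and execution steps elided, is:

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();    /* @r{after MPI_Init, before any other fftw_mpi_ call} */

    /* @r{... create, execute, and destroy plans ...} */

    fftw_mpi_cleanup(); /* @r{optional: release FFTW's internal resources} */
    MPI_Finalize();
@}
@end example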
138
139@c ------------------------------------------------------------
140@node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
141@section 2d MPI example
142
143Before we document the FFTW MPI interface in detail, we begin with a
144simple example outlining how one would perform a two-dimensional
145@code{N0} by @code{N1} complex DFT.
146
147@example
148#include <fftw3-mpi.h>
149
150int main(int argc, char **argv)
151@{
152    const ptrdiff_t N0 = ..., N1 = ...;
153    fftw_plan plan;
154    fftw_complex *data;
155    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
156
157    MPI_Init(&argc, &argv);
158    fftw_mpi_init();
159
160    /* @r{get local data size and allocate} */
161    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
162                                         &local_n0, &local_0_start);
163    data = fftw_alloc_complex(alloc_local);
164
165    /* @r{create plan for in-place forward DFT} */
166    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
167                                FFTW_FORWARD, FFTW_ESTIMATE);
168
169    /* @r{initialize data to some function} my_function(x,y) */
170    for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
171       data[i*N1 + j] = my_function(local_0_start + i, j);
172
173    /* @r{compute transforms, in-place, as many times as desired} */
174    fftw_execute(plan);
175
176    fftw_destroy_plan(plan);
177
178    MPI_Finalize();
179@}
180@end example
181
182As can be seen above, the MPI interface follows the same basic style
183of allocate/plan/execute/destroy as the serial FFTW routines.  All of
184the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
185of @samp{fftw_}.  There are a few important differences, however:
186
187First, we must call @code{fftw_mpi_init()} after calling
188@code{MPI_Init} (required in all MPI programs) and before calling any
189other @samp{fftw_mpi_} routine.
190@findex MPI_Init
191@findex fftw_mpi_init
192
193
194Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
195analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
196the communicator, indicating which processes will participate in the
197transform (here @code{MPI_COMM_WORLD}, indicating all processes).
198Whenever you create, execute, or destroy a plan for an MPI transform,
199you must call the corresponding FFTW routine on @emph{all} processes
200in the communicator for that transform.  (That is, these are
201@emph{collective} calls.)  Note that the plan for the MPI transform
uses the standard @code{fftw_execute} and @code{fftw_destroy_plan} routines
203(on the other hand, there are MPI-specific new-array execute functions
204documented below).
205@cindex collective function
206@findex fftw_mpi_plan_dft_2d
207@ctindex MPI_COMM_WORLD
208
209
210Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
211instead of @code{int} as for the serial FFTW.  @code{ptrdiff_t} is a
212standard C integer type which is (at least) 32 bits wide on a 32-bit
213machine and 64 bits wide on a 64-bit machine.  This is to make it easy
214to specify very large parallel transforms on a 64-bit machine.  (You
215can specify 64-bit transform sizes in the serial FFTW, too, but only
216by using the @samp{guru64} planner interface.  @xref{64-bit Guru
217Interface}.)
218@tindex ptrdiff_t
219@cindex 64-bit architecture
220
221
222Fourth, and most importantly, you don't allocate the entire
223two-dimensional array on each process.  Instead, you call
224@code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
225array resides on each processor, and how much space to allocate.
226Here, the portion of the array on each process is a @code{local_n0} by
227@code{N1} slice of the total array, starting at index
228@code{local_0_start}.  The total number of @code{fftw_complex} numbers
229to allocate is given by the @code{alloc_local} return value, which
230@emph{may} be greater than @code{local_n0 * N1} (in case some
231intermediate calculations require additional storage).  The data
232distribution in FFTW's MPI interface is described in more detail by
233the next section.
234@findex fftw_mpi_local_size_2d
235@cindex data distribution
236
237
238Given the portion of the array that resides on the local process, it
239is straightforward to initialize the data (here to a function
@code{my_function}) and otherwise manipulate it.  Of course, at the end
241of the program you may want to output the data somehow, but
242synchronizing this output is up to you and is beyond the scope of this
243manual.  (One good way to output a large multi-dimensional distributed
244array in MPI to a portable binary file is to use the free HDF5
245library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
246@cindex HDF5
247@cindex MPI I/O
248
249@c ------------------------------------------------------------
250@node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
251@section MPI Data Distribution
252@cindex data distribution
253
254The most important concept to understand in using FFTW's MPI interface
255is the data distribution.  With a serial or multithreaded FFT, all of
256the inputs and outputs are stored as a single contiguous chunk of
257memory.  With a distributed-memory FFT, the inputs and outputs are
258broken into disjoint blocks, one per process.
259
260In particular, FFTW uses a @emph{1d block distribution} of the data,
261distributed along the @emph{first dimension}.  For example, if you
262want to perform a @twodims{100,200} complex DFT, distributed over 4
263processes, each process will get a @twodims{25,200} slice of the data.
264That is, process 0 will get rows 0 through 24, process 1 will get rows
26525 through 49, process 2 will get rows 50 through 74, and process 3
266will get rows 75 through 99.  If you take the same array but
267distribute it over 3 processes, then it is not evenly divisible so the
268different processes will have unequal chunks.  FFTW's default choice
269in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
270process 2.
271@cindex block distribution
272
273
274FFTW provides several @samp{fftw_mpi_local_size} routines that you can
275call to find out what portion of an array is stored on the current
276process.  In most cases, you should use the default block sizes picked
277by FFTW, but it is also possible to specify your own block size.  For
278example, with a @twodims{100,200} array on three processes, you can
279tell FFTW to use a block size of 40, which would assign 40 rows to
280processes 0 and 1, and 20 rows to process 2.  FFTW's default is to
281divide the data equally among the processes if possible, and as best
282it can otherwise.  The rows are always assigned in ``rank order,''
283i.e. process 0 gets the first block of rows, then process 1, and so
284on.  (You can change this by using @code{MPI_Comm_split} to create a
285new communicator with re-ordered processes.)  However, you should
286always call the @samp{fftw_mpi_local_size} routines, if possible,
287rather than trying to predict FFTW's distribution choices.
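
For instance, a minimal sketch of how each process might query and
report its own portion of the @twodims{100,200} example above, using
the basic @code{fftw_mpi_local_size_2d} interface described below (the
printed format is purely illustrative):

@example
ptrdiff_t local_n0, local_0_start, alloc_local;
int rank;

alloc_local = fftw_mpi_local_size_2d(100, 200, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("process %d: rows %d..%d (%d rows), allocate %d complex numbers\n",
       rank, (int) local_0_start, (int) (local_0_start + local_n0 - 1),
       (int) local_n0, (int) alloc_local);
@end example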
288
289In particular, it is critical that you allocate the storage size that
290is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
291necessarily the size of the local slice of the array.  The reason is
292that intermediate steps of FFTW's algorithms involve transposing the
293array and redistributing the data, so at these intermediate steps FFTW
294may require more local storage space (albeit always proportional to
295the total size divided by the number of processes).  The
296@samp{fftw_mpi_local_size} functions know how much storage is required
297for these intermediate steps and tell you the correct amount to
298allocate.
299
300@menu
301* Basic and advanced distribution interfaces::
302* Load balancing::
303* Transposed distributions::
304* One-dimensional distributions::
305@end menu
306
307@node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
308@subsection Basic and advanced distribution interfaces
309
310As with the planner interface, the @samp{fftw_mpi_local_size}
311distribution interface is broken into basic and advanced
312(@samp{_many}) interfaces, where the latter allows you to specify the
313block size manually and also to request block sizes when computing
314multiple transforms simultaneously.  These functions are documented
315more exhaustively by the FFTW MPI Reference, but we summarize the
316basic ideas here using a couple of two-dimensional examples.
317
318For the @twodims{100,200} complex-DFT example, above, we would find
319the distribution by calling the following function in the basic
320interface:
321
322@example
323ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
324                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
325@end example
326@findex fftw_mpi_local_size_2d
327
328Given the total size of the data to be transformed (here, @code{n0 =
329100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
330function provides three numbers.
331
332First, it describes the shape of the local data: the current process
333should store a @code{local_n0} by @code{n1} slice of the overall
334dataset, in row-major order (@code{n1} dimension contiguous), starting
335at index @code{local_0_start}.  That is, if the total dataset is
336viewed as a @code{n0} by @code{n1} matrix, the current process should
337store the rows @code{local_0_start} to
338@code{local_0_start+local_n0-1}.  Obviously, if you are running with
339only a single MPI process, that process will store the entire array:
340@code{local_0_start} will be zero and @code{local_n0} will be
341@code{n0}.  @xref{Row-major Format}.
342@cindex row-major
343
344
345Second, the return value is the total number of data elements (e.g.,
346complex numbers for a complex DFT) that should be allocated for the
347input and output arrays on the current process (ideally with
348@code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
349alignment).  It might seem that this should always be equal to
350@code{local_n0 * n1}, but this is @emph{not} the case.  FFTW's
351distributed FFT algorithms require data redistributions at
352intermediate stages of the transform, and in some circumstances this
353may require slightly larger local storage.  This is discussed in more
354detail below, under @ref{Load balancing}.
355@findex fftw_malloc
356@findex fftw_alloc_complex
357
358
359@cindex advanced interface
360The advanced-interface @samp{local_size} function for multidimensional
361transforms returns the same three things (@code{local_n0},
362@code{local_0_start}, and the total number of elements to allocate),
363but takes more inputs:
364
365@example
366ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
367                                   ptrdiff_t howmany,
368                                   ptrdiff_t block0,
369                                   MPI_Comm comm,
370                                   ptrdiff_t *local_n0,
371                                   ptrdiff_t *local_0_start);
372@end example
373@findex fftw_mpi_local_size_many
374
375The two-dimensional case above corresponds to @code{rnk = 2} and an
376array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
377This routine is for any @code{rnk > 1}; one-dimensional transforms
378have their own interface because they work slightly differently, as
379discussed below.
380
381First, the advanced interface allows you to perform multiple
382transforms at once, of interleaved data, as specified by the
@code{howmany} parameter.  (@code{howmany} is 1 for a single
384transform.)
385
386Second, here you can specify your desired block size in the @code{n0}
387dimension, @code{block0}.  To use FFTW's default block size, pass
388@code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}.  Otherwise, on
389@code{P} processes, FFTW will return @code{local_n0} equal to
@code{block0} on the first @code{n0 / block0} processes (rounded down),
@code{local_n0} equal to @code{n0 - block0 * (n0 / block0)} on the next
process, and @code{local_n0} equal to zero on any remaining
393processes.  In general, we recommend using the default block size
394(which corresponds to @code{n0 / P}, rounded up).
395@ctindex FFTW_MPI_DEFAULT_BLOCK
396@cindex block distribution
397
398
399For example, suppose you have @code{P = 4} processes and @code{n0 =
40021}.  The default will be a block size of @code{6}, which will give
401@code{local_n0 = 6} on the first three processes and @code{local_n0 =
4023} on the last process.  Instead, however, you could specify
403@code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
404on processes 0 to 2, @code{local_n0 = 6} on process 3.  (This choice,
405while it may look superficially more ``balanced,'' has the same
406critical path as FFTW's default but requires more communications.)
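
As a concrete sketch, the corresponding advanced-interface call for the
default distribution of this @code{n0 = 21} example (with an arbitrary
second dimension of 200 and a single transform, @code{howmany = 1})
might look like:

@example
const ptrdiff_t n[2] = @{21, 200@};
ptrdiff_t local_n0, local_0_start, alloc_local;

alloc_local = fftw_mpi_local_size_many(2, n, 1, FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
/* @r{on 4 processes: local_n0 = 6, 6, 6, and 3 on ranks 0 to 3} */
@end example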
407
408@node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
409@subsection Load balancing
410@cindex load balancing
411
412Ideally, when you parallelize a transform over some @math{P}
413processes, each process should end up with work that takes equal time.
414Otherwise, all of the processes end up waiting on whichever process is
415slowest.  This goal is known as ``load balancing.''  In this section,
416we describe the circumstances under which FFTW is able to load-balance
417well, and in particular how you should choose your transform size in
418order to load balance.
419
420Load balancing is especially difficult when you are parallelizing over
heterogeneous machines; for example, if one of your processors is an
422old 486 and another is a Pentium IV, obviously you should give the
423Pentium more work to do than the 486 since the latter is much slower.
424FFTW does not deal with this problem, however---it assumes that your
425processes run on hardware of comparable speed, and that the goal is
426therefore to divide the problem as equally as possible.
427
428For a multi-dimensional complex DFT, FFTW can divide the problem
429equally among the processes if: (i) the @emph{first} dimension
@code{n0} is divisible by @math{P}; and (ii) the @emph{product} of
431the subsequent dimensions is divisible by @math{P}.  (For the advanced
432interface, where you can specify multiple simultaneous transforms via
433some ``vector'' length @code{howmany}, a factor of @code{howmany} is
434included in the product of the subsequent dimensions.)
435
436For a one-dimensional complex DFT, the length @code{N} of the data
437should be divisible by @math{P} @emph{squared} to be able to divide
438the problem equally among the processes.
439
440@node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
441@subsection Transposed distributions
442
443Internally, FFTW's MPI transform algorithms work by first computing
444transforms of the data local to each process, then by globally
445@emph{transposing} the data in some fashion to redistribute the data
446among the processes, transforming the new data local to each process,
447and transposing back.  For example, a two-dimensional @code{n0} by
448@code{n1} array, distributed across the @code{n0} dimension, is
transformed by: (i) transforming the @code{n1} dimension, which is
450local to each process; (ii) transposing to an @code{n1} by @code{n0}
451array, distributed across the @code{n1} dimension; (iii) transforming
452the @code{n0} dimension, which is now local to each process; (iv)
453transposing back.
454@cindex transpose
455
456
457However, in many applications it is acceptable to compute a
458multidimensional DFT whose results are produced in transposed order
459(e.g., @code{n1} by @code{n0} in two dimensions).  This provides a
460significant performance advantage, because it means that the final
461transposition step can be omitted.  FFTW supports this optimization,
462which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
463to the planner routines.  To compute the inverse transform of
464transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
465it that the input is transposed.  In this section, we explain how to
466interpret the output format of such a transform.
467@ctindex FFTW_MPI_TRANSPOSED_OUT
468@ctindex FFTW_MPI_TRANSPOSED_IN
469
470
Suppose you are transforming multi-dimensional data with (at
472least two) dimensions @ndims{}.  As always, it is distributed along
473the first dimension @dimk{0}.  Now, if we compute its DFT with the
474@code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
475with the first @emph{two} dimensions transposed: @ndimstrans{},
476distributed along the @dimk{1} dimension.  Conversely, if we take the
477@ndimstrans{} data and transform it with the
478@code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
479original @ndims{} array.
480
481There are two ways to find the portion of the transposed array that
482resides on the current process.  First, you can simply call the
483appropriate @samp{local_size} function, passing @ndimstrans{} (the
484transposed dimensions).  This would mean calling the @samp{local_size}
485function twice, once for the transposed and once for the
486non-transposed dimensions.  Alternatively, you can call one of the
487@samp{local_size_transposed} functions, which returns both the
488non-transposed and transposed data distribution from a single call.
489For example, for a 3d transform with transposed output (or input), you
490might call:
491
492@example
493ptrdiff_t fftw_mpi_local_size_3d_transposed(
494                ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
495                ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
496                ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
497@end example
498@findex fftw_mpi_local_size_3d_transposed
499
500Here, @code{local_n0} and @code{local_0_start} give the size and
501starting index of the @code{n0} dimension for the
502@emph{non}-transposed data, as in the previous sections.  For
503@emph{transposed} data (e.g. the output for
504@code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
505@code{local_1_start} give the size and starting index of the @code{n1}
506dimension, which is the first dimension of the transposed data
507(@code{n1} by @code{n0} by @code{n2}).
508
509(Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
510performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
511dimensions to the planner in reverse order, or vice versa.  If you
512pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
513@code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
514first two dimensions passed to the planner and passing @emph{neither}
515flag.)
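
For example, a sketch of an in-place 3d complex DFT with transposed
output, assuming the dimensions @code{n0}, @code{n1}, and @code{n2} are
defined elsewhere, might look like:

@example
fftw_complex *data;
fftw_plan plan;
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;

alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

/* @r{input: local_n0 x n1 x n2, starting at row local_0_start} */
plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
fftw_execute(plan);
/* @r{output: local_n1 x n0 x n2, starting at row local_1_start of the
   transposed n1 x n0 x n2 array} */
@end example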
516
517@node One-dimensional distributions,  , Transposed distributions, MPI Data Distribution
518@subsection One-dimensional distributions
519
520For one-dimensional distributed DFTs using FFTW, matters are slightly
521more complicated because the data distribution is more closely tied to
522how the algorithm works.  In particular, you can no longer pass an
523arbitrary block size and must accept FFTW's default; also, the block
sizes may be different for input and output.  Moreover, the data
525distribution depends on the flags and transform direction, in order
526for forward and backward transforms to work correctly.
527
528@example
529ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
530                int sign, unsigned flags,
531                ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
532                ptrdiff_t *local_no, ptrdiff_t *local_o_start);
533@end example
534@findex fftw_mpi_local_size_1d
535
536This function computes the data distribution for a 1d transform of
537size @code{n0} with the given transform @code{sign} and @code{flags}.
538Both input and output data use block distributions.  The input on the
539current process will consist of @code{local_ni} numbers starting at
540index @code{local_i_start}; e.g. if only a single process is used,
541then @code{local_ni} will be @code{n0} and @code{local_i_start} will
542be @code{0}.  Similarly for the output, with @code{local_no} numbers
543starting at index @code{local_o_start}.  The return value of
544@code{fftw_mpi_local_size_1d} will be the total number of elements to
545allocate on the current process (which might be slightly larger than
546the local size due to intermediate steps in the algorithm).
547
548As mentioned above (@pxref{Load balancing}), the data will be divided
549equally among the processes if @code{n0} is divisible by the
550@emph{square} of the number of processes.  In this case,
551@code{local_ni} will equal @code{local_no}.  Otherwise, they may be
552different.
553
554For some applications, such as convolutions, the order of the output
555data is irrelevant.  In this case, performance can be improved by
556specifying that the output data be stored in an FFTW-defined
557``scrambled'' format.  (In particular, this is the analogue of
558transposed output in the multidimensional case: scrambled output saves
559a communications step.)  If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
560the flags, then the output is stored in this (undocumented) scrambled
561order.  Conversely, to perform the inverse transform of data in
562scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
563@ctindex FFTW_MPI_SCRAMBLED_OUT
564@ctindex FFTW_MPI_SCRAMBLED_IN
565
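For instance, a sketch of a forward/backward pair for a convolution-style
computation, where the intermediate ordering is irrelevant (assuming
@code{n0} and the pointwise multiplication by a kernel are supplied
elsewhere), might look like:

@example
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *data;
fftw_plan fwd, bwd;

alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MPI_SCRAMBLED_OUT,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
data = fftw_alloc_complex(alloc_local);

fwd = fftw_mpi_plan_dft_1d(n0, data, data, MPI_COMM_WORLD, FFTW_FORWARD,
                           FFTW_MEASURE | FFTW_MPI_SCRAMBLED_OUT);
bwd = fftw_mpi_plan_dft_1d(n0, data, data, MPI_COMM_WORLD, FFTW_BACKWARD,
                           FFTW_MEASURE | FFTW_MPI_SCRAMBLED_IN);

fftw_execute(fwd);  /* @r{output in scrambled order} */
/* @r{... multiply pointwise by a kernel transformed the same way ...} */
fftw_execute(bwd);  /* @r{output back in normal order, scaled by n0} */
@end example
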
566
567In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
568have not yet implemented a parallel algorithm for large prime sizes.
569
570@c ------------------------------------------------------------
571@node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
572@section Multi-dimensional MPI DFTs of Real Data
573
574FFTW's MPI interface also supports multi-dimensional DFTs of real
575data, similar to the serial r2c and c2r interfaces.  (Parallel
576one-dimensional real-data DFTs are not currently supported; you must
577use a complex transform and set the imaginary parts of the inputs to
578zero.)
579
580The key points to understand for r2c and c2r MPI transforms (compared
to the MPI complex DFTs or the serial r2c/c2r transforms) are:
582
583@itemize @bullet
584
585@item
586Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
587data to/from @ndimshalf{} complex data: the last dimension of the
588complex data is cut in half (rounded down), plus one.  As for the
589serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
590@samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.
591
592@item
593@cindex padding
594Although the real data is @emph{conceptually} @ndims{}, it is
595@emph{physically} stored as an @ndimspad{} array, where the last
596dimension has been @emph{padded} to make it the same size as the
597complex output.  This is much like the in-place serial r2c/c2r
598interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
599in MPI the padding is required even for out-of-place data.  The extra
600padding numbers are ignored by FFTW (they are @emph{not} like
601zero-padding the transform to a larger size); they are only used to
602determine the data layout.
603
604@item
605@cindex data distribution
606The data distribution in MPI for @emph{both} the real and complex data
607is determined by the shape of the @emph{complex} data.  That is, you
608call the appropriate @samp{local size} function for the @ndimshalf{}
609complex data, and then use the @emph{same} distribution for the real
610data except that the last complex dimension is replaced by a (padded)
611real dimension of twice the length.
612
613@end itemize
614
For example, suppose we are performing an out-of-place r2c transform of
616@threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
617resulting in @threedims{L,M,N/2+1} complex data.  Similar to the
618example in @ref{2d MPI example}, we might do something like:
619
620@example
621#include <fftw3-mpi.h>
622
623int main(int argc, char **argv)
624@{
625    const ptrdiff_t L = ..., M = ..., N = ...;
626    fftw_plan plan;
627    double *rin;
628    fftw_complex *cout;
629    ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
630
631    MPI_Init(&argc, &argv);
632    fftw_mpi_init();
633
634    /* @r{get local data size and allocate} */
635    alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
636                                         &local_n0, &local_0_start);
637    rin = fftw_alloc_real(2 * alloc_local);
638    cout = fftw_alloc_complex(alloc_local);
639
640    /* @r{create plan for out-of-place r2c DFT} */
641    plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
642                                    FFTW_MEASURE);
643
644    /* @r{initialize rin to some function} my_func(x,y,z) */
645    for (i = 0; i < local_n0; ++i)
646       for (j = 0; j < M; ++j)
647         for (k = 0; k < N; ++k)
648       rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
649
650    /* @r{compute transforms as many times as desired} */
651    fftw_execute(plan);
652
653    fftw_destroy_plan(plan);
654
655    MPI_Finalize();
656@}
657@end example
658
659@findex fftw_alloc_real
660@cindex row-major
661Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
662argument of @code{2 * alloc_local}: since @code{alloc_local} is the
663number of @emph{complex} values to allocate, the number of @emph{real}
664values is twice as many.  The @code{rin} array is then
665@threedims{local_n0,M,2(N/2+1)} in row-major order, so its
666@code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
k} (@pxref{Multi-dimensional Array Format}).
668
669@cindex transpose
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
672As for the complex transforms, improved performance can be obtained by
673specifying that the output is the transpose of the input or vice versa
674(@pxref{Transposed distributions}).  In our @threedims{L,M,N} r2c
example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
676the input would be a padded @threedims{L,M,2(N/2+1)} real array
677distributed over the @code{L} dimension, while the output would be a
678@threedims{M,L,N/2+1} complex array distributed over the @code{M}
679dimension.  To perform the inverse c2r transform with the same data
distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.
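
A sketch of the modified allocation and planning for this transposed r2c
case (reusing the variable names from the example above) might look like:

@example
ptrdiff_t local_n1, local_1_start;

/* @r{complex data: L x M x (N/2+1), output distributed over the M dimension} */
alloc_local = fftw_mpi_local_size_3d_transposed(L, M, N/2+1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
rin = fftw_alloc_real(2 * alloc_local);
cout = fftw_alloc_complex(alloc_local);

/* @r{input: local_n0 x M x 2(N/2+1) padded reals;
   output: local_n1 x L x (N/2+1) complex values} */
plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
@end example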
681
682@c ------------------------------------------------------------
683@node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
@section Other Multi-dimensional Real-data MPI Transforms
685
686@cindex r2r
687FFTW's MPI interface also supports multi-dimensional @samp{r2r}
688transforms of all kinds supported by the serial interface
689(e.g. discrete cosine and sine transforms, discrete Hartley
690transforms, etc.).  Only multi-dimensional @samp{r2r} transforms, not
691one-dimensional transforms, are currently parallelized.
692
693@tindex fftw_r2r_kind
694These are used much like the multidimensional complex DFTs discussed
695above, except that the data is real rather than complex, and one needs
696to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
697dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).
698
For example, one might perform a two-dimensional @twodims{L,M} r2r transform that is
700an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in
701the second dimension with code like:
702
703@example
704    const ptrdiff_t L = ..., M = ...;
705    fftw_plan plan;
706    double *data;
707    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
708
709    /* @r{get local data size and allocate} */
710    alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
711                                         &local_n0, &local_0_start);
712    data = fftw_alloc_real(alloc_local);
713
714    /* @r{create plan for in-place REDFT10 x RODFT10} */
715    plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
716                                FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
717
718    /* @r{initialize data to some function} my_function(x,y) */
719    for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
720       data[i*M + j] = my_function(local_0_start + i, j);
721
722    /* @r{compute transforms, in-place, as many times as desired} */
723    fftw_execute(plan);
724
725    fftw_destroy_plan(plan);
726@end example
727
728@findex fftw_alloc_real
729Notice that we use the same @samp{local_size} functions as we did for
730complex data, only now we interpret the sizes in terms of real rather
731than complex values, and correspondingly use @code{fftw_alloc_real}.
732
733@c ------------------------------------------------------------
734@node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
735@section FFTW MPI Transposes
736@cindex transpose
737
FFTW's MPI Fourier transforms rely on one or more @emph{global
transposition} steps for their communications.  For example, the
740multidimensional transforms work by transforming along some
741dimensions, then transposing to make the first dimension local and
742transforming that, then transposing back.  Because global
743transposition of a block-distributed matrix has many other potential
744uses besides FFTs, FFTW's transpose routines can be called directly,
745as documented in this section.
746
747@menu
748* Basic distributed-transpose interface::
749* Advanced distributed-transpose interface::
750* An improved replacement for MPI_Alltoall::
751@end menu
752
753@node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
754@subsection Basic distributed-transpose interface
755
756In particular, suppose that we have an @code{n0} by @code{n1} array in
757row-major order, block-distributed across the @code{n0} dimension.  To
758transpose this into an @code{n1} by @code{n0} array block-distributed
759across the @code{n1} dimension, we would create a plan by calling the
760following function:
761
762@example
763fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
764                                  double *in, double *out,
765                                  MPI_Comm comm, unsigned flags);
766@end example
767@findex fftw_mpi_plan_transpose
768
769The input and output arrays (@code{in} and @code{out}) can be the
770same.  The transpose is actually executed by calling
771@code{fftw_execute} on the plan, as usual.
772@findex fftw_execute
773
774
775The @code{flags} are the usual FFTW planner flags, but support
776two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
777@code{FFTW_MPI_TRANSPOSED_IN}.  What these flags indicate, for
778transpose plans, is that the output and/or input, respectively, are
779@emph{locally} transposed.  That is, on each process input data is
780normally stored as a @code{local_n0} by @code{n1} array in row-major
781order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
782stored as @code{n1} by @code{local_n0} in row-major order.  Similarly,
783@code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
784@code{local_n1} instead of @code{local_n1} by @code{n0}.
785@ctindex FFTW_MPI_TRANSPOSED_OUT
786@ctindex FFTW_MPI_TRANSPOSED_IN
787
788
789To determine the local size of the array on each process before and
790after the transpose, as well as the amount of storage that must be
791allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
792just as for a 2d DFT as described in the previous section:
793@cindex data distribution
794
795@example
796ptrdiff_t fftw_mpi_local_size_2d_transposed
797                (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
798                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
799                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
800@end example
801@findex fftw_mpi_local_size_2d_transposed
802
803Again, the return value is the local storage to allocate, which in
804this case is the number of @emph{real} (@code{double}) values rather
805than complex numbers as in the previous examples.
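
Putting these together, a sketch of an in-place transpose of an
@code{n0} by @code{n1} array of @code{double} values (with @code{n0} and
@code{n1} defined elsewhere) might look like:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan tplan;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

/* @r{before: local_n0 x n1 block of the n0 x n1 array} */
tplan = fftw_mpi_plan_transpose(n0, n1, data, data, MPI_COMM_WORLD,
                                FFTW_MEASURE);
fftw_execute(tplan);
/* @r{after: local_n1 x n0 block of the transposed n1 x n0 array} */
fftw_destroy_plan(tplan);
@end example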
806
807@node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
808@subsection Advanced distributed-transpose interface
809
810The above routines are for a transpose of a matrix of numbers (of type
811@code{double}), using FFTW's default block sizes.  More generally, one
812can perform transposes of @emph{tuples} of numbers, with
813user-specified block sizes for the input and output:
814
815@example
816fftw_plan fftw_mpi_plan_many_transpose
817                (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
818                 ptrdiff_t block0, ptrdiff_t block1,
819                 double *in, double *out, MPI_Comm comm, unsigned flags);
820@end example
821@findex fftw_mpi_plan_many_transpose
822
823In this case, one is transposing an @code{n0} by @code{n1} matrix of
824@code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
825The input is distributed along the @code{n0} dimension with block size
826@code{block0}, and the @code{n1} by @code{n0} output is distributed
827along the @code{n1} dimension with block size @code{block1}.  If
828@code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
829uses its default block size.  To get the local size of the data on
830each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
831@ctindex FFTW_MPI_DEFAULT_BLOCK
832@findex fftw_mpi_local_size_many_transposed
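
For example, a sketch of an in-place transpose of an @code{n0} by
@code{n1} matrix of complex numbers stored as pairs of @code{double}s
(@code{howmany = 2}), using the default block sizes, might look like:

@example
ptrdiff_t n[2], alloc_local;
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan tplan;

n[0] = n0; n[1] = n1;
alloc_local = fftw_mpi_local_size_many_transposed(
                  2, n, 2, FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                  MPI_COMM_WORLD, &local_n0, &local_0_start,
                  &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

tplan = fftw_mpi_plan_many_transpose(n0, n1, 2,
                                     FFTW_MPI_DEFAULT_BLOCK,
                                     FFTW_MPI_DEFAULT_BLOCK,
                                     data, data, MPI_COMM_WORLD, FFTW_MEASURE);
@end example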
833
834@node An improved replacement for MPI_Alltoall,  , Advanced distributed-transpose interface, FFTW MPI Transposes
835@subsection An improved replacement for MPI_Alltoall
836
837We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the @code{MPI_Alltoall} function
839(albeit only for floating-point types), and in some circumstances can
840function as an improved replacement.
841@findex MPI_Alltoall
842
843
844@code{MPI_Alltoall} is defined by the MPI standard as:
845
846@example
847int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
848                 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
849                 MPI_Comm comm);
850@end example
851
852In particular, for @code{double*} arrays @code{in} and @code{out},
853consider the call:
854
855@example
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
857@end example
858
859This is completely equivalent to:
860
861@example
862MPI_Comm_size(comm, &P);
863plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
864fftw_execute(plan);
865fftw_destroy_plan(plan);
866@end example
867
868That is, computing a @twodims{P,P} transpose on @code{P} processes,
869with a block size of 1, is just a standard all-to-all communication.
870
871However, using the FFTW routine instead of @code{MPI_Alltoall} may
872have certain advantages.  First of all, FFTW's routine can operate
873in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
874operate out-of-place.
875@cindex in-place
876
877
878Second, even for out-of-place plans, FFTW's routine may be faster,
879especially if you need to perform the all-to-all communication many
880times and can afford to use @code{FFTW_MEASURE} or
881@code{FFTW_PATIENT}.  It should certainly be no slower, not including
882the time to create the plan, since one of the possible algorithms that
883FFTW uses for an out-of-place transpose @emph{is} simply to call
884@code{MPI_Alltoall}.  However, FFTW also considers several other
885possible algorithms that, depending on your MPI implementation and
886your hardware, may be faster.
887@ctindex FFTW_MEASURE
888@ctindex FFTW_PATIENT
889
890@c ------------------------------------------------------------
891@node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
892@section FFTW MPI Wisdom
893@cindex wisdom
894@cindex saving plans to disk
895
896FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
897be used to save MPI plans as well as to save uniprocessor plans.
898However, for MPI there are several unavoidable complications.
899
900@cindex MPI I/O
901First, the MPI standard does not guarantee that every process can
902perform file I/O (at least, not using C stdio routines)---in general,
903we may only assume that process 0 is capable of I/O.@footnote{In fact,
904even this assumption is not technically guaranteed by the standard,
905although it seems to be universal in actual MPI implementations and is
906widely assumed by MPI-using software.  Technically, you need to query
907the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
908@code{MPI_Attr_get}.  If this attribute is @code{MPI_PROC_NULL}, no
909I/O is possible.  If it is @code{MPI_ANY_SOURCE}, any process can
910perform I/O.  Otherwise, it is the rank of a process that can perform
911I/O ... but since it is not guaranteed to yield the @emph{same} rank
912on all processes, you have to do an @code{MPI_Allreduce} of some kind
913if you want all processes to agree about which is going to do I/O.
914And even then, the standard only guarantees that this process can
915perform output, but not input. See e.g. @cite{Parallel Programming
916with MPI} by P. S. Pacheco, section 8.1.3.  Needless to say, in our
917experience virtually no MPI programmers worry about this.} So, if we
918want to export the wisdom from a single process to a file, we must
919first export the wisdom to a string, then send it to process 0, then
920write it to a file.
921
922Second, in principle we may want to have separate wisdom for every
923process, since in general the processes may run on different hardware
924even for a single MPI program.  However, in practice FFTW's MPI code
925is designed for the case of homogeneous hardware (@pxref{Load
926balancing}), and in this case it is convenient to use the same wisdom
927for every process.  Thus, we need a mechanism to synchronize the wisdom.
928
929To address both of these problems, FFTW provides the following two
930functions:
931
932@example
933void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
934void fftw_mpi_gather_wisdom(MPI_Comm comm);
935@end example
936@findex fftw_mpi_gather_wisdom
937@findex fftw_mpi_broadcast_wisdom
938
939Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
940will broadcast the wisdom from process 0 to all other processes.
941Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
942processes onto process 0.  (If the plans created for the same problem
943by different processes are not the same, @code{fftw_mpi_gather_wisdom}
944will arbitrarily choose one of the plans.)  Both of these functions
945may result in suboptimal plans for different processes if the
946processes are running on non-identical hardware.  Both of these
947functions are @emph{collective} calls, which means that they must be
948executed by all processes in the communicator.
949@cindex collective function
950
951
952So, for example, a typical code snippet to import wisdom from a file
953and use it on all processes would be:
954
955@example
956@{
957    int rank;
958
959    fftw_mpi_init();
960    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
961    if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
962    fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
963@}
964@end example
965
966(Note that we must call @code{fftw_mpi_init} before importing any
967wisdom that might contain MPI plans.)  Similarly, a typical code
968snippet to export wisdom from all processes to a file is:
969@findex fftw_mpi_init
970
971@example
972@{
973    int rank;
974
975    fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
976    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
977    if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
978@}
979@end example
980
981@c ------------------------------------------------------------
982@node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
983@section Avoiding MPI Deadlocks
984@cindex deadlock
985
986An MPI program can @emph{deadlock} if one process is waiting for a
987message from another process that never gets sent.  To avoid deadlocks
988when using FFTW's MPI routines, it is important to know which
989functions are @emph{collective}: that is, which functions must
990@emph{always} be called in the @emph{same order} from @emph{every}
991process in a given communicator.  (For example, @code{MPI_Barrier} is
992the canonical example of a collective function in the MPI standard.)
993@cindex collective function
994@findex MPI_Barrier
995
996
997The functions in FFTW that are @emph{always} collective are: every
998function beginning with @samp{fftw_mpi_plan}, as well as
999@code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
1000Also, the following functions from the ordinary FFTW interface are
1001collective when they are applied to a plan created by an
1002@samp{fftw_mpi_plan} function: @code{fftw_execute},
1003@code{fftw_destroy_plan}, and @code{fftw_flops}.
1004@findex fftw_execute
1005@findex fftw_destroy_plan
1006@findex fftw_flops
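
For example, a sketch of the kind of code to avoid, assuming @code{rank}
and an MPI plan @code{plan} as in the earlier examples:

@example
/* @r{WRONG: deadlocks, because the collective execute is called on
   only one process of the plan's communicator:} */
if (rank == 0)
     fftw_execute(plan);

/* @r{RIGHT: every process in the communicator executes the plan:} */
fftw_execute(plan);
@end example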
1007
1008@c ------------------------------------------------------------
1009@node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI
1010@section FFTW MPI Performance Tips
1011
1012In this section, we collect a few tips on getting the best performance
1013out of FFTW's MPI transforms.
1014
1015First, because of the 1d block distribution, FFTW's parallelization is
1016currently limited by the size of the first dimension.
1017(Multidimensional block distributions may be supported by a future
1018version.) More generally, you should ideally arrange the dimensions so
1019that FFTW can divide them equally among the processes. @xref{Load
1020balancing}.
1021@cindex block distribution
1022@cindex load balancing
1023
1024
1025Second, if it is not too inconvenient, you should consider working
1026with transposed output for multidimensional plans, as this saves a
1027considerable amount of communications.  @xref{Transposed distributions}.
1028@cindex transpose
1029
1030
1031Third, the fastest choices are generally either an in-place transform
1032or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag
1033(which allows the input array to be used as scratch space).  In-place
1034is especially beneficial if the amount of data per process is large.
1035@ctindex FFTW_DESTROY_INPUT
1036
1037
1038Fourth, if you have multiple arrays to transform at once, rather than
1039calling FFTW's MPI transforms several times it usually seems to be
1040faster to interleave the data and use the advanced interface.  (This
1041groups the communications together instead of requiring separate
1042messages for each transform.)
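
For instance, a sketch of transforming two @code{N0} by @code{N1}
complex arrays at once, interleaved element-by-element in a single
buffer and planned with the advanced @code{fftw_mpi_plan_many_dft}
interface (default block sizes; @code{N0} and @code{N1} are assumed to
be defined as in the 2d example), might look like:

@example
const ptrdiff_t n[2] = @{N0, N1@};
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;
fftw_plan plan;

/* @r{element (i,j) of transform t is data[2*(i*N1 + j) + t]} */
alloc_local = fftw_mpi_local_size_many(2, n, 2, FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_many_dft(2, n, 2,
                              FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                              data, data, MPI_COMM_WORLD,
                              FFTW_FORWARD, FFTW_MEASURE);
@end example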
1043
1044@c ------------------------------------------------------------
1045@node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI
1046@section Combining MPI and Threads
1047@cindex threads
1048
1049In certain cases, it may be advantageous to combine MPI
1050(distributed-memory) and threads (shared-memory) parallelization.
1051FFTW supports this, with certain caveats.  For example, if you have a
1052cluster of 4-processor shared-memory nodes, you may want to use
1053threads within the nodes and MPI between the nodes, instead of MPI for
1054all parallelization.
1055
1056In particular, it is possible to seamlessly combine the MPI FFTW
1057routines with the multi-threaded FFTW routines (@pxref{Multi-threaded
1058FFTW}). However, some care must be taken in the initialization code,
1059which should look something like this:
1060
1061@example
1062int threads_ok;
1063
1064int main(int argc, char **argv)
1065@{
1066    int provided;
1067    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
1068    threads_ok = provided >= MPI_THREAD_FUNNELED;
1069
1070    if (threads_ok) threads_ok = fftw_init_threads();
1071    fftw_mpi_init();
1072
1073    ...
1074    if (threads_ok) fftw_plan_with_nthreads(...);
1075    ...
1076
1077    MPI_Finalize();
1078@}
1079@end example
1080@findex fftw_mpi_init
1081@findex fftw_init_threads
1082@findex fftw_plan_with_nthreads
1083
1084First, note that instead of calling @code{MPI_Init}, you should call
@code{MPI_Init_thread}, which is the initialization routine defined
1086by the MPI-2 standard to indicate to MPI that your program will be
1087multithreaded.  We pass @code{MPI_THREAD_FUNNELED}, which indicates
1088that we will only call MPI routines from the main thread.  (FFTW will
1089launch additional threads internally, but the extra threads will not
1090call MPI code.)  (You may also pass @code{MPI_THREAD_SERIALIZED} or
1091@code{MPI_THREAD_MULTIPLE}, which requests additional multithreading
1092support from the MPI implementation, but this is not required by
1093FFTW.)  The @code{provided} parameter returns what level of threads
1094support is actually supported by your MPI implementation; this
1095@emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call
1096the FFTW threads routines, so we define a global variable
1097@code{threads_ok} to record this.  You should only call
1098@code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if
1099@code{threads_ok} is true.  For more information on thread safety in
1100MPI, see the
1101@uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and
1102Threads} section of the MPI-2 standard.
1103@cindex thread safety
1104
1105
1106Second, we must call @code{fftw_init_threads} @emph{before}
1107@code{fftw_mpi_init}.  This is critical for technical reasons having
1108to do with how FFTW initializes its list of algorithms.
1109
1110Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI
1111process will launch (up to) @code{N} threads to parallelize its transforms.
1112
1113For example, in the hypothetical cluster of 4-processor nodes, you
1114might wish to launch only a single MPI process per node, and then call
1115@code{fftw_plan_with_nthreads(4)} on each process to use all
1116processors in the nodes.
1117
1118This may or may not be faster than simply using as many MPI processes
1119as you have processors, however.  On the one hand, using threads
1120within a node eliminates the need for explicit message passing within
1121the node.  On the other hand, FFTW's transpose routines are not
1122multi-threaded, and this means that the communications that do take
1123place will not benefit from parallelization within the node.
1124Moreover, many MPI implementations already have optimizations to
1125exploit shared memory when it is available, so adding the
1126multithreaded FFTW on top of this may be superfluous.
1127@cindex transpose
1128
1129@c ------------------------------------------------------------
1130@node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI
1131@section FFTW MPI Reference
1132
This section provides a complete reference to all FFTW MPI functions,
1134datatypes, and constants.  See also @ref{FFTW Reference} for information
1135on functions and types in common with the serial interface.
1136
1137@menu
1138* MPI Files and Data Types::
1139* MPI Initialization::
1140* Using MPI Plans::
1141* MPI Data Distribution Functions::
1142* MPI Plan Creation::
1143* MPI Wisdom Communication::
1144@end menu
1145
1146@node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference
1147@subsection MPI Files and Data Types
1148
1149All programs using FFTW's MPI support should include its header file:
1150
1151@example
1152#include <fftw3-mpi.h>
1153@end example
1154
1155Note that this header file includes the serial-FFTW @code{fftw3.h}
1156header file, and also the @code{mpi.h} header file for MPI, so you
1157need not include those files separately.
1158
1159You must also link to @emph{both} the FFTW MPI library and to the
1160serial FFTW library.  On Unix, this means adding @code{-lfftw3_mpi
1161-lfftw3 -lm} at the end of the link command.
1162
1163@cindex precision
1164Different precisions are handled as in the serial interface:
1165@xref{Precision}.  That is, @samp{fftw_} functions become
1166@code{fftwf_} (in single precision) etcetera, and the libraries become
1167@code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix.  Long-double
1168precision is supported in MPI, but quad precision (@samp{fftwq_}) is
1169not due to the lack of MPI support for this type.
1170
1171@node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference
1172@subsection MPI Initialization
1173
1174Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and
1175before importing any wisdom for MPI problems, you must call:
1176
1177@findex fftw_mpi_init
1178@example
1179void fftw_mpi_init(void);
1180@end example
1181
1182@findex fftw_init_threads
1183If FFTW threads support is used, however, @code{fftw_mpi_init} should
1184be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI
1185and Threads}).  Calling @code{fftw_mpi_init} additional times (before
1186@code{fftw_mpi_cleanup}) has no effect.
1187
1188
1189If you want to deallocate all persistent data and reset FFTW to the
1190pristine state it was in when you started your program, you can call:
1191
1192@findex fftw_mpi_cleanup
1193@example
1194void fftw_mpi_cleanup(void);
1195@end example
1196
1197@findex fftw_cleanup
1198(This calls @code{fftw_cleanup}, so you need not call the serial
1199cleanup routine too, although it is safe to do so.)  After calling
1200@code{fftw_mpi_cleanup}, all existing plans become undefined, and you
1201should not attempt to execute or destroy them.  You must call
1202@code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want
1203to resume using the MPI FFTW routines.
1204
1205@node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference
1206@subsection Using MPI Plans
1207
1208Once an MPI plan is created, you can execute and destroy it using
1209@code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions
1210in the serial interface that operate on generic plans (@pxref{Using
1211Plans}).
1212
1213@cindex collective function
1214@cindex MPI communicator
1215The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to
1216MPI plans, are @emph{collective} calls: they must be called for all processes
1217in the communicator that was used to create the plan.
1218
1219@cindex new-array execution
1220You must @emph{not} use the serial new-array plan-execution functions
1221@code{fftw_execute_dft} and so on (@pxref{New-array Execute
1222Functions}) with MPI plans.  Such functions are specialized to the
1223problem type, and there are specific new-array execute functions for MPI plans:
1224
1225@findex fftw_mpi_execute_dft
1226@findex fftw_mpi_execute_dft_r2c
1227@findex fftw_mpi_execute_dft_c2r
1228@findex fftw_mpi_execute_r2r
1229@example
1230void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
1231void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
1232void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
1233void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
1234@end example
1235
1236@cindex alignment
1237@findex fftw_malloc
1238These functions have the same restrictions as those of the serial
1239new-array execute functions.  They are @emph{always} safe to apply to
1240the @emph{same} @code{in} and @code{out} arrays that were used to
create the plan.  They can only be applied to new arrays if those
1242arrays have the same types, dimensions, in-placeness, and alignment as
1243the original arrays, where the best way to ensure the same alignment
1244is to use FFTW's @code{fftw_malloc} and related allocation functions
1245for all arrays (@pxref{Memory Allocation}).  Note that distributed
1246transposes (@pxref{FFTW MPI Transposes}) use
1247@code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans
1248from FFTW's perspective.
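
For example, assuming an out-of-place complex-DFT plan @code{plan} and
the @code{alloc_local} value returned by the corresponding
@samp{local_size} call (a sketch), the plan can be re-executed on a
freshly allocated pair of arrays as follows:

@example
fftw_complex *in2, *out2;

/* use fftw_alloc_complex so that the alignment matches the
   arrays used to create the plan */
in2 = fftw_alloc_complex(alloc_local);
out2 = fftw_alloc_complex(alloc_local);

/* ... fill in2 with the local portion of the new input ... */

fftw_mpi_execute_dft(plan, in2, out2);  /* collective call */

fftw_free(in2);
fftw_free(out2);
@end example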
1249
1250@node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
1251@subsection MPI Data Distribution Functions
1252
1253@cindex data distribution
1254As described above (@pxref{MPI Data Distribution}), in order to
1255allocate your arrays, @emph{before} creating a plan, you must first
1256call one of the following routines to determine the required
1257allocation size and the portion of the array locally stored on a given
1258process.  The @code{MPI_Comm} communicator passed here must be
1259equivalent to the communicator used below for plan creation.
1260
1261The basic interface for multidimensional transforms consists of the
1262functions:
1263
1264@findex fftw_mpi_local_size_2d
1265@findex fftw_mpi_local_size_3d
1266@findex fftw_mpi_local_size
1267@findex fftw_mpi_local_size_2d_transposed
1268@findex fftw_mpi_local_size_3d_transposed
1269@findex fftw_mpi_local_size_transposed
1270@example
1271ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
1272                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1273ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1274                                 MPI_Comm comm,
1275                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1276ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
1277                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1278
1279ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
1280                                            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1281                                            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1282ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1283                                            MPI_Comm comm,
1284                                            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1285                                            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1286ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
1287                                         ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1288                                         ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1289@end example
1290
1291These functions return the number of elements to allocate (complex
1292numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas
@code{local_n0} and @code{local_0_start} return the portion
1294(@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the
1295first dimension of an @ndims{} array that is stored on the local
1296process.  @xref{Basic and advanced distribution interfaces}.  For
1297@code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants
1298are useful in order to also return the local portion of the first
1299dimension in the @ndimstrans{} transposed output.
1300@xref{Transposed distributions}.
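
For example, a complex-DFT caller (a sketch, with hypothetical
dimensions @code{N0}, @code{N1}, and @code{N2}) might allocate its
local block as follows:

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;

alloc_local = fftw_mpi_local_size_3d(N0, N1, N2, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);

/* this process stores planes local_0_start ... local_0_start+local_n0-1
   of the first dimension, each an N1 x N2 row-major slab; alloc_local
   may exceed local_n0*N1*N2 to leave room for intermediate steps */
@end example
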
1301The advanced interface for multidimensional transforms is:
1302
1303@cindex advanced interface
1304@findex fftw_mpi_local_size_many
1305@findex fftw_mpi_local_size_many_transposed
1306@example
1307ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1308                                   ptrdiff_t block0, MPI_Comm comm,
1309                                   ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1310ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1311                                              ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
1312                                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1313                                              ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1314@end example
1315
1316These differ from the basic interface in only two ways.  First, they
1317allow you to specify block sizes @code{block0} and @code{block1} (the
1318latter for the transposed output); you can pass
1319@code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in
1320the basic interface.  Second, you can pass a @code{howmany} parameter,
1321corresponding to the advanced planning interface below: this is for
1322transforms of contiguous @code{howmany}-tuples of numbers
1323(@code{howmany = 1} in the basic interface).
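
For instance, to transform contiguous 3-tuples of complex numbers
(say, the components of a vector field; a sketch with hypothetical
dimensions @code{N0} and @code{N1}), one might call:

@example
ptrdiff_t n[2], alloc_local, local_n0, local_0_start;
fftw_complex *data;

n[0] = N0; n[1] = N1;
alloc_local = fftw_mpi_local_size_many(2, n, 3, /* howmany = 3 */
                                       FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);
/* component c of local element (i,j) is data[(i*N1 + j)*3 + c] */
@end example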
1324
1325The corresponding basic and advanced routines for one-dimensional
1326transforms (currently only complex DFTs) are:
1327
1328@findex fftw_mpi_local_size_1d
1329@findex fftw_mpi_local_size_many_1d
1330@example
1331ptrdiff_t fftw_mpi_local_size_1d(
1332             ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
1333             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
1334             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
1335ptrdiff_t fftw_mpi_local_size_many_1d(
1336             ptrdiff_t n0, ptrdiff_t howmany,
1337             MPI_Comm comm, int sign, unsigned flags,
1338             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
1339             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
1340@end example
1341
1342@ctindex FFTW_MPI_SCRAMBLED_OUT
1343@ctindex FFTW_MPI_SCRAMBLED_IN
1344As above, the return value is the number of elements to allocate
1345(complex numbers, for complex DFTs).  The @code{local_ni} and
1346@code{local_i_start} arguments return the portion
1347(@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the
13481d array that is stored on this process for the transform
1349@emph{input}, and @code{local_no} and @code{local_o_start} are the
corresponding quantities for the transform @emph{output}.  The @code{sign}
1351(@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must
1352match the arguments passed when creating a plan.  Although the inputs
1353and outputs have different data distributions in general, it is
1354guaranteed that the @emph{output} data distribution of an
1355@code{FFTW_FORWARD} plan will match the @emph{input} data distribution
1356of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the
1357@code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags.
1358@xref{One-dimensional distributions}.
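
For example, a forward 1d transform of a hypothetical length @code{N}
(a sketch) obtains its input and output distributions and then creates
a plan with the @emph{same} sign and flags:

@example
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *in, *out;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_1d(N, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MEASURE,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
in = fftw_alloc_complex(alloc_local);
out = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_1d(N, in, out, MPI_COMM_WORLD,
                            FFTW_FORWARD, FFTW_MEASURE);
@end example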
1359
1360@node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
1361@subsection MPI Plan Creation
1362
1363@subsubheading Complex-data MPI DFTs
1364
1365Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:
1366
1367@findex fftw_mpi_plan_dft_1d
1368@findex fftw_mpi_plan_dft_2d
1369@findex fftw_mpi_plan_dft_3d
1370@findex fftw_mpi_plan_dft
1371@findex fftw_mpi_plan_many_dft
1372@example
1373fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
1374                               MPI_Comm comm, int sign, unsigned flags);
1375fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
1376                               fftw_complex *in, fftw_complex *out,
1377                               MPI_Comm comm, int sign, unsigned flags);
1378fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1379                               fftw_complex *in, fftw_complex *out,
1380                               MPI_Comm comm, int sign, unsigned flags);
1381fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
1382                            fftw_complex *in, fftw_complex *out,
1383                            MPI_Comm comm, int sign, unsigned flags);
1384fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
1385                                 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
1386                                 fftw_complex *in, fftw_complex *out,
1387                                 MPI_Comm comm, int sign, unsigned flags);
1388@end example
1389
1390@cindex MPI communicator
1391@cindex collective function
1392These are similar to their serial counterparts (@pxref{Complex DFTs})
1393in specifying the dimensions, sign, and flags of the transform.  The
1394@code{comm} argument gives an MPI communicator that specifies the set
1395of processes to participate in the transform; plan creation is a
1396collective function that must be called for all processes in the
1397communicator.  The @code{in} and @code{out} pointers refer only to a
1398portion of the overall transform data (@pxref{MPI Data Distribution})
1399as specified by the @samp{local_size} functions in the previous
1400section.  Unless @code{flags} contains @code{FFTW_ESTIMATE}, these
1401arrays are overwritten during plan creation as for the serial
1402interface.  For multi-dimensional transforms, any dimensions @code{>
14031} are supported; for one-dimensional transforms, only composite
1404(non-prime) @code{n0} are currently supported (unlike the serial
1405FFTW).  Requesting an unsupported transform size will yield a
1406@code{NULL} plan.  (As in the serial interface, highly composite sizes
1407generally yield the best performance.)
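
For example, an out-of-place 3d transform (a sketch with hypothetical
dimensions @code{N0}, @code{N1}, and @code{N2}; the allocation repeats
the @samp{local_size} step of the previous section) might look like:

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *in, *out;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_3d(N0, N1, N2, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
in = fftw_alloc_complex(alloc_local);
out = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_3d(N0, N1, N2, in, out, MPI_COMM_WORLD,
                            FFTW_FORWARD, FFTW_MEASURE);
/* ... initialize the local_n0 x N1 x N2 block of in ... */
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example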
1408
1409@cindex advanced interface
1410@ctindex FFTW_MPI_DEFAULT_BLOCK
1411@cindex stride
1412The advanced-interface @code{fftw_mpi_plan_many_dft} additionally
1413allows you to specify the block sizes for the first dimension
1414(@code{block}) of the @ndims{} input data and the first dimension
1415(@code{tblock}) of the @ndimstrans{} transposed data (at intermediate
1416steps of the transform, and for the output if
@code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}).  These must
1418be the same block sizes as were passed to the corresponding
1419@samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK}
1420to use FFTW's default block size as in the basic interface.  Also, the
1421@code{howmany} parameter specifies that the transform is of contiguous
1422@code{howmany}-tuples rather than individual complex numbers; this
1423corresponds to the same parameter in the serial advanced interface
1424(@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and
1425@code{dist = 1}.
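
Continuing the hypothetical 3-tuple sketch from the @samp{local_size}
section above, the corresponding plan would be created with the same
@code{howmany} and block arguments:

@example
fftw_plan plan;

plan = fftw_mpi_plan_many_dft(2, n, 3, /* howmany = 3, as in local_size_many */
                              FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                              data, data, MPI_COMM_WORLD,
                              FFTW_FORWARD, FFTW_MEASURE);
@end example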
1426
1427@subsubheading MPI flags
1428
1429The @code{flags} can be any of those for the serial FFTW
1430(@pxref{Planner Flags}), and in addition may include one or more of
1431the following MPI-specific flags, which improve performance at the
1432cost of changing the output or input data formats.
1433
1434@itemize @bullet
1435
1436@item
1437@ctindex FFTW_MPI_SCRAMBLED_OUT
1438@ctindex FFTW_MPI_SCRAMBLED_IN
1439@code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for
14401d transforms only, these flags indicate that the output/input of the
1441transform are in an undocumented ``scrambled'' order.  A forward
1442@code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward
1443@code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).
1444@xref{One-dimensional distributions}.
1445
1446@item
1447@ctindex FFTW_MPI_TRANSPOSED_OUT
1448@ctindex FFTW_MPI_TRANSPOSED_IN
1449@code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid
1450for multidimensional (@code{rnk > 1}) transforms only, these flags
1451specify that the output or input of an @ndims{} transform is
transposed to @ndimstrans{}.  @xref{Transposed distributions}.  (A
short sketch using @code{FFTW_MPI_TRANSPOSED_OUT} appears after this
list.)
1453
1454@end itemize
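
For example, to obtain transposed output from an in-place 2d transform
(a sketch with hypothetical dimensions @code{N0} and @code{N1}), one
passes the flag at planning time and allocates using the matching
@samp{_transposed} local-size call:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_2d_transposed(N0, N1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
/* after fftw_execute(plan), the output is the transposed N1 x N0 array,
   distributed over N1 with local_n1 rows on this process */
@end example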
1455
1456@subsubheading Real-data MPI DFTs
1457
1458@cindex r2c
1459Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional
1460MPI DFTs of Real Data}) are created by:
1461
1462@findex fftw_mpi_plan_dft_r2c_2d
1464@findex fftw_mpi_plan_dft_r2c_3d
1465@findex fftw_mpi_plan_dft_r2c
1466@findex fftw_mpi_plan_dft_c2r_2d
1468@findex fftw_mpi_plan_dft_c2r_3d
1469@findex fftw_mpi_plan_dft_c2r
1470@example
1471fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
1472                                   double *in, fftw_complex *out,
1473                                   MPI_Comm comm, unsigned flags);
1477fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1478                                   double *in, fftw_complex *out,
1479                                   MPI_Comm comm, unsigned flags);
1480fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
1481                                double *in, fftw_complex *out,
1482                                MPI_Comm comm, unsigned flags);
1483fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
1484                                   fftw_complex *in, double *out,
1485                                   MPI_Comm comm, unsigned flags);
1489fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1490                                   fftw_complex *in, double *out,
1491                                   MPI_Comm comm, unsigned flags);
1492fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
1493                                fftw_complex *in, double *out,
1494                                MPI_Comm comm, unsigned flags);
1495@end example
1496
1497Similar to the serial interface (@pxref{Real-data DFTs}), these
1498transform logically @ndims{} real data to/from @ndimshalf{} complex
1499data, representing the non-redundant half of the conjugate-symmetry
1500output of a real-input DFT (@pxref{Multi-dimensional Transforms}).
1501However, the real array must be stored within a padded @ndimspad{}
1502array (much like the in-place serial r2c transforms, but here for
1503out-of-place transforms as well). Currently, only multi-dimensional
1504(@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan
1505for @code{rnk = 1} will yield @code{NULL}).  As explained above
1506(@pxref{Multi-dimensional MPI DFTs of Real Data}), the data
1507distribution of both the real and complex arrays is given by the
1508@samp{local_size} function called for the dimensions of the
1509@emph{complex} array.  Similar to the other planning functions, the
1510input and output arrays are overwritten when the plan is created
1511except in @code{FFTW_ESTIMATE} mode.
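
For example, an out-of-place 3d r2c transform (a sketch with
hypothetical dimensions @code{N0}, @code{N1}, and @code{N2}) calls the
@samp{local_size} function for the @emph{complex} dimensions
@code{N0 x N1 x (N2/2+1)} and allocates a padded real array:

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
double *rin;
fftw_complex *cout;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_3d(N0, N1, N2/2 + 1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
rin = fftw_alloc_real(2 * alloc_local);  /* padded: last dimension 2*(N2/2+1) */
cout = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_r2c_3d(N0, N1, N2, rin, cout,
                                MPI_COMM_WORLD, FFTW_MEASURE);
@end example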
1512
As for the complex DFTs above, there is an advanced interface that
1514allows you to manually specify block sizes and to transform contiguous
1515@code{howmany}-tuples of real/complex numbers:
1516
1517@findex fftw_mpi_plan_many_dft_r2c
1518@findex fftw_mpi_plan_many_dft_c2r
1519@example
1520fftw_plan fftw_mpi_plan_many_dft_r2c
1521              (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1522               ptrdiff_t iblock, ptrdiff_t oblock,
1523               double *in, fftw_complex *out,
1524               MPI_Comm comm, unsigned flags);
1525fftw_plan fftw_mpi_plan_many_dft_c2r
1526              (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1527               ptrdiff_t iblock, ptrdiff_t oblock,
1528               fftw_complex *in, double *out,
1529               MPI_Comm comm, unsigned flags);
1530@end example
1531
1532@subsubheading MPI r2r transforms
1533
1534@cindex r2r
1535There are corresponding plan-creation routines for r2r
1536transforms (@pxref{More DFTs of Real Data}), currently supporting
1537multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will
1538yield a @code{NULL} plan):
1539
1540@example
1541fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
1542                               double *in, double *out,
1543                               MPI_Comm comm,
1544                               fftw_r2r_kind kind0, fftw_r2r_kind kind1,
1545                               unsigned flags);
1546fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1547                               double *in, double *out,
1548                               MPI_Comm comm,
1549                               fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
1550                               unsigned flags);
1551fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
1552                            double *in, double *out,
1553                            MPI_Comm comm, const fftw_r2r_kind *kind,
1554                            unsigned flags);
1555fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
1556                                 ptrdiff_t iblock, ptrdiff_t oblock,
1557                                 double *in, double *out,
1558                                 MPI_Comm comm, const fftw_r2r_kind *kind,
1559                                 unsigned flags);
1560@end example
1561
1562The parameters are much the same as for the complex DFTs above, except
1563that the arrays are of real numbers (and hence the outputs of the
1564@samp{local_size} data-distribution functions should be interpreted as
1565counts of real rather than complex numbers).  Also, the @code{kind}
1566parameters specify the r2r kinds along each dimension as for the
1567serial interface (@pxref{Real-to-Real Transform Kinds}).  @xref{Other
1568Multi-dimensional Real-data MPI Transforms}.
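
For example, a 2d DCT-II (@code{FFTW_REDFT10}) along both dimensions
(a sketch with hypothetical dimensions @code{N0} and @code{N1}), where
the @samp{local_size} result now counts @emph{real} numbers:

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
double *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_real(alloc_local);   /* real, not complex, elements */

plan = fftw_mpi_plan_r2r_2d(N0, N1, data, data, MPI_COMM_WORLD,
                            FFTW_REDFT10, FFTW_REDFT10, FFTW_MEASURE);
@end example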
1569
1570@subsubheading MPI transposition
1571@cindex transpose
1572
1573FFTW also provides routines to plan a transpose of a distributed
1574@code{n0} by @code{n1} array of real numbers, or an array of
1575@code{howmany}-tuples of real numbers with specified block sizes
1576(@pxref{FFTW MPI Transposes}):
1577
1578@findex fftw_mpi_plan_transpose
1579@findex fftw_mpi_plan_many_transpose
1580@example
1581fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
1582                                  double *in, double *out,
1583                                  MPI_Comm comm, unsigned flags);
1584fftw_plan fftw_mpi_plan_many_transpose
1585                (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
1586                 ptrdiff_t block0, ptrdiff_t block1,
1587                 double *in, double *out, MPI_Comm comm, unsigned flags);
1588@end example
1589
1590@cindex new-array execution
1591@findex fftw_mpi_execute_r2r
1592These plans are used with the @code{fftw_mpi_execute_r2r} new-array
execute function (@pxref{Using MPI Plans}), since they count as (rank
1594zero) r2r plans from FFTW's perspective.
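
For example, a global transpose of a hypothetical @code{N0} by
@code{N1} array of real numbers (a sketch) might be set up as:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *in, *out;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_2d_transposed(N0, N1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
in = fftw_alloc_real(alloc_local);    /* local_n0 rows of the N0 x N1 input */
out = fftw_alloc_real(alloc_local);   /* local_n1 rows of the N1 x N0 output */

plan = fftw_mpi_plan_transpose(N0, N1, in, out, MPI_COMM_WORLD, FFTW_MEASURE);
fftw_execute(plan);  /* or fftw_mpi_execute_r2r(plan, in, out) for new arrays */
@end example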
1595
1596@node MPI Wisdom Communication,  , MPI Plan Creation, FFTW MPI Reference
1597@subsection MPI Wisdom Communication
1598
1599To facilitate synchronizing wisdom among the different MPI processes,
1600we provide two functions:
1601
1602@findex fftw_mpi_gather_wisdom
1603@findex fftw_mpi_broadcast_wisdom
1604@example
1605void fftw_mpi_gather_wisdom(MPI_Comm comm);
1606void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
1607@end example
1608
1609The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the
1610given communicator @code{comm} to the process of rank 0 in the
1611communicator: that process obtains the union of all wisdom on all the
1612processes.  As a side effect, some other processes will gain
1613additional wisdom from other processes, but only process 0 will gain
1614the complete union.
1615
The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it exports
1617wisdom from process 0 in @code{comm} to all other processes in the
1618communicator, replacing any wisdom they currently have.
1619
1620@xref{FFTW MPI Wisdom}.
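
For example, a typical pattern (a sketch; @code{mywisdom} is a
hypothetical file name) imports wisdom on process 0 and broadcasts it
before planning, then gathers and re-exports it afterwards:

@example
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);

/* ... create and execute plans ... */

fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
@end example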
1621
1622@c ------------------------------------------------------------
1623@node FFTW MPI Fortran Interface,  , FFTW MPI Reference, Distributed-memory FFTW with MPI
1624@section FFTW MPI Fortran Interface
1625@cindex Fortran interface
1626
1627@cindex iso_c_binding
1628The FFTW MPI interface is callable from modern Fortran compilers
1629supporting the Fortran 2003 @code{iso_c_binding} standard for calling
1630C functions.  As described in @ref{Calling FFTW from Modern Fortran},
1631this means that you can directly call FFTW's C interface from Fortran
1632with only minor changes in syntax.  There are, however, a few things
1633specific to the MPI interface to keep in mind:
1634
1635@itemize @bullet
1636
1637@item
1638Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran
interface}, you should @code{include 'fftw3-mpi.f03'} (after
1640@code{use, intrinsic :: iso_c_binding} as before).  The
1641@code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should
1642@emph{not} @code{include} them both yourself.  (You will also want to
1643include the MPI header file, usually via @code{include 'mpif.h'} or
similar, although this is not needed by @code{fftw3-mpi.f03}
@i{per se}.)  (To use the @samp{fftwl_} @code{long double}
extended-precision routines in supporting compilers, you should
include @code{fftw3l-mpi.f03} in @emph{addition} to
@code{fftw3-mpi.f03}.  @xref{Extended and quadruple precision in
Fortran}.)
1646
1647@item
1648Because of the different storage conventions between C and Fortran,
1649you reverse the order of your array dimensions when passing them to
1650FFTW (@pxref{Reversing array dimensions}).  This is merely a
1651difference in notation and incurs no performance overhead.  However,
1652it means that, whereas in C the @emph{first} dimension is distributed,
1653in Fortran the @emph{last} dimension of your array is distributed.
1654
1655@item
1656@cindex MPI communicator
1657In Fortran, communicators are stored as @code{integer} types; there is
1658no @code{MPI_Comm} type, nor is there any way to access a C
1659@code{MPI_Comm}.  Fortunately, this is taken care of for you by the
1660FFTW Fortran interface: whenever the C interface expects an
1661@code{MPI_Comm} type, you should pass the Fortran communicator as an
1662@code{integer}.@footnote{Technically, this is because you aren't
1663actually calling the C functions directly. You are calling wrapper
1664functions that translate the communicator with @code{MPI_Comm_f2c}
1665before calling the ordinary C interface.  This is all done
1666transparently, however, since the @code{fftw3-mpi.f03} interface file
1667renames the wrappers so that they are called in Fortran with the same
1668names as the C interface functions.}
1669
1670@item
1671Because you need to call the @samp{local_size} function to find out
1672how much space to allocate, and this may be @emph{larger} than the
1673local portion of the array (@pxref{MPI Data Distribution}), you should
1674@emph{always} allocate your arrays dynamically using FFTW's allocation
1675routines as described in @ref{Allocating aligned memory in Fortran}.
1676(Coincidentally, this also provides the best performance by
guaranteeing proper data alignment.)
1678
1679@item
1680Because all sizes in the MPI FFTW interface are declared as
1681@code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in
1682Fortran (@pxref{FFTW Fortran type reference}).
1683
1684@item
1685@findex fftw_execute_dft
1686@findex fftw_mpi_execute_dft
1687@cindex new-array execution
1688In Fortran, because of the language semantics, we generally recommend
1689using the new-array execute functions for all plans, even in the
1690common case where you are executing the plan on the same arrays for
1691which the plan was created (@pxref{Plan execution in Fortran}).
1692However, note that in the MPI interface these functions are changed:
1693@code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft},
1694etcetera. @xref{Using MPI Plans}.
1695
1696@end itemize
1697
1698For example, here is a Fortran code snippet to perform a distributed
1699@twodims{L,M} complex DFT in-place.  (This assumes you have already
1700initialized MPI with @code{MPI_init} and have also performed
1701@code{call fftw_mpi_init}.)
1702
1703@example
1704  use, intrinsic :: iso_c_binding
1705  include 'fftw3-mpi.f03'
1706  integer(C_INTPTR_T), parameter :: L = ...
1707  integer(C_INTPTR_T), parameter :: M = ...
1708  type(C_PTR) :: plan, cdata
1709  complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
1710  integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
1711
1712!   @r{get local data size and allocate (note dimension reversal)}
1713  alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
1714                                       local_M, local_j_offset)
1715  cdata = fftw_alloc_complex(alloc_local)
1716  call c_f_pointer(cdata, data, [L,local_M])
1717
1718!   @r{create MPI plan for in-place forward DFT (note dimension reversal)}
1719  plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
1720                              FFTW_FORWARD, FFTW_MEASURE)
1721
1722! @r{initialize data to some function} my_function(i,j)
1723  do j = 1, local_M
1724    do i = 1, L
1725      data(i, j) = my_function(i, j + local_j_offset)
1726    end do
1727  end do
1728
1729! @r{compute transform (as many times as desired)}
1730  call fftw_mpi_execute_dft(plan, data, data)
1731
1732  call fftw_destroy_plan(plan)
1733  call fftw_free(cdata)
1734@end example
1735
Note that we called @code{fftw_mpi_local_size_2d} and
1737@code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order,
1738since a @twodims{L,M} Fortran array is viewed by FFTW in C as a
1739@twodims{M, L} array.  This means that the array was distributed over
1740the @code{M} dimension, the local portion of which is a
1741@twodims{L,local_M} array in Fortran.  (You must @emph{not} use an
1742@code{allocate} statement to allocate an @twodims{L,local_M} array,
1743however; you must allocate @code{alloc_local} complex numbers, which
1744may be greater than @code{L * local_M}, in order to reserve space for
1745intermediate steps of the transform.)  Finally, we mention that
1746because C's array indices are zero-based, the @code{local_j_offset}
1747argument can conveniently be interpreted as an offset in the 1-based
1748@code{j} index (rather than as a starting index as in C).
1749
1750If instead you had used the @code{ior(FFTW_MEASURE,
1751FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a
1752transposed @twodims{M,local_L} array, associated with the @emph{same}
1753@code{cdata} allocation (since the transform is in-place), and which
1754you could declare with:
1755
1756@example
1757  complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
1758  ...
1759  call c_f_pointer(cdata, tdata, [M,local_L])
1760@end example
1761
1762where @code{local_L} would have been obtained by changing the
1763@code{fftw_mpi_local_size_2d} call to:
1764
1765@example
1766  alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
1767                           local_M, local_j_offset, local_L, local_i_offset)
1768@end example
1769