Overview
========

.. currentmodule:: mpi4py.MPI

MPI for Python provides an object-oriented approach to message passing
that is grounded on the standard MPI-2 C++ bindings. The interface was
designed with a focus on translating the syntax and semantics of the
standard MPI-2 C++ bindings to Python. Any user of the standard C/C++
MPI bindings should be able to use this module without needing to
learn a new interface.

Communicating Python Objects and Array Data
-------------------------------------------

The Python standard library supports different mechanisms for data
persistence. Many of them rely on disk storage, but *pickling* and
*marshaling* can also work with memory buffers.

The :mod:`pickle` module provides user-extensible facilities to
serialize general Python objects using ASCII or binary formats. The
:mod:`marshal` module provides facilities to serialize built-in Python
objects using a binary format specific to Python, but independent of
machine architecture issues.

*MPI for Python* can communicate any built-in or user-defined Python
object by taking advantage of the features provided by the
:mod:`pickle` module. These facilities are routinely used to build
binary representations of the objects to communicate (at sending
processes) and to restore them (at receiving processes).

Although simple and general, the serialization approach (i.e.,
*pickling* and *unpickling*) previously discussed imposes important
overheads in memory as well as processor usage, especially in the
scenario of objects with large memory footprints being
communicated. Pickling general Python objects, ranging from primitive
or container built-in types to user-defined classes, necessarily
requires computer resources. Processing is also needed for
dispatching the appropriate serialization method (that depends on the
type of the object) and doing the actual packing. Additional memory is
always needed, and if its total amount is not known *a priori*, many
reallocations can occur. Indeed, in the case of large numeric arrays,
this is certainly unacceptable and precludes communication of objects
occupying half or more of the available memory resources.

*MPI for Python* supports direct communication of any object exporting
the single-segment buffer interface. This interface is a standard
Python mechanism provided by some types (e.g., strings and numeric
arrays), allowing access on the C side to a contiguous memory buffer
(i.e., address and length) containing the relevant data. This feature,
in conjunction with the capability of constructing user-defined MPI
datatypes describing complicated memory layouts, enables the
implementation of many algorithms involving multidimensional numeric
arrays (e.g., image processing, fast Fourier transforms, finite
difference schemes on structured Cartesian grids) directly in Python,
with negligible overhead, and almost as fast as compiled Fortran, C,
or C++ codes.
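
The following sketch contrasts the two communication styles discussed
above: the lower-case ``send``/``recv`` methods serialize general
Python objects via :mod:`pickle`, while the upper-case ``Send``/``Recv``
methods transfer buffer-like objects directly. It assumes NumPy is
installed and that the script is run with at least two MPI processes.

.. code-block:: python

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   if rank == 0:
       # General Python object: goes through pickle-based serialization.
       comm.send({"step": 1, "data": [1.0, 2.0]}, dest=1, tag=11)
       # NumPy array: exposes its buffer, so it is sent directly.
       array = np.arange(10, dtype="d")
       comm.Send([array, MPI.DOUBLE], dest=1, tag=22)
   elif rank == 1:
       obj = comm.recv(source=0, tag=11)
       array = np.empty(10, dtype="d")
       comm.Recv([array, MPI.DOUBLE], source=0, tag=22)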


Communicators
-------------

In *MPI for Python*, `Comm` is the base class of communicators. The
`Intracomm` and `Intercomm` classes are subclasses of the `Comm`
class. The `Comm.Is_inter` method (and `Comm.Is_intra`, provided for
convenience but not part of the MPI specification) is defined for
communicator objects and can be used to determine the particular
communicator class.

Two predefined intracommunicator instances are available:
`COMM_SELF` and `COMM_WORLD`. From them, new communicators can be
created as needed.

The number of processes in a communicator and the calling process rank
can be respectively obtained with the `Comm.Get_size` and
`Comm.Get_rank` methods. The associated process group can be retrieved
from a communicator by calling the `Comm.Get_group` method, which
returns an instance of the `Group` class. Set operations with `Group`
objects like `Group.Union`, `Group.Intersection` and `Group.Difference`
are fully supported, as well as the creation of new communicators from
these groups using `Comm.Create` and `Comm.Create_group`.

New communicator instances can be obtained with the `Comm.Clone`,
`Comm.Dup` and `Comm.Split` methods, as well as with the
`Intracomm.Create_intercomm` and `Intercomm.Merge` methods.

Virtual topologies (`Cartcomm`, `Graphcomm` and `Distgraphcomm`
classes, which are specializations of the `Intracomm` class) are fully
supported. New instances can be obtained from intracommunicator
instances with factory methods `Intracomm.Create_cart` and
`Intracomm.Create_graph`.
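
As a minimal illustration of the communicator management described
above, the sketch below queries the size and rank of `COMM_WORLD` and
then splits it into two subcommunicators according to rank parity; the
parity-based color is just an arbitrary choice for this example.

.. code-block:: python

   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   size = comm.Get_size()   # number of processes in the communicator
   rank = comm.Get_rank()   # rank of the calling process

   # Split COMM_WORLD into two disjoint intracommunicators,
   # grouping processes with even and odd ranks.
   color = rank % 2
   subcomm = comm.Split(color, key=rank)
   print(f"world rank {rank}/{size} -> subcomm rank "
         f"{subcomm.Get_rank()}/{subcomm.Get_size()}")
   subcomm.Free()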


Point-to-Point Communications
-----------------------------

Point-to-point communication is a fundamental capability of message
passing systems. This mechanism enables the transmission of data
between a pair of processes, one side sending, the other receiving.

MPI provides a set of *send* and *receive* functions allowing the
communication of *typed* data with an associated *tag*. The type
information enables the conversion of data representation from one
architecture to another in the case of heterogeneous computing
environments; additionally, it allows the representation of
non-contiguous data layouts and user-defined datatypes, thus avoiding
the overhead of (otherwise unavoidable) packing/unpacking
operations. The tag information allows selectivity of messages at the
receiving end.


Blocking Communications
^^^^^^^^^^^^^^^^^^^^^^^

MPI provides basic send and receive functions that are *blocking*.
These functions block the caller until the data buffers involved in
the communication can be safely reused by the application program.

In *MPI for Python*, the `Comm.Send`, `Comm.Recv` and `Comm.Sendrecv`
methods of communicator objects provide support for blocking
point-to-point communications within `Intracomm` and `Intercomm`
instances. These methods can communicate memory buffers. The variants
`Comm.send`, `Comm.recv` and `Comm.sendrecv` can communicate general
Python objects.
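
As a sketch of the blocking methods just listed, the example below
performs a ring exchange with `Comm.sendrecv`, which combines a send
and a receive in one blocking call and thereby avoids the deadlocks
that mismatched `Comm.send`/`Comm.recv` orderings can produce. Any
picklable Python object could be exchanged here.

.. code-block:: python

   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()
   size = comm.Get_size()

   right = (rank + 1) % size   # neighbor to send to
   left = (rank - 1) % size    # neighbor to receive from

   # Blocking combined send/receive of a general Python object.
   received = comm.sendrecv({"from": rank}, dest=right, source=left)
   assert received["from"] == left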

Nonblocking Communications
^^^^^^^^^^^^^^^^^^^^^^^^^^

On many systems, performance can be significantly increased by
overlapping communication and computation. This is particularly true
on systems where communication can be executed autonomously by an
intelligent, dedicated communication controller.

MPI provides *nonblocking* send and receive functions. They allow the
possible overlap of communication and computation. Nonblocking
communication always comes in two parts: posting functions, which
begin the requested operation, and test-for-completion functions,
which allow one to discover whether the requested operation has
completed.

In *MPI for Python*, the `Comm.Isend` and `Comm.Irecv` methods
initiate send and receive operations, respectively. These methods
return a `Request` instance, uniquely identifying the started
operation. Its completion can be managed using the `Request.Test`,
`Request.Wait` and `Request.Cancel` methods. The management of
`Request` objects and associated memory buffers involved in
communication requires careful, rather low-level coordination. Users
must ensure that objects exposing their memory buffers are not
accessed at the Python level while they are involved in nonblocking
message-passing operations.
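
The sketch below posts a nonblocking send and a matching nonblocking
receive, leaves room for overlapping computation, and then waits for
the request to complete. It assumes NumPy is installed and at least
two MPI processes; note that the ``data`` and ``buf`` arrays must not
be touched until the corresponding request has completed.

.. code-block:: python

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   if rank == 0:
       data = np.arange(100, dtype="i")
       request = comm.Isend([data, MPI.INT], dest=1, tag=77)
   elif rank == 1:
       buf = np.empty(100, dtype="i")
       request = comm.Irecv([buf, MPI.INT], source=0, tag=77)

   if rank in (0, 1):
       # ... computation overlapping the communication could go here ...
       request.Wait()   # buffers are safe to reuse only after this point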

Persistent Communications
^^^^^^^^^^^^^^^^^^^^^^^^^

Often a communication with the same argument list is repeatedly
executed within an inner loop. In such cases, communication can be
further optimized by using persistent communication, a particular case
of nonblocking communication allowing the reduction of the overhead
between processes and communication controllers. Furthermore, this
kind of optimization can also alleviate the extra call overheads
associated with interpreted, dynamic languages like Python.

In *MPI for Python*, the `Comm.Send_init` and `Comm.Recv_init` methods
create persistent requests for a send and receive operation,
respectively. These methods return an instance of the `Prequest`
class, a subclass of the `Request` class. The actual communication can
be effectively started using the `Prequest.Start` method, and its
completion can be managed as previously described.
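
A minimal sketch of persistent communication follows: the requests are
created once, then started and completed inside a loop that reuses the
same buffers. It assumes NumPy and at least two MPI processes; the
number of iterations is arbitrary.

.. code-block:: python

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()
   buf = np.zeros(10, dtype="d")

   if rank == 0:
       request = comm.Send_init([buf, MPI.DOUBLE], dest=1, tag=5)
   elif rank == 1:
       request = comm.Recv_init([buf, MPI.DOUBLE], source=0, tag=5)

   if rank in (0, 1):
       for step in range(10):
           if rank == 0:
               buf[:] = step       # refill the send buffer
           request.Start()         # start the persistent operation
           request.Wait()          # complete it before reusing the buffer
       request.Free()              # release the persistent request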


Collective Communications
-------------------------

Collective communications allow the transmittal of data between
multiple processes of a group simultaneously. The syntax and semantics
of collective functions are consistent with point-to-point
communication. Collective functions communicate *typed* data, but
messages are not paired with an associated *tag*; selectivity of
messages is implied in the calling order. Additionally, collective
functions come in blocking versions only.

The more commonly used collective communication operations are the
following:

* Barrier synchronization across all group members.

* Global communication functions:

  + Broadcast data from one member to all members of a group.

  + Gather data from all members to one member of a group.

  + Scatter data from one member to all members of a group.

* Global reduction operations such as sum, maximum, minimum, etc.

In *MPI for Python*, the `Comm.Bcast`, `Comm.Scatter`, `Comm.Gather`,
`Comm.Allgather` and `Comm.Alltoall` methods provide support for
collective communications of memory buffers. The lower-case variants
`Comm.bcast`, `Comm.scatter`, `Comm.gather`, `Comm.allgather` and
`Comm.alltoall` can communicate general Python objects. The vector
variants (which can communicate different amounts of data to each
process) `Comm.Scatterv`, `Comm.Gatherv`, `Comm.Allgatherv`,
`Comm.Alltoallv` and `Comm.Alltoallw` are also supported; they can
only communicate objects exposing memory buffers.

Global reduction operations on memory buffers are accessible through
the `Comm.Reduce`, `Comm.Reduce_scatter`, `Comm.Allreduce`,
`Intracomm.Scan` and `Intracomm.Exscan` methods. The lower-case
variants `Comm.reduce`, `Comm.allreduce`, `Intracomm.scan` and
`Intracomm.exscan` can communicate general Python objects; however,
the actual required reduction computations are performed sequentially
at some process. All the predefined (i.e., `SUM`, `PROD`, `MAX`, etc.)
reduction operations can be applied.
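
A short sketch of the collective operations described above: a general
Python object is broadcast with the lower-case `Comm.bcast`, and a
NumPy array is summed across all processes with the buffer-based
`Comm.Allreduce`. NumPy is assumed to be installed.

.. code-block:: python

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   # Broadcast a general Python object from the root process.
   config = {"steps": 100, "dt": 0.01} if rank == 0 else None
   config = comm.bcast(config, root=0)

   # Element-wise sum of buffers contributed by every process.
   sendbuf = np.full(5, rank, dtype="i")
   recvbuf = np.empty(5, dtype="i")
   comm.Allreduce([sendbuf, MPI.INT], [recvbuf, MPI.INT], op=MPI.SUM)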


Support for GPU-aware MPI
-------------------------

Several MPI implementations, including Open MPI and MVAPICH, support
passing GPU pointers to MPI calls to avoid explicit data movement
between the host and the device. On the Python side, GPU arrays have
been implemented by many libraries that need GPU computation, such as
CuPy, Numba, PyTorch, and PyArrow. In order to increase library
interoperability, two kinds of zero-copy data exchange protocols have
been defined and agreed upon: `DLPack`_ and the
`CUDA Array Interface`_. For example, a CuPy array can be passed to a
Numba CUDA-jit kernel.

.. _DLPack: https://data-apis.org/array-api/latest/design_topics/data_interchange.html
.. _CUDA Array Interface: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html

*MPI for Python* provides experimental support for GPU-aware MPI.
This feature requires:

1. mpi4py built against a GPU-aware MPI library.

2. Python GPU arrays compliant with either of the protocols.

See the :doc:`tutorial` section for further information. We note that:

* Whether or not an MPI call can work for GPU arrays depends on the
  underlying MPI implementation, not on mpi4py.

* This support is currently experimental and subject to change in the
  future.
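
As an illustrative sketch only, the snippet below sums a CuPy array
across processes by passing GPU buffers directly to `Comm.Allreduce`.
It assumes CuPy is installed, that the underlying MPI library is
GPU-aware, and that each process has access to a GPU; none of this is
checked by mpi4py itself.

.. code-block:: python

   from mpi4py import MPI
   import cupy as cp

   comm = MPI.COMM_WORLD

   # Device arrays are passed directly; a GPU-aware MPI implementation
   # moves the data without an explicit host/device copy in user code.
   sendbuf = cp.arange(10, dtype="f")
   recvbuf = cp.empty_like(sendbuf)
   comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)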


Dynamic Process Management
--------------------------

In the context of the MPI-1 specification, a parallel application is
static; that is, no processes can be added to or deleted from a
running application after it has been started. Fortunately, this
limitation was addressed in MPI-2. The new specification added a
process management model providing a basic interface between an
application and external resources and process managers.

This MPI-2 extension can be really useful, especially for sequential
applications built on top of parallel modules, or parallel
applications with a client/server model. The MPI-2 process model
provides a mechanism to create new processes and establish
communication between them and the existing MPI application. It also
provides mechanisms to establish communication between two existing
MPI applications, even when one did not *start* the other.

In *MPI for Python*, new independent process groups can be created by
calling the `Intracomm.Spawn` method within an intracommunicator.
This call returns a new intercommunicator (i.e., an `Intercomm`
instance) at the parent process group. The child process group can
retrieve the matching intercommunicator by calling the
`Comm.Get_parent` class method. On each side, the new
intercommunicator can be used to perform point-to-point and collective
communications between the parent and child groups of processes.
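
As a sketch of process spawning, the parent below launches two copies
of a hypothetical ``worker.py`` script and broadcasts a value to them
over the resulting intercommunicator; the worker script name and the
number of processes are arbitrary choices for this example, and NumPy
is assumed.

.. code-block:: python

   # parent.py
   import sys
   from mpi4py import MPI
   import numpy as np

   # Spawn two worker processes running "worker.py" (hypothetical name).
   intercomm = MPI.COMM_SELF.Spawn(sys.executable,
                                   args=["worker.py"], maxprocs=2)
   N = np.array(100, dtype="i")
   intercomm.Bcast([N, MPI.INT], root=MPI.ROOT)
   intercomm.Disconnect()

.. code-block:: python

   # worker.py
   from mpi4py import MPI
   import numpy as np

   # Retrieve the intercommunicator connecting to the parent group.
   parent = MPI.Comm.Get_parent()
   N = np.array(0, dtype="i")
   parent.Bcast([N, MPI.INT], root=0)
   parent.Disconnect()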

Alternatively, disjoint groups of processes can establish
communication using a client/server approach. Any server application
must first call the `Open_port` function to open a *port* and the
`Publish_name` function to publish a provided *service*, and next call
the `Intracomm.Accept` method. Any client application can first find a
published *service* by calling the `Lookup_name` function, which
returns the *port* where a server can be contacted, and next call the
`Intracomm.Connect` method. Both the `Intracomm.Accept` and
`Intracomm.Connect` methods return an `Intercomm` instance. When the
connection between client and server processes is no longer needed,
all of them must cooperatively call the `Comm.Disconnect`
method. Additionally, server applications should release resources by
calling the `Unpublish_name` and `Close_port` functions.
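
A stripped-down sketch of the client/server approach follows. To keep
it simple, the port name is exchanged out of band (the server prints
it and the client receives it as a command-line argument) instead of
going through `Publish_name`/`Lookup_name`; both programs are assumed
to be started as separate single-process MPI applications.

.. code-block:: python

   # server.py
   from mpi4py import MPI

   port = MPI.Open_port()
   print("server port:", port, flush=True)
   # Wait for a client to connect to the opened port.
   intercomm = MPI.COMM_WORLD.Accept(port)
   intercomm.send("hello client", dest=0, tag=0)
   intercomm.Disconnect()
   MPI.Close_port(port)

.. code-block:: python

   # client.py
   import sys
   from mpi4py import MPI

   port = sys.argv[1]  # port name printed by the server
   intercomm = MPI.COMM_WORLD.Connect(port)
   message = intercomm.recv(source=0, tag=0)
   intercomm.Disconnect()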


One-Sided Communications
------------------------

One-sided communication (also called *Remote Memory Access*, *RMA*)
supplements the traditional two-sided, send/receive based MPI
communication model with a one-sided, put/get based
interface. One-sided communication can take advantage of the
capabilities of highly specialized network hardware. Additionally,
this extension lowers latency and software overhead in applications
written using a shared-memory-like paradigm.

The MPI specification revolves around the use of objects called
*windows*; they intuitively specify regions of a process's memory that
have been made available for remote read and write operations. The
published memory blocks can be accessed through three functions for
put (remote send), get (remote read), and accumulate (remote update
or reduction) data items. A much larger number of functions support
different synchronization styles; the semantics of these
synchronization operations are fairly complex.

In *MPI for Python*, one-sided operations are available by using
instances of the `Win` class. New window objects are created by
calling the `Win.Create` method at all processes within a communicator
and specifying a memory buffer. When a window instance is no longer
needed, the `Win.Free` method should be called.

The three one-sided MPI operations for remote write, read and
reduction are available through calling the methods `Win.Put`,
`Win.Get`, and `Win.Accumulate`, respectively, within a `Win` instance.
These methods need an integer rank identifying the target process and
an integer offset relative to the base address of the remote memory
block being accessed.

The one-sided operations for remote write, read, and reduction are
implicitly nonblocking, and must be synchronized by using one of two
primary modes. Active target synchronization requires the origin
process to call the `Win.Start` and `Win.Complete` methods, while the
target process cooperates by calling the `Win.Post` and `Win.Wait`
methods. There is also a collective variant provided by the
`Win.Fence` method. Passive target synchronization is more lenient:
only the origin process calls the `Win.Lock` and `Win.Unlock`
methods. Locks are used to protect remote accesses to the locked
remote window and to protect local load/store accesses to a locked
local window.
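
The sketch below creates a window over a NumPy array on every process
and uses the collective `Win.Fence` synchronization to delimit an
access epoch in which rank 0 writes into the window of rank 1 with
`Win.Put`. NumPy and at least two MPI processes are assumed.

.. code-block:: python

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   # Expose a local buffer for remote memory access.
   memory = np.zeros(10, dtype="d")
   win = MPI.Win.Create(memory, comm=comm)

   win.Fence()   # open an access epoch
   if rank == 0 and comm.Get_size() > 1:
       data = np.arange(10, dtype="d")
       win.Put([data, MPI.DOUBLE], 1)   # write into the window of rank 1
   win.Fence()   # close the epoch; the data is now visible at the target
   win.Free()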


Parallel Input/Output
---------------------

The POSIX standard provides a model of a widely portable file
system. However, the optimization needed for parallel input/output
cannot be achieved with this generic interface. In order to ensure
efficiency and scalability, the underlying parallel input/output
system must provide a high-level interface supporting partitioning of
file data among processes and a collective interface supporting
complete transfers of global data structures between process memories
and files. Additionally, further efficiencies can be gained via
support for asynchronous input/output, strided accesses to data, and
control over physical file layout on storage devices. This scenario
motivated the inclusion in the MPI-2 standard of a custom interface
supporting more elaborate parallel input/output operations.

The MPI specification for parallel input/output revolves around the
use of objects called *files*. As defined by MPI, files are not just
contiguous byte streams. Instead, they are regarded as ordered
collections of *typed* data items. MPI supports sequential or random
access to any integral set of these items. Furthermore, files are
opened collectively by a group of processes.

The common patterns for accessing a shared file (broadcast, scatter,
gather, reduction) are expressed by using user-defined datatypes.
Compared to the communication patterns of point-to-point and
collective communications, this approach has the advantage of added
flexibility and expressiveness. Data access operations (read and
write) are defined for different kinds of positioning (using explicit
offsets, individual file pointers, and shared file pointers),
coordination (non-collective and collective), and synchronism
(blocking, nonblocking, and split collective with begin/end phases).

In *MPI for Python*, all MPI input/output operations are performed
through instances of the `File` class. File handles are obtained by
calling the `File.Open` method at all processes within a communicator
and providing a file name and the intended access mode. After use,
they must be closed by calling the `File.Close` method. Files can even
be deleted by calling the `File.Delete` method.

After creation, files are typically associated with a per-process
*view*. The view defines the current set of data visible and
accessible from an open file as an ordered set of elementary
datatypes. This data layout can be set and queried with the
`File.Set_view` and `File.Get_view` methods, respectively.

Actual input/output operations are achieved by many methods combining
read and write calls with different behavior regarding positioning,
coordination, and synchronism. Summing up, *MPI for Python* provides
the thirty (30) methods defined in MPI-2 for reading from or writing
to files using explicit offsets or file pointers (individual or
shared), in blocking or nonblocking and collective or noncollective
versions.
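
The following sketch writes a distinct block of doubles from each
process into a single shared file using explicit offsets and the
collective `File.Write_at_all` method, then closes the file. NumPy is
assumed, and the file name is an arbitrary choice for this example.

.. code-block:: python

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   data = np.full(10, rank, dtype="d")

   fh = MPI.File.Open(comm, "output.dat",
                      MPI.MODE_WRONLY | MPI.MODE_CREATE)
   # Interpret the file as a sequence of doubles.
   fh.Set_view(0, MPI.DOUBLE)
   # Each process writes its block at a non-overlapping offset
   # (in etype units, i.e., doubles).
   fh.Write_at_all(rank * data.size, [data, MPI.DOUBLE])
   fh.Close()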

Environmental Management
------------------------

Initialization and Exit
^^^^^^^^^^^^^^^^^^^^^^^

Module functions `Init` or `Init_thread` and `Finalize` provide MPI
initialization and finalization, respectively. Module functions
`Is_initialized` and `Is_finalized` provide the respective tests for
initialization and finalization.

.. note::

   :c:func:`MPI_Init` or :c:func:`MPI_Init_thread` is actually called
   when you import the :mod:`~mpi4py.MPI` module from the
   :mod:`mpi4py` package, but only if MPI is not already
   initialized. In such a case, calling `Init` or `Init_thread` from
   Python is expected to generate an MPI error, and in turn an
   exception will be raised.

.. note::

   :c:func:`MPI_Finalize` is registered (by using the Python C/API
   function :c:func:`Py_AtExit`) to be automatically called when
   Python processes exit, but only if :mod:`mpi4py` actually
   initialized MPI. Therefore, there is no need to call `Finalize`
   from Python to ensure MPI finalization.

Implementation Information
^^^^^^^^^^^^^^^^^^^^^^^^^^

* The MPI version number can be retrieved from the module function
  `Get_version`. It returns a two-integer tuple ``(version,
  subversion)``.

* The `Get_processor_name` function can be used to access the
  processor name.

* The values of predefined attributes attached to the world
  communicator can be obtained by calling the `Comm.Get_attr` method
  within the `COMM_WORLD` instance.

Timers
^^^^^^

MPI timer functionalities are available through the `Wtime` and
`Wtick` functions.
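
A short sketch of the environmental queries and timers described
above; the `TAG_UB` attribute is used here just as one example of a
predefined attribute key.

.. code-block:: python

   from mpi4py import MPI

   version, subversion = MPI.Get_version()
   name = MPI.Get_processor_name()
   tag_ub = MPI.COMM_WORLD.Get_attr(MPI.TAG_UB)

   t0 = MPI.Wtime()           # wall-clock time in seconds
   # ... some work could go here ...
   elapsed = MPI.Wtime() - t0
   resolution = MPI.Wtick()   # timer resolution in seconds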

Error Handling
^^^^^^^^^^^^^^

In order to facilitate handle sharing with other Python modules
interfacing MPI-based parallel libraries, the predefined MPI error
handlers `ERRORS_RETURN` and `ERRORS_ARE_FATAL` can be assigned to and
retrieved from communicators using the `Comm.Set_errhandler` and
`Comm.Get_errhandler` methods, and similarly for windows and files.

When the predefined error handler `ERRORS_RETURN` is set, errors
returned from MPI calls within Python code will raise an instance of
the exception class `Exception`, which is a subclass of the standard
Python exception `python:RuntimeError`.
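
The sketch below sets the `ERRORS_RETURN` error handler explicitly
(mpi4py already does so for `COMM_WORLD`, as noted below) and then
catches the `Exception` instance raised by a failing MPI call; the
intentionally invalid destination rank is used only to trigger an
error for illustration.

.. code-block:: python

   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   comm.Set_errhandler(MPI.ERRORS_RETURN)

   try:
       # Invalid destination rank: the MPI call reports an error,
       # which mpi4py turns into an MPI.Exception.
       comm.send(None, dest=comm.Get_size())
   except MPI.Exception as exc:
       print("MPI error:", exc.Get_error_string())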

.. note::

   After import, mpi4py overrides the default MPI rules governing
   inheritance of error handlers. The `ERRORS_RETURN` error handler is
   set in the predefined `COMM_SELF` and `COMM_WORLD` communicators,
   as well as any new `Comm`, `Win`, or `File` instance created
   through mpi4py. If you ever pass such handles to C/C++/Fortran
   library code, it is recommended to set the `ERRORS_ARE_FATAL` error
   handler on them to ensure MPI errors do not pass silently.

.. warning::

   Importing with ``from mpi4py.MPI import *`` will cause a name clash
   with the standard Python `python:Exception` base class.