Related Work
============

Writing the "related work" section for a project called "distributed" is a
Sisyphean task.  Below we list a few notable projects that you have probably
already heard of.

You may also find the `dask comparison with spark`_ of interest.

.. _`dask comparison with spark`: http://docs.dask.org/en/latest/spark.html


Big Data World
--------------

*   The venerable Hadoop_ provides batch processing with the MapReduce
    programming paradigm.  Python users typically use `Hadoop Streaming`_ or
    MRJob_.
*   Spark_ builds on top of HDFS systems with a nicer API and in-memory
    processing.  Python users typically use PySpark_ (a minimal sketch follows
    this list).
*   Storm_ provides streaming computation.  Python users typically use
    streamparse_.
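
To give a sense of what PySpark usage looks like, here is a minimal word-count
sketch.  This is an illustration only; it assumes a local Spark installation,
and ``data.txt`` is a placeholder path:

.. code-block:: python

   # Minimal PySpark sketch: count words in a text file.
   # Assumes Spark is installed locally; "data.txt" is a placeholder path.
   from pyspark import SparkContext

   sc = SparkContext("local", "wordcount")
   counts = (sc.textFile("data.txt")
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
   print(counts.take(10))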

This is a woefully inadequate representation of the excellent work blossoming
in this space.  A variety of other projects rival or complement those listed
above.  Still, most "Big Data" processing hype probably centers on these three
projects or their derivatives.

.. _Hadoop: https://hadoop.apache.org/
.. _MRJob: https://pythonhosted.org/mrjob/
.. _`Hadoop Streaming`: https://hadoop.apache.org/docs/r1.2.1/streaming.html
.. _Spark: http://spark.apache.org/
.. _PySpark: http://spark.apache.org/docs/latest/api/python/
.. _Storm: http://storm.apache.org/
.. _streamparse: https://streamparse.readthedocs.io/en/latest/index.html
.. _Disco: http://discoproject.org/

Python Projects
---------------

There are dozens of Python projects for distributed computing.  Here we list a
few of the more prominent projects that we see in active use today.

Task scheduling
~~~~~~~~~~~~~~~

*   Celery_: An asynchronous task scheduler, focusing on real-time processing
    (a minimal sketch follows this list).
*   Luigi_: A bulk big-data/batch task scheduler, with hooks to a variety of
    interesting data sources.
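
For a flavor of Celery's task API, here is a minimal sketch.  The broker URL
and module name are placeholders; it assumes a Redis broker running locally:

.. code-block:: python

   # tasks.py -- a minimal Celery sketch.
   # Assumes a Redis broker at the placeholder URL below.
   from celery import Celery

   app = Celery("tasks", broker="redis://localhost:6379/0")

   @app.task
   def add(x, y):
       return x + y

   # Callers enqueue work asynchronously and fetch the result later:
   #   result = add.delay(2, 3)
   #   result.get(timeout=10)   # -> 5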

Ad hoc computation
~~~~~~~~~~~~~~~~~~

*   `IPython Parallel`_: Allows for stateful remote control of several running
    IPython sessions.
*   Scoop_: Implements the `concurrent.futures`_ API on distributed workers.
    Notably allows tasks to spawn more tasks (a minimal sketch follows this
    list).
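
As a rough illustration of that `concurrent.futures`_-style interface, here is
a minimal SCOOP sketch.  The ``slow_square`` function is a made-up example, and
such scripts are typically launched with ``python -m scoop script.py``:

.. code-block:: python

   # A minimal SCOOP sketch using its concurrent.futures-style API.
   # slow_square is an illustrative placeholder function.
   from scoop import futures

   def slow_square(x):
       return x * x

   if __name__ == "__main__":
       # futures.map distributes calls across SCOOP workers.
       results = list(futures.map(slow_square, range(16)))
       print(results)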

Direct Communication
~~~~~~~~~~~~~~~~~~~~

*   MPI4Py_: Wraps the Message Passing Interface popular in high performance
    computing (a minimal sketch follows this list).
*   PyZMQ_: Wraps ZeroMQ, the high-performance asynchronous messaging library.
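
For reference, a minimal mpi4py sketch looks roughly like this.  It assumes an
MPI implementation is installed and is launched with something like
``mpiexec -n 4 python script.py``:

.. code-block:: python

   # Minimal mpi4py sketch: each rank reports in, rank 0 gathers results.
   # Assumes an MPI implementation is available; launch with mpiexec.
   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()

   # Every process contributes its rank; rank 0 receives the full list.
   gathered = comm.gather(rank, root=0)
   if rank == 0:
       print("ranks:", gathered)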

Venerable
~~~~~~~~~

There are a couple of older projects that often get mentioned:

*   Dispy_: Embarrassingly parallel function evaluation
*   Pyro_: Remote objects / RPC

.. _Luigi: https://luigi.readthedocs.io/en/latest/
.. _MPI4Py: http://mpi4py.readthedocs.io/en/stable/
.. _PyZMQ: https://github.com/zeromq/pyzmq
.. _Celery: http://www.celeryproject.org/
.. _`IPython Parallel`: https://ipyparallel.readthedocs.io/en/latest/
.. _Scoop: https://github.com/soravux/scoop/
.. _`concurrent.futures`: https://docs.python.org/3/library/concurrent.futures.html
.. _Dispy: http://dispy.sourceforge.net/
.. _Pyro: https://pythonhosted.org/Pyro4/


Relationship
------------

In relation to these projects, ``distributed``...

*  Supports data-local computation, like Hadoop and Spark
*  Uses a task graph abstraction with data dependencies, like Luigi
*  Supports ad hoc applications, like IPython Parallel and Scoop


In-depth comparison with particular projects
---------------------------------------------

IPython Parallel
~~~~~~~~~~~~~~~~

**Short Description**

`IPython Parallel`_ is a distributed computing framework from the IPython
project.  It uses a centralized hub to farm out jobs to several ``ipengine``
processes running on remote workers.  It communicates over ZeroMQ sockets and
routes all communication through that central hub.

IPython Parallel has been around for a while and, while not particularly fancy,
is quite stable and robust.

IPython Parallel offers parallel ``map`` and remote ``apply`` functions that
route computations to remote workers:

.. code-block:: python

   >>> from ipyparallel import Client   # the IPython Parallel client class
   >>> view = Client(...)[:]
   >>> results = view.map(func, sequence)
   >>> result = view.apply(func, *args, **kwargs)
   >>> future = view.apply_async(func, *args, **kwargs)

It also provides direct execution of code in the remote processes and
collection of data from the remote namespaces:

.. code-block:: python

   >>> view.execute('x = 1 + 2')
   >>> view['x']
   [3, 3, 3, 3, 3, 3]

**Brief Comparison**

Distributed and IPython Parallel are similar in that they provide ``map`` and
``apply/submit`` abstractions over distributed worker processes running Python.
Both manage the remote namespaces of those worker processes.

They differ in their maturity, in how worker nodes communicate with each other,
and in the complexity of algorithms that they enable.

**Distributed Advantages**

The primary advantages of ``distributed`` over IPython Parallel include:

1.  Peer-to-peer communication between workers
2.  Dynamic task scheduling

``Distributed`` workers share data in a peer-to-peer fashion, without having to
send intermediate results through a central bottleneck.  This allows
``distributed`` to be more effective for more complex algorithms and to manage
larger datasets in a more natural manner.  IPython Parallel does not provide a
mechanism for workers to communicate with each other, except by using the
central node as an intermediary for data transfer or by relying on some other
medium, like a shared file system.  Data transfer through the central node can
easily become a bottleneck, so IPython Parallel has mostly been used for
embarrassingly parallel work (the bulk of applications) rather than for more
sophisticated algorithms that require non-trivial communication patterns.

The distributed client includes a dynamic task scheduler capable of managing
deep data dependencies between tasks.  The IPython Parallel docs include `a
recipe`_ for executing task graphs with data dependencies.  This same idea is
core to all of ``distributed``, which uses a dynamic task scheduler for all
operations.  Notably, ``distributed.Future`` objects can be used within
``submit/map/get`` calls before they have completed:

.. code-block:: python

   >>> x = client.submit(f, 1)  # returns a future
   >>> y = client.submit(f, 2)  # returns a future
   >>> z = client.submit(add, x, y)  # consumes futures

The ability to use futures cheaply within ``submit`` and ``map`` methods
enables the construction of very sophisticated data pipelines with simple code.
Additionally, distributed can serve as a full dask task scheduler, enabling
support for distributed arrays, dataframes, machine learning pipelines, and any
other application built on dask graphs.  The dynamic task schedulers within
``distributed`` are adapted from the dask_ task schedulers and so are fairly
sophisticated and efficient.
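
For example, a dask collection can be executed on a ``distributed`` cluster
simply by creating a client.  A minimal sketch, assuming ``dask`` and
``distributed`` are installed locally:

.. code-block:: python

   >>> from dask.distributed import Client
   >>> import dask.array as da

   >>> client = Client()  # start a local scheduler and workers

   >>> x = da.random.random((10000, 10000), chunks=(1000, 1000))
   >>> x.mean().compute()  # the dask graph runs on the distributed scheduler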

**IPython Parallel Advantages**

IPython Parallel has the following advantages over ``distributed``:

1.  Maturity:  IPython Parallel has been around for a while.
2.  Explicit control over the worker processes:  IPython Parallel lets you
    execute arbitrary statements on the workers, which makes it useful for
    system administration tasks.
3.  Deployment help:  IPython Parallel has built-in mechanisms to aid
    deployment on SGE, MPI, etc.  Distributed does not have any such sugar,
    though it is fairly simple to `set up <https://docs.dask.org/en/latest/setup.html>`_ by hand.
4.  Various other advantages:  Over the years IPython Parallel has accrued a
    variety of helpful features, like IPython interaction magics and
    ``@parallel`` decorators.

.. _`a recipe`: https://ipython.org/ipython-doc/3/parallel/dag_dependencies.html#dag-dependencies
.. _dask: https://dask.org/


concurrent.futures
~~~~~~~~~~~~~~~~~~

The :class:`distributed.Client` API is modeled after :mod:`concurrent.futures`
and :pep:`3148`.  It has a few notable differences:

*  ``distributed`` accepts :class:`~distributed.client.Future` objects within
   calls to ``submit/map``.  When chaining computations, it is preferable to
   submit Future objects directly rather than wait on them before submission
   (a short sketch follows this list).
*  The :meth:`~distributed.client.Client.map` method returns
   :class:`~distributed.client.Future` objects, not concrete results, and it
   returns immediately rather than blocking.
*  Despite sharing a similar API, ``distributed``
   :class:`~distributed.client.Future` objects cannot always be substituted for
   :class:`concurrent.futures.Future` objects, especially when using
   ``wait()`` or ``as_completed()``.
*  Distributed generally does not support callbacks.
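
A minimal sketch of the first two points, assuming ``dask.distributed`` is
installed; ``inc`` is a trivial function defined only for illustration:

.. code-block:: python

   >>> from dask.distributed import Client
   >>> client = Client()  # or Client("scheduler-address:8786")

   >>> def inc(x):
   ...     return x + 1

   >>> futures = client.map(inc, range(10))   # returns Futures immediately
   >>> total = client.submit(sum, futures)    # a list of Futures is accepted
   >>> total.result()
   55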

If you need full compatibility with the :class:`concurrent.futures.Executor`
API, use the object returned by the
:meth:`~distributed.client.Client.get_executor` method.
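
For instance, the following sketch (assuming a local cluster started by
``Client()``) uses that executor with the standard :mod:`concurrent.futures`
helpers:

.. code-block:: python

   >>> import concurrent.futures
   >>> from dask.distributed import Client

   >>> client = Client()                  # start a local scheduler and workers
   >>> executor = client.get_executor()   # concurrent.futures-compatible executor

   >>> futures = [executor.submit(pow, i, 2) for i in range(5)]
   >>> done, not_done = concurrent.futures.wait(futures)
   >>> sorted(f.result() for f in done)
   [0, 1, 4, 9, 16]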