Related Work
============

Writing the "related work" section for a project called "distributed" is a
Sisyphean task. We'll list a few notable projects below that you've probably
already heard of.

You may also find the `dask comparison with spark`_ of interest.

.. _`dask comparison with spark`: http://docs.dask.org/en/latest/spark.html


Big Data World
--------------

* The venerable Hadoop_ provides batch processing with the MapReduce
  programming paradigm. Python users typically use `Hadoop Streaming`_ or
  MRJob_.
* Spark_ builds on top of HDFS systems with a nicer API and in-memory
  processing. Python users typically use PySpark_.
* Storm_ provides streaming computation. Python users typically use
  streamparse_.

This is a woefully inadequate representation of the excellent work blossoming
in this space. A variety of other projects rival or complement those above.
Still, most "Big Data" processing hype probably centers around the three
projects above, or their derivatives.

.. _Hadoop: https://hadoop.apache.org/
.. _MRJob: https://pythonhosted.org/mrjob/
.. _`Hadoop Streaming`: https://hadoop.apache.org/docs/r1.2.1/streaming.html
.. _Spark: http://spark.apache.org/
.. _PySpark: http://spark.apache.org/docs/latest/api/python/
.. _Storm: http://storm.apache.org/
.. _streamparse: https://streamparse.readthedocs.io/en/latest/index.html
.. _Disco: http://discoproject.org/

Python Projects
---------------

There are dozens of Python projects for distributed computing. Here we list a
few of the more prominent projects that we see in active use today.

Task scheduling
~~~~~~~~~~~~~~~

* Celery_: An asynchronous task scheduler, focusing on real-time processing.
* Luigi_: A bulk big-data/batch task scheduler, with hooks to a variety of
  interesting data sources.

Ad hoc computation
~~~~~~~~~~~~~~~~~~

* `IPython Parallel`_: Allows for stateful remote control of several running
  IPython sessions.
* Scoop_: Implements the `concurrent.futures`_ API on distributed workers.
  Notably allows tasks to spawn more tasks.

Direct Communication
~~~~~~~~~~~~~~~~~~~~

* MPI4Py_: Wraps the Message Passing Interface popular in high performance
  computing.
* PyZMQ_: Wraps ZeroMQ, the high-performance asynchronous messaging library.

Venerable
~~~~~~~~~

There are a couple of older projects that often get mentioned:

* Dispy_: Embarrassingly parallel function evaluation
* Pyro_: Remote objects / RPC

.. _Luigi: https://luigi.readthedocs.io/en/latest/
.. _MPI4Py: http://mpi4py.readthedocs.io/en/stable/
.. _PyZMQ: https://github.com/zeromq/pyzmq
.. _Celery: http://www.celeryproject.org/
.. _`IPython Parallel`: https://ipyparallel.readthedocs.io/en/latest/
.. _Scoop: https://github.com/soravux/scoop/
.. _`concurrent.futures`: https://docs.python.org/3/library/concurrent.futures.html
.. _Dispy: http://dispy.sourceforge.net/
.. _Pyro: https://pythonhosted.org/Pyro4/

Relationship
------------

In relation to these projects ``distributed``...
* Supports data-local computation like Hadoop and Spark
* Uses a task graph with data dependencies abstraction like Luigi
* Supports ad-hoc applications, like IPython Parallel and Scoop


In-depth comparison to particular projects
------------------------------------------

IPython Parallel
~~~~~~~~~~~~~~~~

**Short Description**

`IPython Parallel`_ is a distributed computing framework from the IPython
project. It uses a centralized hub to farm out jobs to several ``ipengine``
processes running on remote workers. It communicates over ZeroMQ sockets and
centralizes communication through the central hub.

IPython Parallel has been around for a while and, while not particularly
fancy, is quite stable and robust.

IPython Parallel offers parallel ``map`` and remote ``apply`` functions that
route computations to remote workers:

.. code-block:: python

    >>> view = Client(...)[:]
    >>> results = view.map(func, sequence)
    >>> result = view.apply(func, *args, **kwargs)
    >>> future = view.apply_async(func, *args, **kwargs)

It also provides direct execution of code in the remote process and collection
of data from the remote namespace.

.. code-block:: python

    >>> view.execute('x = 1 + 2')
    >>> view['x']
    [3, 3, 3, 3, 3, 3]

**Brief Comparison**

Distributed and IPython Parallel are similar in that they provide ``map`` and
``apply/submit`` abstractions over distributed worker processes running Python.
Both manage the remote namespaces of those worker processes.

They are dissimilar in terms of their maturity, how worker nodes communicate
with each other, and in the complexity of algorithms that they enable.

**Distributed Advantages**

The primary advantages of ``distributed`` over IPython Parallel include

1. Peer-to-peer communication between workers
2. Dynamic task scheduling

``Distributed`` workers share data in a peer-to-peer fashion, without having to
send intermediate results through a central bottleneck. This allows
``distributed`` to be more effective for more complex algorithms and to manage
larger datasets in a more natural manner. IPython Parallel does not provide a
mechanism for workers to communicate with each other, except by using the
central node as an intermediary for data transfer or by relying on some other
medium, like a shared file system. Data transfer through the central node can
easily become a bottleneck, so IPython Parallel has mostly been used for
embarrassingly parallel work (the bulk of applications) rather than for more
sophisticated algorithms that require non-trivial communication patterns.

The distributed client includes a dynamic task scheduler capable of managing
deep data dependencies between tasks. The IPython Parallel docs include `a
recipe`_ for executing task graphs with data dependencies. This same idea is
core to all of ``distributed``, which uses a dynamic task scheduler for all
operations. Notably, ``distributed.Future`` objects can be used within
``submit/map/get`` calls before they have completed.

.. code-block:: python

    >>> x = client.submit(f, 1)  # returns a future
    >>> y = client.submit(f, 2)  # returns a future
    >>> z = client.submit(add, x, y)  # consumes futures
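A minimal, self-contained sketch of the same pattern is shown below. The
``inc`` and ``add`` functions are purely illustrative, and ``Client()`` with
no arguments starts a local cluster for experimentation:

.. code-block:: python

    from distributed import Client

    def inc(x):
        return x + 1

    def add(x, y):
        return x + y

    client = Client()             # local cluster; pass an address to connect remotely

    x = client.submit(inc, 1)     # Future, computes in the background
    y = client.submit(inc, 2)     # Future
    z = client.submit(add, x, y)  # depends on x and y; no need to wait for them

    print(z.result())             # block until complete and collect the result: 5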
The ability to use futures cheaply within ``submit`` and ``map`` methods
enables the construction of very sophisticated data pipelines with simple code.
Additionally, distributed can serve as a full dask task scheduler, enabling
support for distributed arrays, dataframes, machine learning pipelines, and any
other application built on dask graphs. The dynamic task schedulers within
``distributed`` are adapted from the dask_ task schedulers and so are fairly
sophisticated and efficient.

**IPython Parallel Advantages**

IPython Parallel has the following advantages over ``distributed``:

1. Maturity: IPython Parallel has been around for a while.
2. Explicit control over the worker processes: IPython Parallel lets you
   execute arbitrary statements on the workers, allowing it to serve in system
   administration tasks.
3. Deployment help: IPython Parallel has mechanisms built in to aid deployment
   on SGE, MPI, etc. Distributed does not have any such sugar, though it is
   fairly simple to
   `set up <https://docs.dask.org/en/latest/setup.html>`_ by hand.
4. Various other advantages: Over the years IPython Parallel has accrued a
   variety of helpful features like IPython interaction magics, ``@parallel``
   decorators, etc.

.. _`a recipe`: https://ipython.org/ipython-doc/3/parallel/dag_dependencies.html#dag-dependencies
.. _dask: https://dask.org/


concurrent.futures
~~~~~~~~~~~~~~~~~~

The :class:`distributed.Client` API is modeled after :mod:`concurrent.futures`
and :pep:`3148`. It has a few notable differences:

* ``distributed`` accepts :class:`~distributed.client.Future` objects within
  calls to ``submit/map``. When chaining computations, it is preferable to
  submit Future objects directly rather than wait on them before submission.
* The :meth:`~distributed.client.Client.map` method returns
  :class:`~distributed.client.Future` objects, not concrete results, and it
  returns immediately.
* Despite sharing a similar API, ``distributed``
  :class:`~distributed.client.Future` objects cannot always be substituted for
  :class:`concurrent.futures.Future` objects, especially when using ``wait()``
  or ``as_completed()``.
* Distributed generally does not support callbacks.

If you need full compatibility with the :class:`concurrent.futures.Executor`
API, use the object returned by the
:meth:`~distributed.client.Client.get_executor` method.
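A short sketch of how that compatibility layer might be used is below. The
``double`` function is hypothetical and used only for illustration, and
``Client()`` with no arguments starts a local cluster:

.. code-block:: python

    from distributed import Client

    def double(x):                    # illustrative function, not part of any API
        return 2 * x

    client = Client()                 # or Client("scheduler-address:8786")
    executor = client.get_executor()  # concurrent.futures.Executor-style interface

    future = executor.submit(double, 21)
    print(future.result())            # 42

    # like concurrent.futures.Executor.map: yields results, not futures
    print(list(executor.map(double, range(5))))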