############################
Distributed XGBoost with Ray
############################

`Ray <https://ray.io/>`_ is a general-purpose distributed execution framework.
Ray can be used to scale computations from a single node to a cluster of hundreds
of nodes without changing any code.

The Python bindings of Ray come with a collection of well-maintained
machine learning libraries for hyperparameter optimization and model serving.

The `XGBoost-Ray <https://github.com/ray-project/xgboost_ray>`_ project provides
an interface to run XGBoost training and prediction jobs on a Ray cluster. It allows
you to utilize distributed data representations, such as
`Modin <https://modin.readthedocs.io/en/latest/>`_ dataframes,
as well as distributed loading from cloud storage (e.g. Parquet files).

XGBoost-Ray integrates well with the hyperparameter optimization library Ray Tune, and
implements advanced fault tolerance handling mechanisms. With Ray you can scale
your training jobs to hundreds of nodes just by adding new
nodes to a cluster. You can also use Ray to leverage multi-GPU XGBoost training.

Installing and starting Ray
===========================
Ray can be installed from PyPI like this:

.. code-block:: bash

    pip install ray

If you're using Ray on a single machine, you don't need to do anything else -
XGBoost-Ray will automatically start a local Ray cluster when used.

If you want to use Ray on a cluster, you can use the
`Ray cluster launcher <https://docs.ray.io/en/master/cluster/cloud.html>`_.
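
If a Ray cluster is already running, you can connect to it before training
starts. The following is a minimal sketch, assuming the cluster was brought up
with the cluster launcher; XGBoost-Ray will then reuse the existing connection:

.. code-block:: python

    import ray

    # Connect to a running Ray cluster. Without a running cluster,
    # calling ``ray.init()`` with no arguments starts a local one instead.
    ray.init(address="auto")
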

Installing XGBoost-Ray
======================
XGBoost-Ray is also available via PyPI:

.. code-block:: bash

    pip install xgboost_ray

This will install all dependencies needed to run XGBoost on Ray, including
Ray itself if it hasn't been installed before.

Using XGBoost-Ray for training and prediction
=============================================
XGBoost-Ray uses the same API as core XGBoost. There are only two differences:

1. Instead of using an ``xgboost.DMatrix``, you'll use an ``xgboost_ray.RayDMatrix`` object.
2. There is an additional :class:`ray_params <xgboost_ray.RayParams>` parameter that you can use to configure distributed training.

Simple training example
-----------------------

To run this simple example, you'll need to install
`scikit-learn <https://scikit-learn.org/>`_ (with ``pip install scikit-learn``).

In this example, we will load the `breast cancer dataset <https://archive.ics.uci.edu/ml/datasets/breast+cancer>`_
and train a binary classifier using two actors.

.. code-block:: python

    from xgboost_ray import RayDMatrix, RayParams, train
    from sklearn.datasets import load_breast_cancer

    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        {
            "objective": "binary:logistic",
            "eval_metric": ["logloss", "error"],
        },
        train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=RayParams(num_actors=2, cpus_per_actor=1))

    bst.save_model("model.xgb")
    print("Final training error: {:.4f}".format(
        evals_result["train"]["error"][-1]))


The only differences compared to the non-distributed API are
the import statement (``xgboost_ray`` instead of ``xgboost``), using the
``RayDMatrix`` instead of the ``DMatrix``, and passing a :class:`RayParams <xgboost_ray.RayParams>` object.

The return object is a regular ``xgboost.Booster`` instance.
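
For example, the returned booster can be used directly with the regular
(single-node) XGBoost API. A minimal sketch, continuing the training example
above:

.. code-block:: python

    import xgboost as xgb

    # ``bst`` is the Booster returned by ``train()`` above. It can be used
    # with plain XGBoost, e.g. for local prediction on a regular DMatrix.
    dtest = xgb.DMatrix(train_x)
    local_pred = bst.predict(dtest)
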


Simple prediction example
-------------------------
.. code-block:: python

    from xgboost_ray import RayDMatrix, RayParams, predict
    from sklearn.datasets import load_breast_cancer
    import xgboost as xgb

    data, labels = load_breast_cancer(return_X_y=True)

    dpred = RayDMatrix(data, labels)

    bst = xgb.Booster(model_file="model.xgb")
    pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))

    print(pred_ray)

In this example, the data will be split across two actors. The prediction
results are returned in the same order as the input data.

The RayParams object
====================
The ``RayParams`` object is used to configure various settings relating to
distributed training.

.. autoclass:: xgboost_ray.RayParams

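
As an illustration, a typical configuration might look like the following
minimal sketch (the argument values are only examples; see the class reference
above for the full list of options):

.. code-block:: python

    from xgboost_ray import RayParams

    ray_params = RayParams(
        num_actors=4,            # number of distributed training actors
        cpus_per_actor=2,        # CPUs reserved for each actor
        gpus_per_actor=0,        # set to 1 to train with one GPU per actor
        max_actor_restarts=1,    # how often a failed actor may be restarted
        checkpoint_frequency=5,  # store an internal checkpoint every 5 rounds
    )
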
Multi-GPU training
==================
Ray automatically detects GPUs on cluster nodes.
In order to start training on multiple GPUs, all you have to do is
set the ``gpus_per_actor`` parameter of the ``RayParams`` object and set
``num_actors`` to the number of GPUs you want to use:

.. code-block:: python

    ray_params = RayParams(
        num_actors=4,
        gpus_per_actor=1,
    )

This will train on four GPUs in parallel.

Note that it usually does not make sense to allocate more than one GPU per actor,
as XGBoost relies on distributed libraries such as Dask or Ray to utilize multi-GPU
training.

Setting the number of CPUs per actor
====================================
XGBoost natively utilizes multithreading to speed up computations. Thus, if
you are training on CPUs only, there is likely no benefit in using more than
one actor per node. In that case, assuming you have a cluster of homogeneous nodes,
set the number of CPUs per actor to the number of CPUs available on each node,
and the number of actors to the number of nodes.

If you are using multi-GPU training on a single node, divide the number of
available CPUs evenly across all actors. For instance, if you have 16 CPUs and
4 GPUs available, each actor should access 1 GPU and 4 CPUs.

If you are using a cluster of heterogeneous nodes (with different numbers of CPUs),
you might just want to use the `greatest common divisor <https://en.wikipedia.org/wiki/Greatest_common_divisor>`_
for the number of CPUs per actor. E.g. if you have a cluster of three nodes with
4, 8, and 12 CPUs, respectively, you'd start 6 actors with 4 CPUs each for maximum
CPU utilization.
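
As a concrete sketch of the three scenarios above (all node, CPU, and GPU
counts are purely illustrative):

.. code-block:: python

    from xgboost_ray import RayParams

    # Homogeneous CPU cluster, e.g. 4 nodes with 8 CPUs each:
    # one actor per node, using all CPUs of that node.
    cpu_cluster_params = RayParams(num_actors=4, cpus_per_actor=8)

    # Single node with 4 GPUs and 16 CPUs:
    # one actor per GPU, CPUs divided evenly.
    gpu_node_params = RayParams(num_actors=4, gpus_per_actor=1, cpus_per_actor=4)

    # Heterogeneous cluster with 4, 8, and 12 CPUs (GCD = 4):
    # 6 actors with 4 CPUs each.
    heterogeneous_params = RayParams(num_actors=6, cpus_per_actor=4)
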

Fault tolerance
===============
XGBoost-Ray supports two fault tolerance modes. In **non-elastic training**, whenever
a training actor dies (e.g. because the node goes down), the training job stops.
XGBoost-Ray waits for the actor (or its resources) to become available again
(possibly on a different node) and then continues training once all actors are back.

In **elastic training**, whenever a training actor dies, the rest of the actors
continue training without the dead actor. If the actor comes back, it will be
re-integrated into training.

Please note that in elastic training this means that you will train on less data
for some time. The benefit is that you can continue training even if a node goes
away for the remainder of the training run, and don't have to wait until it is back up again.
In practice this usually leads to a very minor decrease in accuracy but a much shorter
training time compared to non-elastic training.

Both training modes can be configured using the respective :class:`RayParams <xgboost_ray.RayParams>`
parameters.
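
A minimal sketch of how the two modes could be configured (the argument values
are illustrative; see the ``RayParams`` reference above):

.. code-block:: python

    from xgboost_ray import RayParams

    # Non-elastic training: wait for failed actors to be restored before
    # continuing, allowing up to two restarts per actor.
    non_elastic_params = RayParams(num_actors=4, max_actor_restarts=2)

    # Elastic training: keep training with the remaining actors while up to
    # one actor is missing, and re-integrate it once it comes back.
    elastic_params = RayParams(
        num_actors=4,
        elastic_training=True,
        max_failed_actors=1,
        max_actor_restarts=2,
    )
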

Hyperparameter optimization
===========================
XGBoost-Ray integrates well with the `hyperparameter optimization framework Ray Tune <http://tune.io>`_.
Ray Tune uses Ray to start multiple distributed trials with different hyperparameter configurations.
If used with XGBoost-Ray, these trials will then start their own distributed training
jobs.

XGBoost-Ray automatically reports evaluation results back to Ray Tune. There are only
a few things you need to do:

1. Put your XGBoost-Ray training call into a function accepting parameter configurations
   (``train_model`` in the example below).
2. Create a :class:`RayParams <xgboost_ray.RayParams>` object (``ray_params``
   in the example below).
3. Define the parameter search space (``config`` dict in the example below).
4. Call ``tune.run()``:

   * The ``metric`` parameter should contain the metric you'd like to optimize.
     Usually this consists of the prefix passed to the ``evals`` argument of
     ``xgboost_ray.train()``, and an ``eval_metric`` passed in the
     XGBoost parameters (``train-error`` in the example below).
   * The ``mode`` should either be ``min`` or ``max``, depending on whether
     you'd like to minimize or maximize the metric.
   * The ``resources_per_trial`` parameter should be set using ``ray_params.get_tune_resources()``.
     This will make sure that each trial has the necessary resources available to
     start its distributed training job.

.. code-block:: python

    from xgboost_ray import RayDMatrix, RayParams, train
    from sklearn.datasets import load_breast_cancer

    num_actors = 4
    num_cpus_per_actor = 1

    ray_params = RayParams(
        num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)

    def train_model(config):
        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(train_x, train_y)

        evals_result = {}
        bst = train(
            params=config,
            dtrain=train_set,
            evals_result=evals_result,
            evals=[(train_set, "train")],
            verbose_eval=False,
            ray_params=ray_params)
        bst.save_model("model.xgb")

    from ray import tune

    # Specify the hyperparameter search space.
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9)
    }

    # Make sure to use the `get_tune_resources` method to set the `resources_per_trial`
    analysis = tune.run(
        train_model,
        config=config,
        metric="train-error",
        mode="min",
        num_samples=4,
        resources_per_trial=ray_params.get_tune_resources())
    print("Best hyperparameters", analysis.best_config)


Ray Tune supports various
`search algorithms and libraries (e.g. BayesOpt, Tree-Parzen estimators) <https://docs.ray.io/en/latest/tune/key-concepts.html#search-algorithms>`_,
`smart schedulers like successive halving <https://docs.ray.io/en/latest/tune/key-concepts.html#trial-schedulers>`_,
and other features. Please refer to the `Ray Tune documentation <http://tune.io>`_
for more information.
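
For instance, a trial scheduler could be plugged into the ``tune.run()`` call
from the example above. This is only a sketch (it reuses ``train_model``,
``config``, and ``ray_params`` from above, and the scheduler settings are
illustrative):

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    # ASHA (asynchronous successive halving) stops unpromising trials early.
    analysis = tune.run(
        train_model,
        config=config,
        metric="train-error",
        mode="min",
        num_samples=8,
        scheduler=ASHAScheduler(max_t=10),
        resources_per_trial=ray_params.get_tune_resources())
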

Additional resources
====================
* `XGBoost-Ray repository <https://github.com/ray-project/xgboost_ray>`_
* `XGBoost-Ray documentation <https://docs.ray.io/en/master/xgboost-ray.html>`_
* `Ray core documentation <https://docs.ray.io/en/master/index.html>`_
* `Ray Tune documentation <http://tune.io>`_
