.. _ensemble:

================
Ensemble methods
================

.. currentmodule:: sklearn.ensemble

The goal of **ensemble methods** is to combine the predictions of several
base estimators built with a given learning algorithm in order to improve
generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

- In **averaging methods**, the driving principle is to build several
  estimators independently and then to average their predictions. On average,
  the combined estimator is usually better than any of the single base
  estimators because its variance is reduced.

  **Examples:** :ref:`Bagging methods <bagging>`, :ref:`Forests of randomized trees <forest>`, ...

- By contrast, in **boosting methods**, base estimators are built sequentially
  and one tries to reduce the bias of the combined estimator. The motivation is
  to combine several weak models to produce a powerful ensemble.

  **Examples:** :ref:`AdaBoost <adaboost>`, :ref:`Gradient Tree Boosting <gradient_boosting>`, ...


.. _bagging:

Bagging meta-estimator
======================

In ensemble algorithms, bagging methods form a class of algorithms which build
several instances of a black-box estimator on random subsets of the original
training set and then aggregate their individual predictions to form a final
prediction. These methods are used as a way to reduce the variance of a base
estimator (e.g., a decision tree), by introducing randomization into its
construction procedure and then making an ensemble out of it. In many cases,
bagging methods constitute a very simple way to improve with respect to a
single model, without making it necessary to adapt the underlying base
algorithm. As they provide a way to reduce overfitting, bagging methods work
best with strong and complex models (e.g., fully developed decision trees), in
contrast with boosting methods which usually work best with weak models (e.g.,
shallow decision trees).

Bagging methods come in many flavours but mostly differ from each other by the
way they draw random subsets of the training set:

  * When random subsets of the dataset are drawn as random subsets of the
    samples, then this algorithm is known as Pasting [B1999]_.

  * When samples are drawn with replacement, then the method is known as
    Bagging [B1996]_.

  * When random subsets of the dataset are drawn as random subsets of
    the features, then the method is known as Random Subspaces [H1998]_.

  * Finally, when base estimators are built on subsets of both samples and
    features, then the method is known as Random Patches [LG2012]_.

In scikit-learn, bagging methods are offered as a unified
:class:`BaggingClassifier` meta-estimator (resp. :class:`BaggingRegressor`),
taking as input a user-specified base estimator along with parameters
specifying the strategy to draw random subsets. In particular, ``max_samples``
and ``max_features`` control the size of the subsets (in terms of samples and
features), while ``bootstrap`` and ``bootstrap_features`` control whether
samples and features are drawn with or without replacement. When using a subset
of the available samples the generalization accuracy can be estimated with the
out-of-bag samples by setting ``oob_score=True``. As an example, the
snippet below illustrates how to instantiate a bagging ensemble of
:class:`KNeighborsClassifier` base estimators, each built on random subsets of
50% of the samples and 50% of the features.

    >>> from sklearn.ensemble import BaggingClassifier
    >>> from sklearn.neighbors import KNeighborsClassifier
    >>> bagging = BaggingClassifier(KNeighborsClassifier(),
    ...                             max_samples=0.5, max_features=0.5)

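For instance, the snippet below sketches how the out-of-bag estimate mentioned
above could be obtained for a bagging ensemble of decision trees (the synthetic
dataset from :func:`~sklearn.datasets.make_classification` is used purely for
illustration)::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> X, y = make_classification(n_samples=1000, random_state=0)
    >>> bagging = BaggingClassifier(DecisionTreeClassifier(),
    ...                             bootstrap=True, oob_score=True,
    ...                             random_state=0).fit(X, y)
    >>> oob_accuracy = bagging.oob_score_  # out-of-bag estimate of accuracy
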
.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_bias_variance.py`

.. topic:: References

  .. [B1999] L. Breiman, "Pasting small votes for classification in large
         databases and on-line", Machine Learning, 36(1), 85-103, 1999.

  .. [B1996] L. Breiman, "Bagging predictors", Machine Learning, 24(2),
         123-140, 1996.

  .. [H1998] T. Ho, "The random subspace method for constructing decision
         forests", Pattern Analysis and Machine Intelligence, 20(8), 832-844,
         1998.

  .. [LG2012] G. Louppe and P. Geurts, "Ensembles on Random Patches",
         Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.

.. _forest:

Forests of randomized trees
===========================

The :mod:`sklearn.ensemble` module includes two averaging algorithms based
on randomized :ref:`decision trees <tree>`: the RandomForest algorithm
and the Extra-Trees method. Both algorithms are perturb-and-combine
techniques [B1998]_ specifically designed for trees. This means a diverse
set of classifiers is created by introducing randomness in the classifier
construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

Like other classifiers, forest classifiers have to be fitted with two
arrays: a sparse or dense array X of shape ``(n_samples, n_features)``
holding the training samples, and an array Y of shape ``(n_samples,)``
holding the target values (class labels) for the training samples::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X = [[0, 0], [1, 1]]
    >>> Y = [0, 1]
    >>> clf = RandomForestClassifier(n_estimators=10)
    >>> clf = clf.fit(X, Y)

Like :ref:`decision trees <tree>`, forests of trees also extend to
:ref:`multi-output problems <tree_multioutput>` (if Y is an array
of shape ``(n_samples, n_outputs)``).

Random Forests
--------------

In random forests (see :class:`RandomForestClassifier` and
:class:`RandomForestRegressor` classes), each tree in the ensemble is built
from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set.

Furthermore, when splitting each node during the construction of a tree, the
best split is found either from all input features or a random subset of size
``max_features``. (See the :ref:`parameter tuning guidelines
<random_forest_parameters>` for more details).

The purpose of these two sources of randomness is to decrease the variance of
the forest estimator. Indeed, individual decision trees typically exhibit high
variance and tend to overfit. The injected randomness in forests yields decision
trees with somewhat decoupled prediction errors. By taking an average of those
predictions, some errors can cancel out. Random forests achieve a reduced
variance by combining diverse trees, sometimes at the cost of a slight increase
in bias. In practice, the variance reduction is often significant, hence yielding
an overall better model.

In contrast to the original publication [B2001]_, the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single class.

Extremely Randomized Trees
--------------------------

In extremely randomized trees (see :class:`ExtraTreesClassifier`
and :class:`ExtraTreesRegressor` classes), randomness goes one step
further in the way splits are computed. As in random forests, a random
subset of candidate features is used, but instead of looking for the
most discriminative thresholds, thresholds are drawn at random for each
candidate feature and the best of these randomly-generated thresholds is
picked as the splitting rule. This usually reduces the variance
of the model a bit more, at the expense of a slightly greater increase
in bias::

    >>> from sklearn.model_selection import cross_val_score
    >>> from sklearn.datasets import make_blobs
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.ensemble import ExtraTreesClassifier
    >>> from sklearn.tree import DecisionTreeClassifier

    >>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
    ...     random_state=0)

    >>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
    ...     random_state=0)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean()
    0.98...

    >>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    ...     min_samples_split=2, random_state=0)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean()
    0.999...

    >>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
    ...     min_samples_split=2, random_state=0)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean() > 0.999
    True

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_forest_iris_001.png
    :target: ../auto_examples/ensemble/plot_forest_iris.html
    :align: center
    :scale: 75%

.. _random_forest_parameters:

Parameters
----------

The main parameters to adjust when using these methods are ``n_estimators`` and
``max_features``. The former is the number of trees in the forest. The larger
the better, but also the longer it will take to compute. In addition, note that
results will stop getting significantly better beyond a critical number of
trees. The latter is the size of the random subsets of features to consider
when splitting a node. The lower ``max_features``, the greater the reduction of
variance, but also the greater the increase in bias. Empirically, good default
values are ``max_features=None`` (always considering all features instead of a
random subset) for regression problems, and ``max_features="sqrt"`` (using a
random subset of size ``sqrt(n_features)``) for classification tasks (where
``n_features`` is the number of features in the data). Good results are often
achieved when setting ``max_depth=None`` in combination with
``min_samples_split=2`` (i.e., when fully developing the trees). Bear in mind
though that these values are usually not optimal, and might result in models
that consume a lot of RAM. The best parameter values should always be
cross-validated. In addition, note that in random forests, bootstrap samples
are used by default (``bootstrap=True``) while the default strategy for
extra-trees is to use the whole dataset (``bootstrap=False``). When using
bootstrap sampling the generalization accuracy can be estimated on the left out
or out-of-bag samples. This can be enabled by setting ``oob_score=True``.
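
For instance, a classification forest following these guidelines could be set
up as sketched below (the synthetic dataset and the exact parameter values are
illustrative only)::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    >>> forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
    ...                                 max_depth=None, min_samples_split=2,
    ...                                 bootstrap=True, oob_score=True,
    ...                                 random_state=0).fit(X, y)
    >>> oob_accuracy = forest.oob_score_  # estimate from the out-of-bag samples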

.. note::

    The size of the model with the default parameters is :math:`O(M \cdot N \cdot \log(N))`,
    where :math:`M` is the number of trees and :math:`N` is the number of samples.
    In order to reduce the size of the model, you can change these parameters:
    ``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and ``min_samples_leaf``.

Parallelization
---------------

Finally, this module also features the parallel construction of the trees
and the parallel computation of the predictions through the ``n_jobs``
parameter. If ``n_jobs=k`` then computations are partitioned into
``k`` jobs, and run on ``k`` cores of the machine. If ``n_jobs=-1``
then all cores available on the machine are used. Note that because of
inter-process communication overhead, the speedup might not be linear
(i.e., using ``k`` jobs will unfortunately not be ``k`` times as
fast). Significant speedup can still be achieved though when building
a large number of trees, or when building a single tree requires a fair
amount of time (e.g., on large datasets).
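
For example (a sketch only; actual speed-ups depend on the machine, the data
and the number of trees), both fitting and prediction can use all available
cores::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X, y = make_classification(n_samples=1000, random_state=0)
    >>> forest = RandomForestClassifier(n_estimators=500, n_jobs=-1,
    ...                                 random_state=0).fit(X, y)
    >>> predictions = forest.predict(X)  # prediction is parallelized as well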

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_iris.py`
 * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_importances_faces.py`
 * :ref:`sphx_glr_auto_examples_miscellaneous_plot_multioutput_face_completion.py`

.. topic:: References

 .. [B2001] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

 .. [B1998] L. Breiman, "Arcing Classifiers", Annals of Statistics, 1998.

 * P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized
   trees", Machine Learning, 63(1), 3-42, 2006.

.. _random_forest_feature_importance:

Feature importance evaluation
-----------------------------

The relative rank (i.e. depth) of a feature used as a decision node in a
tree can be used to assess the relative importance of that feature with
respect to the predictability of the target variable. Features used at
the top of the tree contribute to the final prediction decision of a
larger fraction of the input samples. The **expected fraction of the
samples** they contribute to can thus be used as an estimate of the
**relative importance of the features**. In scikit-learn, the fraction of
samples a feature contributes to is combined with the decrease in impurity
from splitting them to create a normalized estimate of the predictive power
of that feature.

By **averaging** the estimates of predictive ability over several randomized
trees one can **reduce the variance** of such an estimate and use it
for feature selection. This is known as the mean decrease in impurity, or MDI.
Refer to [L2014]_ for more information on MDI and feature importance
evaluation with Random Forests.

.. warning::

  The impurity-based feature importances computed on tree-based models suffer
  from two flaws that can lead to misleading conclusions. First, they are
  computed on statistics derived from the training dataset and therefore **do
  not necessarily inform us on which features are most important to make good
  predictions on a held-out dataset**. Secondly, **they favor high cardinality
  features**, that is features with many unique values.
  :ref:`permutation_importance` is an alternative to impurity-based feature
  importance that does not suffer from these flaws. These two methods of
  obtaining feature importance are explored in:
  :ref:`sphx_glr_auto_examples_inspection_plot_permutation_importance.py`.

The following example shows a color-coded representation of the relative
importances of each individual pixel for a face recognition task using
an :class:`ExtraTreesClassifier` model.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_forest_importances_faces_001.png
   :target: ../auto_examples/ensemble/plot_forest_importances_faces.html
   :align: center
   :scale: 75

In practice those estimates are stored as an attribute named
``feature_importances_`` on the fitted model. This is an array with shape
``(n_features,)`` whose values are positive and sum to 1.0. The higher
the value, the more important is the contribution of the matching feature
to the prediction function.
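
As a small illustration on the iris dataset (the exact importance values
depend on the data and the random state), these estimates can be inspected as
follows::

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X, y = load_iris(return_X_y=True)
    >>> forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    >>> importances = forest.feature_importances_
    >>> importances.shape  # one importance value per feature
    (4,)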

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_importances_faces.py`
 * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_importances.py`

.. topic:: References

 .. [L2014] G. Louppe,
         "Understanding Random Forests: From Theory to Practice",
         PhD Thesis, U. of Liege, 2014.

.. _random_trees_embedding:

Totally Random Trees Embedding
------------------------------

:class:`RandomTreesEmbedding` implements an unsupervised transformation of the
data. Using a forest of completely random trees, :class:`RandomTreesEmbedding`
encodes the data by the indices of the leaves a data point ends up in. This
index is then encoded in a one-of-K manner, leading to a high dimensional,
sparse binary coding.
This coding can be computed very efficiently and can then be used as a basis
for other learning tasks.
The size and sparsity of the code can be influenced by choosing the number of
trees and the maximum depth per tree. For each tree in the ensemble, the coding
contains one entry of one. The size of the coding is at most
``n_estimators * 2 ** max_depth``, the maximum number of leaves in the forest.

As neighboring data points are more likely to lie within the same leaf of a
tree, the transformation performs an implicit, non-parametric density
estimation.
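
A minimal sketch of such a transformation (the tiny dataset below is purely
illustrative) looks as follows::

    >>> from sklearn.ensemble import RandomTreesEmbedding
    >>> X = [[0, 0], [1, 1], [0, 1], [1, 0]]
    >>> embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3,
    ...                                 random_state=0).fit(X)
    >>> X_transformed = embedder.transform(X)  # sparse one-hot leaf encoding
    >>> X_transformed.shape[0]
    4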

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_random_forest_embedding.py`

 * :ref:`sphx_glr_auto_examples_manifold_plot_lle_digits.py` compares non-linear
   dimensionality reduction techniques on handwritten digits.

 * :ref:`sphx_glr_auto_examples_ensemble_plot_feature_transformation.py` compares
   supervised and unsupervised tree based feature transformations.

.. seealso::

   :ref:`manifold` techniques can also be useful to derive non-linear
   representations of feature space; these approaches also focus on
   dimensionality reduction.


.. _adaboost:

AdaBoost
========

The module :mod:`sklearn.ensemble` includes the popular boosting algorithm
AdaBoost, introduced in 1995 by Freund and Schapire [FS1995]_.

The core principle of AdaBoost is to fit a sequence of weak learners (i.e.,
models that are only slightly better than random guessing, such as small
decision trees) on repeatedly modified versions of the data. The predictions
from all of them are then combined through a weighted majority vote (or sum) to
produce the final prediction. The data modifications at each so-called boosting
iteration consist of applying weights :math:`w_1`, :math:`w_2`, ..., :math:`w_N`
to each of the training samples. Initially, those weights are all set to
:math:`w_i = 1/N`, so that the first step simply trains a weak learner on the
original data. For each successive iteration, the sample weights are
individually modified and the learning algorithm is reapplied to the reweighted
data. At a given step, those training examples that were incorrectly predicted
by the boosted model induced at the previous step have their weights increased,
whereas the weights are decreased for those that were predicted correctly. As
iterations proceed, examples that are difficult to predict receive
ever-increasing influence. Each subsequent weak learner is thereby forced to
concentrate on the examples that are missed by the previous ones in the sequence
[HTF]_.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_adaboost_hastie_10_2_001.png
   :target: ../auto_examples/ensemble/plot_adaboost_hastie_10_2.html
   :align: center
   :scale: 75

AdaBoost can be used both for classification and regression problems:

  - For multi-class classification, :class:`AdaBoostClassifier` implements
    AdaBoost-SAMME and AdaBoost-SAMME.R [ZZRH2009]_.

  - For regression, :class:`AdaBoostRegressor` implements AdaBoost.R2 [D1997]_.

Usage
-----

The following example shows how to fit an AdaBoost classifier with 100 weak
learners::

    >>> from sklearn.model_selection import cross_val_score
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.ensemble import AdaBoostClassifier

    >>> X, y = load_iris(return_X_y=True)
    >>> clf = AdaBoostClassifier(n_estimators=100)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean()
    0.9...

The number of weak learners is controlled by the parameter ``n_estimators``. The
``learning_rate`` parameter controls the contribution of the weak learners in
the final combination. By default, weak learners are decision stumps. Different
weak learners can be specified through the ``base_estimator`` parameter.
The main parameters to tune to obtain good results are ``n_estimators`` and
the complexity of the base estimators (e.g., their depth ``max_depth`` or the
minimum required number of samples to consider a split ``min_samples_split``).
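
For instance, slightly deeper trees can be used as weak learners instead of
the default stumps (a sketch; ``base_estimator`` is the parameter name
referred to above, and the values shown are illustrative)::

    >>> from sklearn.ensemble import AdaBoostClassifier
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> clf = AdaBoostClassifier(
    ...     base_estimator=DecisionTreeClassifier(max_depth=2),
    ...     n_estimators=200, learning_rate=0.5)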

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_hastie_10_2.py` compares the
   classification error of a decision stump, decision tree, and a boosted
   decision stump using AdaBoost-SAMME and AdaBoost-SAMME.R.

 * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_multiclass.py` shows the performance
   of AdaBoost-SAMME and AdaBoost-SAMME.R on a multi-class problem.

 * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_twoclass.py` shows the decision boundary
   and decision function values for a non-linearly separable two-class problem
   using AdaBoost-SAMME.

 * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_regression.py` demonstrates regression
   with the AdaBoost.R2 algorithm.

.. topic:: References

 .. [FS1995] Y. Freund, and R. Schapire, "A Decision-Theoretic Generalization of
             On-Line Learning and an Application to Boosting", 1997.

 .. [ZZRH2009] J. Zhu, H. Zou, S. Rosset, T. Hastie. "Multi-class AdaBoost",
               2009.

 .. [D1997] H. Drucker. "Improving Regressors using Boosting Techniques", 1997.

 .. [HTF] T. Hastie, R. Tibshirani and J. Friedman, "Elements of
              Statistical Learning Ed. 2", Springer, 2009.


.. _gradient_boosting:

Gradient Tree Boosting
======================

`Gradient Tree Boosting <https://en.wikipedia.org/wiki/Gradient_boosting>`_
or Gradient Boosted Decision Trees (GBDT) is a generalization
of boosting to arbitrary
differentiable loss functions. GBDT is an accurate and effective
off-the-shelf procedure that can be used for both regression and
classification problems in a
variety of areas including Web search ranking and ecology.

The module :mod:`sklearn.ensemble` provides methods
for both classification and regression via gradient boosted decision
trees.

.. note::

  Scikit-learn 0.21 introduces two new implementations of
  gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
  and :class:`HistGradientBoostingRegressor`, inspired by
  `LightGBM <https://github.com/Microsoft/LightGBM>`__ (see [LightGBM]_).

  These histogram-based estimators can be **orders of magnitude faster**
  than :class:`GradientBoostingClassifier` and
  :class:`GradientBoostingRegressor` when the number of samples is larger
  than tens of thousands of samples.

  They also have built-in support for missing values, which avoids the need
  for an imputer.

  These estimators are described in more detail below in
  :ref:`histogram_based_gradient_boosting`.

  The following guide focuses on :class:`GradientBoostingClassifier` and
  :class:`GradientBoostingRegressor`, which might be preferred for small
  sample sizes since binning may lead to split points that are too approximate
  in this setting.


The usage and the parameters of :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` are described below. The 2 most important
parameters of these estimators are `n_estimators` and `learning_rate`.

Classification
--------------

:class:`GradientBoostingClassifier` supports both binary and multi-class
classification.
The following example shows how to fit a gradient boosting classifier
with 100 decision stumps as weak learners::

    >>> from sklearn.datasets import make_hastie_10_2
    >>> from sklearn.ensemble import GradientBoostingClassifier

    >>> X, y = make_hastie_10_2(random_state=0)
    >>> X_train, X_test = X[:2000], X[2000:]
    >>> y_train, y_test = y[:2000], y[2000:]

    >>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    ...     max_depth=1, random_state=0).fit(X_train, y_train)
    >>> clf.score(X_test, y_test)
    0.913...

The number of weak learners (i.e. regression trees) is controlled by the
parameter ``n_estimators``; :ref:`the size of each tree
<gradient_boosting_tree_size>` can be controlled either by setting the tree
depth via ``max_depth`` or by setting the number of leaf nodes via
``max_leaf_nodes``. The ``learning_rate`` is a hyper-parameter in the range
(0.0, 1.0] that controls overfitting via :ref:`shrinkage
<gradient_boosting_shrinkage>`.

.. note::

   Classification with more than 2 classes requires the induction
   of ``n_classes`` regression trees at each iteration,
   thus, the total number of induced trees equals
   ``n_classes * n_estimators``. For datasets with a large number
   of classes we strongly recommend using
   :class:`HistGradientBoostingClassifier` as an alternative to
   :class:`GradientBoostingClassifier`.

Regression
----------

:class:`GradientBoostingRegressor` supports a number of
:ref:`different loss functions <gradient_boosting_loss>`
for regression which can be specified via the argument
``loss``; the default loss function for regression is squared error
(``'squared_error'``).

::

    >>> import numpy as np
    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.datasets import make_friedman1
    >>> from sklearn.ensemble import GradientBoostingRegressor

    >>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
    >>> X_train, X_test = X[:200], X[200:]
    >>> y_train, y_test = y[:200], y[200:]
    >>> est = GradientBoostingRegressor(
    ...     n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0,
    ...     loss='squared_error'
    ... ).fit(X_train, y_train)
    >>> mean_squared_error(y_test, est.predict(X_test))
    5.00...

The figure below shows the results of applying :class:`GradientBoostingRegressor`
with least squares loss and 500 base learners to the diabetes dataset
(:func:`sklearn.datasets.load_diabetes`).
The plot on the left shows the train and test error at each iteration.
The train error at each iteration is stored in the
:attr:`~GradientBoostingRegressor.train_score_` attribute
of the gradient boosting model. The test error at each iteration can be obtained
via the :meth:`~GradientBoostingRegressor.staged_predict` method which returns a
generator that yields the predictions at each stage. Plots like these can be used
to determine the optimal number of trees (i.e. ``n_estimators``) by early stopping.
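
For example, the test error at each stage can be computed from
``staged_predict`` (a sketch reusing ``est``, ``X_test``, ``y_test`` and
``mean_squared_error`` from the regression example above)::

    >>> test_errors = [mean_squared_error(y_test, y_pred)
    ...                for y_pred in est.staged_predict(X_test)]
    >>> len(test_errors) == est.n_estimators_
    True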

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_gradient_boosting_regression_001.png
   :target: ../auto_examples/ensemble/plot_gradient_boosting_regression.html
   :align: center
   :scale: 75

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`
 * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_oob.py`

.. _gradient_boosting_warm_start:

Fitting additional weak-learners
--------------------------------

Both :class:`GradientBoostingRegressor` and :class:`GradientBoostingClassifier`
support ``warm_start=True`` which allows you to add more estimators to an already
fitted model.

::

  >>> _ = est.set_params(n_estimators=200, warm_start=True)  # set warm_start and new number of trees
  >>> _ = est.fit(X_train, y_train)  # fit additional 100 trees to est
  >>> mean_squared_error(y_test, est.predict(X_test))
  3.84...

.. _gradient_boosting_tree_size:

Controlling the tree size
-------------------------

The size of the regression tree base learners defines the level of variable
interactions that can be captured by the gradient boosting model. In general,
a tree of depth ``h`` can capture interactions of order ``h``.
There are two ways in which the size of the individual regression trees can
be controlled.

If you specify ``max_depth=h`` then complete binary trees
of depth ``h`` will be grown. Such trees will have (at most) ``2**h`` leaf nodes
and ``2**h - 1`` split nodes.

Alternatively, you can control the tree size by specifying the number of
leaf nodes via the parameter ``max_leaf_nodes``. In this case,
trees will be grown using best-first search where nodes with the highest improvement
in impurity will be expanded first.
A tree with ``max_leaf_nodes=k`` has ``k - 1`` split nodes and thus can
model interactions of up to order ``max_leaf_nodes - 1``.

We found that ``max_leaf_nodes=k`` gives comparable results to ``max_depth=k-1``
but is significantly faster to train at the expense of a slightly higher
training error.
The parameter ``max_leaf_nodes`` corresponds to the variable ``J`` in the
chapter on gradient boosting in [F2001]_ and is related to the parameter
``interaction.depth`` in R's gbm package where ``max_leaf_nodes == interaction.depth + 1``.

Mathematical formulation
------------------------

We first present GBRT for regression, and then detail the classification
case.

Regression
^^^^^^^^^^

GBRT regressors are additive models whose prediction :math:`\hat{y}_i` for a
given input :math:`x_i` is of the following form:

  .. math::

    \hat{y}_i = F_M(x_i) = \sum_{m=1}^{M} h_m(x_i)

where the :math:`h_m` are estimators called *weak learners* in the context
of boosting. Gradient Tree Boosting uses :ref:`decision tree regressors
<tree>` of fixed size as weak learners. The constant :math:`M` corresponds to the
`n_estimators` parameter.

Similar to other boosting algorithms, a GBRT is built in a greedy fashion:

  .. math::

    F_m(x) = F_{m-1}(x) + h_m(x),

where the newly added tree :math:`h_m` is fitted in order to minimize a sum
of losses :math:`L_m`, given the previous ensemble :math:`F_{m-1}`:

  .. math::

    h_m =  \arg\min_{h} L_m = \arg\min_{h} \sum_{i=1}^{n}
    l(y_i, F_{m-1}(x_i) + h(x_i)),

where :math:`l(y_i, F(x_i))` is defined by the `loss` parameter, detailed
in the next section.

By default, the initial model :math:`F_{0}` is chosen as the constant that
minimizes the loss: for a least-squares loss, this is the empirical mean of
the target values. The initial model can also be specified via the ``init``
argument.

Using a first-order Taylor approximation, the value of :math:`l` can be
approximated as follows:

  .. math::

    l(y_i, F_{m-1}(x_i) + h_m(x_i)) \approx
    l(y_i, F_{m-1}(x_i))
    + h_m(x_i)
    \left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m - 1}}.

.. note::

  Briefly, a first-order Taylor approximation says that
  :math:`l(z) \approx l(a) + (z - a) \frac{\partial l(a)}{\partial a}`.
  Here, :math:`z` corresponds to :math:`F_{m - 1}(x_i) + h_m(x_i)`, and
  :math:`a` corresponds to :math:`F_{m-1}(x_i)`.

The quantity :math:`\left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)}
\right]_{F=F_{m - 1}}` is the derivative of the loss with respect to its
second parameter, evaluated at :math:`F_{m-1}(x)`. It is easy to compute for
any given :math:`F_{m - 1}(x_i)` in a closed form since the loss is
differentiable. We will denote it by :math:`g_i`.

Removing the constant terms, we have:

  .. math::

    h_m \approx \arg\min_{h} \sum_{i=1}^{n} h(x_i) g_i

This is minimized if :math:`h(x_i)` is fitted to predict a value that is
proportional to the negative gradient :math:`-g_i`. Therefore, at each
iteration, **the estimator** :math:`h_m` **is fitted to predict the negative
gradients of the samples**. The gradients are updated at each iteration.
This can be considered as some kind of gradient descent in a functional
space.

.. note::

  For some losses, e.g. the least absolute deviation (LAD) where the gradients
  are :math:`\pm 1`, the values predicted by a fitted :math:`h_m` are not
  accurate enough: the tree can only output integer values. As a result, the
  leaf values of the tree :math:`h_m` are modified once the tree is
  fitted, such that the leaf values minimize the loss :math:`L_m`. The
  update is loss-dependent: for the LAD loss, the value of a leaf is updated
  to the median of the samples in that leaf.

Classification
^^^^^^^^^^^^^^

Gradient boosting for classification is very similar to the regression case.
However, the sum of the trees :math:`F_M(x_i) = \sum_m h_m(x_i)` is not
homogeneous to a prediction: it cannot be a class, since the trees predict
continuous values.

The mapping from the value :math:`F_M(x_i)` to a class or a probability is
loss-dependent. For the deviance (or log-loss), the probability that
:math:`x_i` belongs to the positive class is modeled as :math:`p(y_i = 1 |
x_i) = \sigma(F_M(x_i))` where :math:`\sigma` is the sigmoid function.

For multiclass classification, :math:`K` trees (for :math:`K` classes) are built at each of
the :math:`M` iterations. The probability that :math:`x_i` belongs to class
:math:`k` is modeled as a softmax of the :math:`F_{M,k}(x_i)` values.

Note that even for a classification task, the :math:`h_m` sub-estimator is
still a regressor, not a classifier. This is because the sub-estimators are
trained to predict (negative) *gradients*, which are always continuous
quantities.

.. _gradient_boosting_loss:

Loss Functions
--------------

The following loss functions are supported and can be specified using
the parameter ``loss``:

  * Regression

    * Squared error (``'squared_error'``): The natural choice for regression
      due to its superior computational properties. The initial model is
      given by the mean of the target values.
    * Least absolute deviation (``'lad'``): A robust loss function for
      regression. The initial model is given by the median of the
      target values.
    * Huber (``'huber'``): Another robust loss function that combines
      least squares and least absolute deviation; use ``alpha`` to
      control the sensitivity with regards to outliers (see [F2001]_ for
      more details).
    * Quantile (``'quantile'``): A loss function for quantile regression.
      Use ``0 < alpha < 1`` to specify the quantile. This loss function
      can be used to create prediction intervals
      (see :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py`).

  * Classification

    * Binomial deviance (``'deviance'``): The binomial
      negative log-likelihood loss function for binary classification (provides
      probability estimates). The initial model is given by the
      log odds-ratio.
    * Multinomial deviance (``'deviance'``): The multinomial
      negative log-likelihood loss function for multi-class classification with
      ``n_classes`` mutually exclusive classes. It provides
      probability estimates. The initial model is given by the
      prior probability of each class. At each iteration ``n_classes``
      regression trees have to be constructed which makes GBRT rather
      inefficient for data sets with a large number of classes.
    * Exponential loss (``'exponential'``): The same loss function
      as :class:`AdaBoostClassifier`. Less robust to mislabeled
      examples than ``'deviance'``; can only be used for binary
      classification.

.. _gradient_boosting_shrinkage:

Shrinkage via learning rate
---------------------------

[F2001]_ proposed a simple regularization strategy that scales
the contribution of each weak learner by a constant factor :math:`\nu`:

.. math::

    F_m(x) = F_{m-1}(x) + \nu h_m(x)

The parameter :math:`\nu` is also called the **learning rate** because
it scales the step length of the gradient descent procedure; it can
be set via the ``learning_rate`` parameter.

The parameter ``learning_rate`` strongly interacts with the parameter
``n_estimators``, the number of weak learners to fit. Smaller values
of ``learning_rate`` require larger numbers of weak learners to maintain
a constant training error. Empirical evidence suggests that small
values of ``learning_rate`` favor better test error. [HTF]_
recommend setting the learning rate to a small constant
(e.g. ``learning_rate <= 0.1``) and choosing ``n_estimators`` by early
stopping. For a more detailed discussion of the interaction between
``learning_rate`` and ``n_estimators`` see [R2007]_.

Subsampling
-----------

[F1999]_ proposed stochastic gradient boosting, which combines gradient
boosting with bootstrap averaging (bagging). At each iteration
the base classifier is trained on a fraction ``subsample`` of
the available training data. The subsample is drawn without replacement.
A typical value of ``subsample`` is 0.5.

The figure below illustrates the effect of shrinkage and subsampling
on the goodness-of-fit of the model. We can clearly see that shrinkage
outperforms no-shrinkage. Subsampling with shrinkage can further increase
the accuracy of the model. Subsampling without shrinkage, on the other hand,
does poorly.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_gradient_boosting_regularization_001.png
   :target: ../auto_examples/ensemble/plot_gradient_boosting_regularization.html
   :align: center
   :scale: 75

Another strategy to reduce the variance is by subsampling the features,
analogous to the random splits in :class:`RandomForestClassifier`.
The number of subsampled features can be controlled via the ``max_features``
parameter.

.. note:: Using a small ``max_features`` value can significantly decrease the runtime.

Stochastic gradient boosting allows computing out-of-bag estimates of the
test deviance by computing the improvement in deviance on the examples that are
not included in the bootstrap sample (i.e. the out-of-bag examples).
The improvements are stored in the attribute
:attr:`~GradientBoostingRegressor.oob_improvement_`. ``oob_improvement_[i]`` holds
the improvement in terms of the loss on the OOB samples if you add the i-th stage
to the current predictions.
Out-of-bag estimates can be used for model selection, for example to determine
the optimal number of iterations. OOB estimates are usually very pessimistic,
thus we recommend using cross-validation instead and only using OOB if
cross-validation is too time consuming.
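
A minimal sketch of stochastic gradient boosting with out-of-bag estimates
(using an illustrative synthetic dataset) could look as follows::

    >>> from sklearn.datasets import make_hastie_10_2
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> X, y = make_hastie_10_2(n_samples=1000, random_state=0)
    >>> sgbt = GradientBoostingClassifier(n_estimators=100, subsample=0.5,
    ...                                   max_features=0.5,
    ...                                   random_state=0).fit(X, y)
    >>> oob = sgbt.oob_improvement_  # one improvement value per stage
    >>> oob.shape
    (100,)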

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regularization.py`
 * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_oob.py`
 * :ref:`sphx_glr_auto_examples_ensemble_plot_ensemble_oob.py`

Interpretation with feature importance
--------------------------------------

Individual decision trees can be interpreted easily by simply
visualizing the tree structure. Gradient boosting models, however,
comprise hundreds of regression trees, thus they cannot be easily
interpreted by visual inspection of the individual trees. Fortunately,
a number of techniques have been proposed to summarize and interpret
gradient boosting models.

Often features do not contribute equally to predicting the target
response; in many situations the majority of the features are in fact
irrelevant.
When interpreting a model, the first question usually is: what are
those important features and how do they contribute to predicting
the target response?

Individual decision trees intrinsically perform feature selection by selecting
appropriate split points. This information can be used to measure the
importance of each feature; the basic idea is: the more often a
feature is used in the split points of a tree the more important that
feature is. This notion of importance can be extended to decision tree
ensembles by simply averaging the impurity-based feature importance of each tree (see
:ref:`random_forest_feature_importance` for more details).

The feature importance scores of a fitted gradient boosting model can be
accessed via the ``feature_importances_`` property::

    >>> from sklearn.datasets import make_hastie_10_2
    >>> from sklearn.ensemble import GradientBoostingClassifier

    >>> X, y = make_hastie_10_2(random_state=0)
    >>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    ...     max_depth=1, random_state=0).fit(X, y)
    >>> clf.feature_importances_
    array([0.10..., 0.10..., 0.11..., ...

Note that this computation of feature importance is based on impurity, and it
is distinct from :func:`sklearn.inspection.permutation_importance` which is
based on permutation of the features.

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`

.. _histogram_based_gradient_boosting:

Histogram-Based Gradient Boosting
=================================

Scikit-learn 0.21 introduced two new implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`__ (see [LightGBM]_).

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands of samples.

They also have built-in support for missing values, which avoids the need
for an imputer.

These fast estimators first bin the input samples ``X`` into
integer-valued bins (typically 256 bins) which tremendously reduces the
number of splitting points to consider, and allows the algorithm to
leverage integer-based data structures (histograms) instead of relying on
sorted continuous values when building the trees. The API of these
estimators is slightly different, and some of the features from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
are not yet supported, for instance some loss functions.

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py`

Usage
-----

Most of the parameters are unchanged from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`.
One exception is the ``max_iter`` parameter that replaces ``n_estimators``, and
controls the number of iterations of the boosting process::

  >>> from sklearn.ensemble import HistGradientBoostingClassifier
  >>> from sklearn.datasets import make_hastie_10_2

  >>> X, y = make_hastie_10_2(random_state=0)
  >>> X_train, X_test = X[:2000], X[2000:]
  >>> y_train, y_test = y[:2000], y[2000:]

  >>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
  >>> clf.score(X_test, y_test)
  0.8965

Available losses for regression are 'squared_error',
'absolute_error', which is less sensitive to outliers, and
'poisson', which is well suited to model counts and frequencies. For
classification, 'binary_crossentropy' is used for binary classification and
'categorical_crossentropy' is used for multiclass classification. By default
the loss is 'auto' and will select the appropriate loss depending on
:term:`y` passed to :term:`fit`.
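
For instance, a regressor that is less sensitive to outliers can be obtained
by selecting the absolute error loss (a sketch; the synthetic data is purely
illustrative)::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.ensemble import HistGradientBoostingRegressor
  >>> X, y = make_regression(n_samples=500, random_state=0)
  >>> reg = HistGradientBoostingRegressor(loss='absolute_error').fit(X, y)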

The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.

The number of bins used to bin the data is controlled with the ``max_bins``
parameter. Using fewer bins acts as a form of regularization. It is
generally recommended to use as many bins as possible, which is the default.

The ``l2_regularization`` parameter is a regularizer on the loss function and
corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.

Note that **early-stopping is enabled by default if the number of samples is
larger than 10,000**. The early-stopping behaviour is controlled via the
``early_stopping``, ``scoring``, ``validation_fraction``,
``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop
using an arbitrary :term:`scorer`, or just the training or validation loss.
Note that for technical reasons, using a scorer is significantly slower than
using the loss. By default, early-stopping is performed if there are at least
10,000 samples in the training set, using the validation loss.
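
For example (a sketch on synthetic data, using the parameters listed above),
early stopping on a held-out validation fraction can be requested explicitly::

  >>> from sklearn.datasets import make_classification
  >>> from sklearn.ensemble import HistGradientBoostingClassifier
  >>> X, y = make_classification(n_samples=1000, random_state=0)
  >>> clf = HistGradientBoostingClassifier(
  ...     max_iter=1000, early_stopping=True, validation_fraction=0.1,
  ...     n_iter_no_change=5, random_state=0).fit(X, y)
  >>> clf.n_iter_ <= 1000  # boosting may stop before max_iter iterations
  True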

Missing values support
----------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have built-in support for missing
values (NaNs).

During training, the tree grower learns at each split point whether samples
with missing values should go to the left or right child, based on the
potential gain. When predicting, samples with missing values are assigned to
the left or right child accordingly::

  >>> from sklearn.ensemble import HistGradientBoostingClassifier
  >>> import numpy as np

  >>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
  >>> y = [0, 0, 1, 1]

  >>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
  >>> gbdt.predict(X)
  array([0, 0, 1, 1])

When the missingness pattern is predictive, the splits can be done on
whether the feature value is missing or not::

  >>> X = np.array([0, np.nan, 1, 2, np.nan]).reshape(-1, 1)
  >>> y = [0, 1, 0, 0, 1]
  >>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1,
  ...                                       max_depth=2,
  ...                                       learning_rate=1,
  ...                                       max_iter=1).fit(X, y)
  >>> gbdt.predict(X)
  array([0, 1, 0, 0, 1])

If no missing values were encountered for a given feature during training,
then samples with missing values are mapped to whichever child has the most
samples.

.. _sw_hgbdt:

Sample weight support
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` support sample weights during
:term:`fit`.

The following toy example demonstrates how the model ignores the samples with
zero sample weights:

    >>> X = [[1, 0],
    ...      [1, 0],
    ...      [1, 0],
    ...      [0, 1]]
    >>> y = [0, 0, 1, 0]
    >>> # ignore the first 2 training samples by setting their weight to 0
    >>> sample_weight = [0, 0, 1, 1]
    >>> gb = HistGradientBoostingClassifier(min_samples_leaf=1)
    >>> gb.fit(X, y, sample_weight=sample_weight)
    HistGradientBoostingClassifier(...)
    >>> gb.predict([[1, 0]])
    array([1])
    >>> gb.predict_proba([[1, 0]])[0, 1]
    0.99...

As you can see, the sample `[1, 0]` is comfortably classified as `1` since the
first two samples are ignored due to their sample weights.

Implementation detail: taking sample weights into account amounts to
multiplying the gradients (and the hessians) by the sample weights. Note that
the binning stage (specifically the quantiles computation) does not take the
weights into account.

.. _categorical_support_gbdt:

Categorical Features Support
----------------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have native support for categorical
features: they can consider splits on non-ordered, categorical data.

For datasets with categorical features, using the native categorical support
is often better than relying on one-hot encoding
(:class:`~sklearn.preprocessing.OneHotEncoder`), because one-hot encoding
requires more tree depth to achieve equivalent splits. It is also usually
better to rely on the native categorical support rather than to treat
categorical features as continuous (ordinal), which happens for ordinal-encoded
categorical data, since categories are nominal quantities where order does not
matter.

To enable categorical support, a boolean mask can be passed to the
`categorical_features` parameter, indicating which features are categorical. In
the following, the first feature will be treated as categorical and the
second feature as numerical::

  >>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])

Equivalently, one can pass a list of integers indicating the indices of the
categorical features::

  >>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])

The cardinality of each categorical feature should be less than the `max_bins`
parameter, and each categorical feature is expected to be encoded in
`[0, max_bins - 1]`. To that end, it might be useful to pre-process the data
with an :class:`~sklearn.preprocessing.OrdinalEncoder` as done in
:ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py`.

If there are missing values during training, the missing values will be
treated as a proper category. If there are no missing values during training,
then at prediction time, missing values are mapped to the child node that has
the most samples (just like for continuous features). When predicting,
categories that were not seen during fit time will be treated as missing
values.

**Split finding with categorical features**: The canonical way of considering
categorical splits in a tree is to consider
all of the :math:`2^{K - 1} - 1` partitions, where :math:`K` is the number of
categories. This can quickly become prohibitive when :math:`K` is large.
Fortunately, since gradient boosting trees are always regression trees (even
for classification problems), there exists a faster strategy that can yield
equivalent splits. First, the categories of a feature are sorted according to
the variance of the target, for each category `k`. Once the categories are
sorted, one can consider *continuous partitions*, i.e. treat the categories
as if they were ordered continuous values (see Fisher [Fisher1958]_ for a
formal proof). As a result, only :math:`K - 1` splits need to be considered
instead of :math:`2^{K - 1} - 1`. The initial sorting is a
:math:`\mathcal{O}(K \log(K))` operation, leading to a total complexity of
:math:`\mathcal{O}(K \log(K) + K)`, instead of :math:`\mathcal{O}(2^K)`.

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py`

.. _monotonic_cst_gbdt:

Monotonic Constraints
---------------------

Depending on the problem at hand, you may have prior knowledge indicating
that a given feature should in general have a positive (or negative) effect
on the target value. For example, all else being equal, a higher credit
score should increase the probability of getting approved for a loan.
Monotonic constraints allow you to incorporate such prior knowledge into the
model.

A positive monotonic constraint is a constraint of the form:

:math:`x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2)`,
where :math:`F` is the predictor with two features.

Similarly, a negative monotonic constraint is of the form:

:math:`x_1 \leq x_1' \implies F(x_1, x_2) \geq F(x_1', x_2)`.

Note that monotonic constraints only constrain the output "all else being
equal". Indeed, the following relation **is not enforced** by a positive
constraint: :math:`x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2')`.

You can specify a monotonic constraint on each feature using the
`monotonic_cst` parameter. For each feature, a value of 0 indicates no
constraint, while -1 and 1 indicate a negative and positive constraint,
respectively::

  >>> from sklearn.ensemble import HistGradientBoostingRegressor

  ... # positive, negative, and no constraint on the 3 features
  >>> gbdt = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])

In a binary classification context, imposing a monotonic constraint means
that the feature is supposed to have a positive / negative effect on the
probability of belonging to the positive class. Monotonic constraints are not
supported in the multiclass setting.

.. note::
    Since categories are unordered quantities, it is not possible to enforce
    monotonic constraints on categorical features.

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_ensemble_plot_monotonic_constraints.py`

Low-level parallelism
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have implementations that use OpenMP
for parallelization through Cython. For more details on how to control the
number of threads, please refer to our :ref:`parallelism` notes.

The following parts are parallelized:

- mapping samples from real values to integer-valued bins (finding the bin
  thresholds is however sequential)
- building histograms is parallelized over features
- finding the best split point at a node is parallelized over features
- during fit, mapping samples into the left and right children is
  parallelized over samples
- gradient and hessians computations are parallelized over samples
- predicting is parallelized over samples

Why it's faster
---------------

The bottleneck of a gradient boosting procedure is building the decision
trees. Building a traditional decision tree (as in the other GBDTs
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`)
requires sorting the samples at each node (for
each feature). Sorting is needed so that the potential gain of a split point
can be computed efficiently. Splitting a single node thus has a complexity
of :math:`\mathcal{O}(n_\text{features} \times n \log(n))` where :math:`n`
is the number of samples at the node.

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the
feature values and instead use a data structure called a histogram, where the
samples are implicitly ordered. Building a histogram has a
:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a
:math:`\mathcal{O}(n_\text{features} \times n)` complexity, much smaller
than the previous one. In addition, instead of considering :math:`n` split
points, we here consider only ``max_bins`` split points, which is much
smaller.

In order to build histograms, the input data `X` needs to be binned into
integer-valued bins. This binning procedure does require sorting the feature
values, but it only happens once at the very beginning of the boosting process
(not at each node, like in :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor`).

Finally, many parts of the implementation of
:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` are parallelized.

.. topic:: References

  .. [F1999] J. Friedman, `"Stochastic Gradient Boosting"
     <https://statweb.stanford.edu/~jhf/ftp/stobst.pdf>`_, 1999
  .. [F2001] J. Friedman, "Greedy Function Approximation: A Gradient
     Boosting Machine", The Annals of Statistics, Vol. 29, No. 5, 2001.
  .. [R2007] G. Ridgeway, "Generalized Boosted Models: A guide to the gbm
     package", 2007
  .. [XGBoost] Tianqi Chen, Carlos Guestrin, :arxiv:`"XGBoost: A Scalable Tree
     Boosting System" <1603.02754>`
  .. [LightGBM] Ke et. al. `"LightGBM: A Highly Efficient Gradient
     Boosting Decision Tree" <https://papers.nips.cc/paper/
     6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree>`_
  .. [Fisher1958] Walter D. Fisher. `"On Grouping for Maximum Homogeneity"
     <http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf>`_
1220
1221.. _voting_classifier:
1222
1223Voting Classifier
1224========================
1225
1226The idea behind the :class:`VotingClassifier` is to combine
1227conceptually different machine learning classifiers and use a majority vote
1228or the average predicted probabilities (soft vote) to predict the class labels.
Such a classifier can be useful for a set of equally well performing models
in order to balance out their individual weaknesses.
1231
1232
1233Majority Class Labels (Majority/Hard Voting)
1234--------------------------------------------
1235
1236In majority voting, the predicted class label for a particular sample is
1237the class label that represents the majority (mode) of the class labels
1238predicted by each individual classifier.
1239
1240E.g., if the prediction for a given sample is
1241
1242- classifier 1 -> class 1
1243- classifier 2 -> class 1
1244- classifier 3 -> class 2
1245
1246the VotingClassifier (with ``voting='hard'``) would classify the sample
1247as "class 1" based on the majority class label.
1248
In the case of a tie, the :class:`VotingClassifier` will select the class
based on the ascending sort order of the class labels. E.g., in the
following scenario
1251
1252- classifier 1 -> class 2
1253- classifier 2 -> class 1
1254
1255the class label 1 will be assigned to the sample.
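
This rule can be sketched with a few lines of NumPy (a simplified
illustration, not the actual implementation, which also handles classifier
weights)::

   >>> import numpy as np
   >>> def hard_vote(votes):
   ...     # unique() returns the labels in ascending order, so taking the
   ...     # first maximum of the counts implements the ascending tie-break
   ...     values, counts = np.unique(votes, return_counts=True)
   ...     return int(values[counts.argmax()])
   >>> hard_vote([1, 1, 2])  # majority
   1
   >>> hard_vote([2, 1])     # tie, lowest label wins
   1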
1256
1257Usage
1258-----
1259
1260The following example shows how to fit the majority rule classifier::
1261
1262   >>> from sklearn import datasets
1263   >>> from sklearn.model_selection import cross_val_score
1264   >>> from sklearn.linear_model import LogisticRegression
1265   >>> from sklearn.naive_bayes import GaussianNB
1266   >>> from sklearn.ensemble import RandomForestClassifier
1267   >>> from sklearn.ensemble import VotingClassifier
1268
1269   >>> iris = datasets.load_iris()
1270   >>> X, y = iris.data[:, 1:3], iris.target
1271
1272   >>> clf1 = LogisticRegression(random_state=1)
1273   >>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
1274   >>> clf3 = GaussianNB()
1275
1276   >>> eclf = VotingClassifier(
1277   ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
1278   ...     voting='hard')
1279
1280   >>> for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
1281   ...     scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
1282   ...     print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
1283   Accuracy: 0.95 (+/- 0.04) [Logistic Regression]
1284   Accuracy: 0.94 (+/- 0.04) [Random Forest]
1285   Accuracy: 0.91 (+/- 0.04) [naive Bayes]
1286   Accuracy: 0.95 (+/- 0.04) [Ensemble]
1287
1288
1289Weighted Average Probabilities (Soft Voting)
1290--------------------------------------------
1291
1292In contrast to majority voting (hard voting), soft voting
1293returns the class label as argmax of the sum of predicted probabilities.
1294
1295Specific weights can be assigned to each classifier via the ``weights``
1296parameter. When weights are provided, the predicted class probabilities
1297for each classifier are collected, multiplied by the classifier weight,
and averaged. The final class label is then the one with the highest
average probability.
1300
1301To illustrate this with a simple example, let's assume we have 3
classifiers and a 3-class classification problem where we assign
1303equal weights to all classifiers: w1=1, w2=1, w3=1.
1304
1305The weighted average probabilities for a sample would then be
1306calculated as follows:
1307
================  ==========  ==========  ==========
classifier        class 1     class 2     class 3
================  ==========  ==========  ==========
classifier 1      w1 * 0.2    w1 * 0.5    w1 * 0.3
classifier 2      w2 * 0.6    w2 * 0.3    w2 * 0.1
classifier 3      w3 * 0.3    w3 * 0.4    w3 * 0.3
weighted average  0.37        0.4         0.23
================  ==========  ==========  ==========
1316
1317Here, the predicted class label is 2, since it has the
1318highest average probability.
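
The same computation can be checked with a few lines of NumPy (a small
sketch reproducing the table above; the probabilities are the made-up
values from the example)::

   >>> import numpy as np
   >>> probas = np.array([[0.2, 0.5, 0.3],   # classifier 1
   ...                    [0.6, 0.3, 0.1],   # classifier 2
   ...                    [0.3, 0.4, 0.3]])  # classifier 3
   >>> weights = [1, 1, 1]
   >>> avg = np.average(probas, axis=0, weights=weights)
   >>> avg.round(2)
   array([0.37, 0.4 , 0.23])
   >>> int(avg.argmax()) + 1  # class labels are 1-based in this example
   2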
1319
The following example illustrates how the decision regions may change
when a soft :class:`VotingClassifier` is built from a Support Vector
Machine, a Decision Tree, and a K-nearest neighbors classifier::
1323
1324   >>> from sklearn import datasets
1325   >>> from sklearn.tree import DecisionTreeClassifier
1326   >>> from sklearn.neighbors import KNeighborsClassifier
1327   >>> from sklearn.svm import SVC
1328   >>> from itertools import product
1329   >>> from sklearn.ensemble import VotingClassifier
1330
1331   >>> # Loading some example data
1332   >>> iris = datasets.load_iris()
1333   >>> X = iris.data[:, [0, 2]]
1334   >>> y = iris.target
1335
1336   >>> # Training classifiers
1337   >>> clf1 = DecisionTreeClassifier(max_depth=4)
1338   >>> clf2 = KNeighborsClassifier(n_neighbors=7)
1339   >>> clf3 = SVC(kernel='rbf', probability=True)
1340   >>> eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)],
1341   ...                         voting='soft', weights=[2, 1, 2])
1342
1343   >>> clf1 = clf1.fit(X, y)
1344   >>> clf2 = clf2.fit(X, y)
1345   >>> clf3 = clf3.fit(X, y)
1346   >>> eclf = eclf.fit(X, y)
1347
1348.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_voting_decision_regions_001.png
1349    :target: ../auto_examples/ensemble/plot_voting_decision_regions.html
1350    :align: center
1351    :scale: 75%
1352
1353Using the `VotingClassifier` with `GridSearchCV`
1354------------------------------------------------
1355
1356The :class:`VotingClassifier` can also be used together with
1357:class:`~sklearn.model_selection.GridSearchCV` in order to tune the
1358hyperparameters of the individual estimators::
1359
1360   >>> from sklearn.model_selection import GridSearchCV
1361   >>> clf1 = LogisticRegression(random_state=1)
1362   >>> clf2 = RandomForestClassifier(random_state=1)
1363   >>> clf3 = GaussianNB()
1364   >>> eclf = VotingClassifier(
1365   ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
1366   ...     voting='soft'
1367   ... )
1368
1369   >>> params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}
1370
1371   >>> grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
1372   >>> grid = grid.fit(iris.data, iris.target)
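
Note that the hyperparameters of the underlying estimators are addressed
with the ``<estimator name>__<parameter name>`` syntax, as in ``lr__C``
above. After the search, the winning combination can be inspected as usual
(a sketch; the selected values depend on the data and the grid)::

   >>> grid.best_params_  # doctest: +SKIP
   {'lr__C': ..., 'rf__n_estimators': ...}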
1373
1374Usage
1375-----
1376
In order to predict the class labels based on the predicted
class probabilities, use ``voting='soft'`` (every estimator in the
:class:`VotingClassifier` must then support the ``predict_proba`` method)::
1380
1381   >>> eclf = VotingClassifier(
1382   ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
1383   ...     voting='soft'
1384   ... )
1385
1386Optionally, weights can be provided for the individual classifiers::
1387
1388   >>> eclf = VotingClassifier(
1389   ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
1390   ...     voting='soft', weights=[2,5,1]
1391   ... )
1392
1393.. _voting_regressor:
1394
1395Voting Regressor
1396================
1397
1398The idea behind the :class:`VotingRegressor` is to combine conceptually
1399different machine learning regressors and return the average predicted values.
1400Such a regressor can be useful for a set of equally well performing models
1401in order to balance out their individual weaknesses.
1402
1403Usage
1404-----
1405
1406The following example shows how to fit the VotingRegressor::
1407
1408   >>> from sklearn.datasets import load_diabetes
1409   >>> from sklearn.ensemble import GradientBoostingRegressor
1410   >>> from sklearn.ensemble import RandomForestRegressor
1411   >>> from sklearn.linear_model import LinearRegression
1412   >>> from sklearn.ensemble import VotingRegressor
1413
1414   >>> # Loading some example data
1415   >>> X, y = load_diabetes(return_X_y=True)
1416
   >>> # Training regressors
1418   >>> reg1 = GradientBoostingRegressor(random_state=1)
1419   >>> reg2 = RandomForestRegressor(random_state=1)
1420   >>> reg3 = LinearRegression()
1421   >>> ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])
1422   >>> ereg = ereg.fit(X, y)
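
As for the :class:`VotingClassifier`, the contribution of each regressor can
be adjusted through the ``weights`` parameter (a sketch; the weights below
are arbitrary)::

   >>> ereg_weighted = VotingRegressor(
   ...     estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)],
   ...     weights=[2, 1, 1])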
1423
1424.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_voting_regressor_001.png
1425    :target: ../auto_examples/ensemble/plot_voting_regressor.html
1426    :align: center
1427    :scale: 75%
1428
1429.. topic:: Examples:
1430
1431  * :ref:`sphx_glr_auto_examples_ensemble_plot_voting_regressor.py`
1432
1433.. _stacking:
1434
1435Stacked generalization
1436======================
1437
1438Stacked generalization is a method for combining estimators to reduce their
1439biases [W1992]_ [HTF]_. More precisely, the predictions of each individual
1440estimator are stacked together and used as input to a final estimator to
1441compute the prediction. This final estimator is trained through
1442cross-validation.
1443
1444The :class:`StackingClassifier` and :class:`StackingRegressor` provide such
1445strategies which can be applied to classification and regression problems.
1446
The `estimators` parameter corresponds to the list of the estimators which
are stacked together in parallel on the input data. It should be given as a
list of `(name, estimator)` tuples::
1450
1451  >>> from sklearn.linear_model import RidgeCV, LassoCV
1452  >>> from sklearn.neighbors import KNeighborsRegressor
1453  >>> estimators = [('ridge', RidgeCV()),
1454  ...               ('lasso', LassoCV(random_state=42)),
1455  ...               ('knr', KNeighborsRegressor(n_neighbors=20,
1456  ...                                           metric='euclidean'))]
1457
1458The `final_estimator` will use the predictions of the `estimators` as input. It
1459needs to be a classifier or a regressor when using :class:`StackingClassifier`
1460or :class:`StackingRegressor`, respectively::
1461
1462  >>> from sklearn.ensemble import GradientBoostingRegressor
1463  >>> from sklearn.ensemble import StackingRegressor
1464  >>> final_estimator = GradientBoostingRegressor(
1465  ...     n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
1466  ...     random_state=42)
1467  >>> reg = StackingRegressor(
1468  ...     estimators=estimators,
1469  ...     final_estimator=final_estimator)
1470
1471To train the `estimators` and `final_estimator`, the `fit` method needs
1472to be called on the training data::
1473
1474  >>> from sklearn.datasets import load_diabetes
1475  >>> X, y = load_diabetes(return_X_y=True)
1476  >>> from sklearn.model_selection import train_test_split
1477  >>> X_train, X_test, y_train, y_test = train_test_split(X, y,
1478  ...                                                     random_state=42)
1479  >>> reg.fit(X_train, y_train)
1480  StackingRegressor(...)
1481
During training, the `estimators` are fitted on the whole training data
`X_train`. They will be used when calling `predict` or `predict_proba`. To
generalize and avoid over-fitting, the `final_estimator` is trained on
out-of-sample predictions, obtained internally with
:func:`sklearn.model_selection.cross_val_predict`.
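
The cross-validation splitting strategy used for these out-of-sample
predictions can be set with the `cv` parameter of the stacker (a sketch
reusing the estimators defined above; `cv=5` is an arbitrary choice)::

  >>> reg_cv5 = StackingRegressor(
  ...     estimators=estimators,
  ...     final_estimator=final_estimator,
  ...     cv=5)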
1486
For :class:`StackingClassifier`, note that the output of the ``estimators``
is controlled by the parameter `stack_method`, which specifies the method
called on each estimator. This parameter is either a string, naming an
estimator method, or `'auto'`, which automatically picks an available method
for each estimator, tested in the following order of preference:
`predict_proba`, `decision_function` and `predict`.
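
For instance, the following (hypothetical) configuration forces every
estimator to contribute its probability estimates::

  >>> from sklearn.ensemble import RandomForestClassifier, StackingClassifier
  >>> from sklearn.linear_model import LogisticRegression
  >>> clf = StackingClassifier(
  ...     estimators=[('lr', LogisticRegression(random_state=42)),
  ...                 ('rf', RandomForestClassifier(random_state=42))],
  ...     final_estimator=LogisticRegression(),
  ...     stack_method='predict_proba')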
1493
A :class:`StackingRegressor` or :class:`StackingClassifier` can be used as
any other regressor or classifier, exposing a `predict` method, as well as
`predict_proba` and `decision_function` methods for the classifier, e.g.::
1497
1498   >>> y_pred = reg.predict(X_test)
1499   >>> from sklearn.metrics import r2_score
1500   >>> print('R2 score: {:.2f}'.format(r2_score(y_test, y_pred)))
1501   R2 score: 0.53
1502
1503Note that it is also possible to get the output of the stacked
1504`estimators` using the `transform` method::
1505
1506  >>> reg.transform(X_test[:5])
1507  array([[142..., 138..., 146...],
1508         [179..., 182..., 151...],
1509         [139..., 132..., 158...],
1510         [286..., 292..., 225...],
1511         [126..., 124..., 164...]])
1512
In practice, a stacking predictor usually predicts as well as the best
predictor of the base layer, and sometimes even outperforms it by combining
the different strengths of these predictors. However, training a stacking
predictor is computationally expensive.
1517
1518.. note::
   For :class:`StackingClassifier`, when `stack_method_` resolves to
   `'predict_proba'`, the first column is dropped for binary classification
   problems, since the two probability columns predicted by each estimator
   are perfectly collinear.
1523
1524.. note::
1525   Multiple stacking layers can be achieved by assigning `final_estimator` to
1526   a :class:`StackingClassifier` or :class:`StackingRegressor`::
1527
    >>> final_layer_rfr = RandomForestRegressor(
    ...     n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
    >>> final_layer_gbr = GradientBoostingRegressor(
    ...     n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
1532    >>> final_layer = StackingRegressor(
1533    ...     estimators=[('rf', final_layer_rfr),
1534    ...                 ('gbrt', final_layer_gbr)],
1535    ...     final_estimator=RidgeCV()
1536    ...     )
1537    >>> multi_layer_regressor = StackingRegressor(
1538    ...     estimators=[('ridge', RidgeCV()),
1539    ...                 ('lasso', LassoCV(random_state=42)),
1540    ...                 ('knr', KNeighborsRegressor(n_neighbors=20,
1541    ...                                             metric='euclidean'))],
1542    ...     final_estimator=final_layer
1543    ... )
1544    >>> multi_layer_regressor.fit(X_train, y_train)
1545    StackingRegressor(...)
1546    >>> print('R2 score: {:.2f}'
1547    ...       .format(multi_layer_regressor.score(X_test, y_test)))
1548    R2 score: 0.53
1549
1550.. topic:: References
1551
1552   .. [W1992] Wolpert, David H. "Stacked generalization." Neural networks 5.2
1553      (1992): 241-259.
1554