.. _ensemble:

================
Ensemble methods
================

.. currentmodule:: sklearn.ensemble

The goal of **ensemble methods** is to combine the predictions of several
base estimators built with a given learning algorithm in order to improve
generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

- In **averaging methods**, the driving principle is to build several
  estimators independently and then to average their predictions. On average,
  the combined estimator is usually better than any of the single base
  estimators because its variance is reduced.

  **Examples:** :ref:`Bagging methods <bagging>`, :ref:`Forests of randomized trees <forest>`, ...

- By contrast, in **boosting methods**, base estimators are built sequentially
  and one tries to reduce the bias of the combined estimator. The motivation is
  to combine several weak models to produce a powerful ensemble.

  **Examples:** :ref:`AdaBoost <adaboost>`, :ref:`Gradient Tree Boosting <gradient_boosting>`, ...


.. _bagging:

Bagging meta-estimator
======================

In ensemble algorithms, bagging methods form a class of algorithms which build
several instances of a black-box estimator on random subsets of the original
training set and then aggregate their individual predictions to form a final
prediction. These methods are used as a way to reduce the variance of a base
estimator (e.g., a decision tree), by introducing randomization into its
construction procedure and then making an ensemble out of it. In many cases,
bagging methods constitute a very simple way to improve with respect to a
single model, without making it necessary to adapt the underlying base
algorithm. As they provide a way to reduce overfitting, bagging methods work
best with strong and complex models (e.g., fully developed decision trees), in
contrast with boosting methods which usually work best with weak models (e.g.,
shallow decision trees).

Bagging methods come in many flavours but mostly differ from each other by the
way they draw random subsets of the training set:

* When random subsets of the dataset are drawn as random subsets of the
  samples, then this algorithm is known as Pasting [B1999]_.

* When samples are drawn with replacement, then the method is known as
  Bagging [B1996]_.

* When random subsets of the dataset are drawn as random subsets of
  the features, then the method is known as Random Subspaces [H1998]_.

* Finally, when base estimators are built on subsets of both samples and
  features, then the method is known as Random Patches [LG2012]_.

In scikit-learn, bagging methods are offered as a unified
:class:`BaggingClassifier` meta-estimator (resp. :class:`BaggingRegressor`),
taking as input a user-specified base estimator along with parameters
specifying the strategy to draw random subsets. In particular, ``max_samples``
and ``max_features`` control the size of the subsets (in terms of samples and
features), while ``bootstrap`` and ``bootstrap_features`` control whether
samples and features are drawn with or without replacement. When using a subset
of the available samples, the generalization accuracy can be estimated with the
out-of-bag samples by setting ``oob_score=True``.
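
For instance, a minimal sketch (the dataset and settings below are purely
illustrative) of reading such an out-of-bag estimate::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import BaggingClassifier
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> X, y = make_classification(n_samples=1000, random_state=0)
    >>> bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
    ...                         oob_score=True, random_state=0).fit(X, y)
    >>> 0.0 <= bag.oob_score_ <= 1.0  # OOB estimate of the accuracy
    True
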

The snippet below further illustrates how to instantiate a bagging ensemble of
:class:`KNeighborsClassifier` base estimators, each built on random subsets of
50% of the samples and 50% of the features::

    >>> from sklearn.ensemble import BaggingClassifier
    >>> from sklearn.neighbors import KNeighborsClassifier
    >>> bagging = BaggingClassifier(KNeighborsClassifier(),
    ...                             max_samples=0.5, max_features=0.5)

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_bias_variance.py`

.. topic:: References

    .. [B1999] L. Breiman, "Pasting small votes for classification in large
       databases and on-line", Machine Learning, 36(1), 85-103, 1999.

    .. [B1996] L. Breiman, "Bagging predictors", Machine Learning, 24(2),
       123-140, 1996.

    .. [H1998] T. Ho, "The random subspace method for constructing decision
       forests", Pattern Analysis and Machine Intelligence, 20(8), 832-844,
       1998.

    .. [LG2012] G. Louppe and P. Geurts, "Ensembles on Random Patches",
       Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.

.. _forest:

Forests of randomized trees
===========================

The :mod:`sklearn.ensemble` module includes two averaging algorithms based
on randomized :ref:`decision trees <tree>`: the RandomForest algorithm
and the Extra-Trees method. Both algorithms are perturb-and-combine
techniques [B1998]_ specifically designed for trees. This means a diverse
set of classifiers is created by introducing randomness in the classifier
construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

Like other classifiers, forest classifiers have to be fitted with two
arrays: a sparse or dense array X of shape ``(n_samples, n_features)``
holding the training samples, and an array Y of shape ``(n_samples,)``
holding the target values (class labels) for the training samples::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X = [[0, 0], [1, 1]]
    >>> Y = [0, 1]
    >>> clf = RandomForestClassifier(n_estimators=10)
    >>> clf = clf.fit(X, Y)

Like :ref:`decision trees <tree>`, forests of trees also extend to
:ref:`multi-output problems <tree_multioutput>` (if Y is an array
of shape ``(n_samples, n_outputs)``).

Random Forests
--------------

In random forests (see :class:`RandomForestClassifier` and
:class:`RandomForestRegressor` classes), each tree in the ensemble is built
from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set.

Furthermore, when splitting each node during the construction of a tree, the
best split is found either from all input features or a random subset of size
``max_features``. (See the :ref:`parameter tuning guidelines
<random_forest_parameters>` for more details.)

The purpose of these two sources of randomness is to decrease the variance of
the forest estimator. Indeed, individual decision trees typically exhibit high
variance and tend to overfit. The injected randomness in forests yields decision
trees with somewhat decoupled prediction errors. By taking an average of those
predictions, some errors can cancel out. Random forests achieve a reduced
variance by combining diverse trees, sometimes at the cost of a slight increase
in bias. In practice the variance reduction is often significant, hence yielding
an overall better model.

In contrast to the original publication [B2001]_, the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single class.

Extremely Randomized Trees
--------------------------

In extremely randomized trees (see :class:`ExtraTreesClassifier`
and :class:`ExtraTreesRegressor` classes), randomness goes one step
further in the way splits are computed. As in random forests, a random
subset of candidate features is used, but instead of looking for the
most discriminative thresholds, thresholds are drawn at random for each
candidate feature and the best of these randomly-generated thresholds is
picked as the splitting rule. This usually allows the variance of the model
to be reduced a bit more, at the expense of a slightly greater increase
in bias::

    >>> from sklearn.model_selection import cross_val_score
    >>> from sklearn.datasets import make_blobs
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.ensemble import ExtraTreesClassifier
    >>> from sklearn.tree import DecisionTreeClassifier

    >>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
    ...     random_state=0)

    >>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
    ...     random_state=0)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean()
    0.98...

    >>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    ...     min_samples_split=2, random_state=0)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean()
    0.999...

    >>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
    ...     min_samples_split=2, random_state=0)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean() > 0.999
    True

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_forest_iris_001.png
    :target: ../auto_examples/ensemble/plot_forest_iris.html
    :align: center
    :scale: 75%

.. _random_forest_parameters:

Parameters
----------

The main parameters to adjust when using these methods are ``n_estimators`` and
``max_features``. The former is the number of trees in the forest. The larger
the better, but also the longer it will take to compute. In addition, note that
results will stop getting significantly better beyond a critical number of
trees. The latter is the size of the random subsets of features to consider
when splitting a node. The lower the greater the reduction of variance, but
also the greater the increase in bias. Good empirical default values are
``max_features=None`` (always considering all features instead of a random
subset) for regression problems, and ``max_features="sqrt"`` (using a random
subset of size ``sqrt(n_features)``) for classification tasks (where
``n_features`` is the number of features in the data). Good results are often
achieved when setting ``max_depth=None`` in combination with
``min_samples_split=2`` (i.e., when fully developing the trees). Bear in mind
though that these values are usually not optimal, and might result in models
that consume a lot of RAM. The best parameter values should always be
cross-validated. In addition, note that in random forests, bootstrap samples
are used by default (``bootstrap=True``) while the default strategy for
extra-trees is to use the whole dataset (``bootstrap=False``).
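
As a minimal sketch of these guidelines (the number of trees below is purely
illustrative, not a recommendation)::

    >>> from sklearn.ensemble import RandomForestRegressor, ExtraTreesClassifier
    >>> # regression: consider all features at each split
    >>> reg = RandomForestRegressor(n_estimators=200, max_features=None)
    >>> # classification: a random subset of sqrt(n_features) features at each split
    >>> clf = ExtraTreesClassifier(n_estimators=200, max_features="sqrt",
    ...                            max_depth=None, min_samples_split=2)
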

When using bootstrap sampling, the generalization accuracy can be estimated on
the left-out or out-of-bag samples. This can be enabled by setting
``oob_score=True``.

.. note::

    The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
    where :math:`M` is the number of trees and :math:`N` is the number of samples.
    In order to reduce the size of the model, you can change these parameters:
    ``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and ``min_samples_leaf``.

Parallelization
---------------

Finally, this module also features the parallel construction of the trees
and the parallel computation of the predictions through the ``n_jobs``
parameter. If ``n_jobs=k`` then computations are partitioned into
``k`` jobs, and run on ``k`` cores of the machine. If ``n_jobs=-1``
then all cores available on the machine are used. Note that because of
inter-process communication overhead, the speedup might not be linear
(i.e., using ``k`` jobs will unfortunately not be ``k`` times as
fast). Significant speedup can still be achieved though when building
a large number of trees, or when building a single tree requires a fair
amount of time (e.g., on large datasets).

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_iris.py`
    * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_importances_faces.py`
    * :ref:`sphx_glr_auto_examples_miscellaneous_plot_multioutput_face_completion.py`

.. topic:: References

    .. [B2001] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

    .. [B1998] L. Breiman, "Arcing Classifiers", Annals of Statistics, 1998.

    * P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized
      trees", Machine Learning, 63(1), 3-42, 2006.

.. _random_forest_feature_importance:

Feature importance evaluation
-----------------------------

The relative rank (i.e. depth) of a feature used as a decision node in a
tree can be used to assess the relative importance of that feature with
respect to the predictability of the target variable. Features used at
the top of the tree contribute to the final prediction decision of a
larger fraction of the input samples. The **expected fraction of the
samples** they contribute to can thus be used as an estimate of the
**relative importance of the features**. In scikit-learn, the fraction of
samples a feature contributes to is combined with the decrease in impurity
from splitting them to create a normalized estimate of the predictive power
of that feature.

By **averaging** the estimates of predictive ability over several randomized
trees one can **reduce the variance** of such an estimate and use it
for feature selection. This is known as the mean decrease in impurity, or MDI.
Refer to [L2014]_ for more information on MDI and feature importance
evaluation with Random Forests.

.. warning::

    The impurity-based feature importances computed on tree-based models suffer
    from two flaws that can lead to misleading conclusions. First, they are
    computed on statistics derived from the training dataset and therefore **do
    not necessarily inform us on which features are most important to make good
    predictions on a held-out dataset**. Secondly, **they favor high cardinality
    features**, that is, features with many unique values.
    :ref:`permutation_importance` is an alternative to impurity-based feature
    importance that does not suffer from these flaws. These two methods of
    obtaining feature importance are explored in:
    :ref:`sphx_glr_auto_examples_inspection_plot_permutation_importance.py`.
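
As a minimal sketch of obtaining both kinds of estimates (the dataset below is
synthetic and purely illustrative)::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.inspection import permutation_importance
    >>> X, y = make_classification(n_samples=1000, n_features=10,
    ...                            n_informative=3, random_state=0)
    >>> forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    >>> mdi = forest.feature_importances_  # impurity-based (MDI), from training data
    >>> result = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
    >>> mdi.shape, result.importances_mean.shape
    ((10,), (10,))
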

The following example shows a color-coded representation of the relative
importances of each individual pixel for a face recognition task using
an :class:`ExtraTreesClassifier` model.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_forest_importances_faces_001.png
    :target: ../auto_examples/ensemble/plot_forest_importances_faces.html
    :align: center
    :scale: 75

In practice those estimates are stored as an attribute named
``feature_importances_`` on the fitted model. This is an array with shape
``(n_features,)`` whose values are positive and sum to 1.0. The higher
the value, the more important is the contribution of the matching feature
to the prediction function.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_importances_faces.py`
    * :ref:`sphx_glr_auto_examples_ensemble_plot_forest_importances.py`

.. topic:: References

    .. [L2014] G. Louppe, "Understanding Random Forests: From Theory to
       Practice", PhD Thesis, U. of Liege, 2014.

.. _random_trees_embedding:

Totally Random Trees Embedding
------------------------------

:class:`RandomTreesEmbedding` implements an unsupervised transformation of the
data. Using a forest of completely random trees, :class:`RandomTreesEmbedding`
encodes the data by the indices of the leaves a data point ends up in. This
index is then encoded in a one-of-K manner, leading to a high-dimensional,
sparse binary coding.
This coding can be computed very efficiently and can then be used as a basis
for other learning tasks.
The size and sparsity of the code can be influenced by choosing the number of
trees and the maximum depth per tree. For each tree in the ensemble, the coding
contains one entry of one. The size of the coding is at most ``n_estimators * 2
** max_depth``, the maximum number of leaves in the forest.

As neighboring data points are more likely to lie within the same leaf of a
tree, the transformation performs an implicit, non-parametric density
estimation.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_random_forest_embedding.py`

    * :ref:`sphx_glr_auto_examples_manifold_plot_lle_digits.py` compares non-linear
      dimensionality reduction techniques on handwritten digits.

    * :ref:`sphx_glr_auto_examples_ensemble_plot_feature_transformation.py` compares
      supervised and unsupervised tree-based feature transformations.

.. seealso::

    :ref:`manifold` techniques can also be useful to derive non-linear
    representations of feature space, though these approaches also focus on
    dimensionality reduction.


.. _adaboost:

AdaBoost
========

The module :mod:`sklearn.ensemble` includes the popular boosting algorithm
AdaBoost, introduced in 1995 by Freund and Schapire [FS1995]_.

The core principle of AdaBoost is to fit a sequence of weak learners (i.e.,
models that are only slightly better than random guessing, such as small
decision trees) on repeatedly modified versions of the data.

The predictions
from all of them are then combined through a weighted majority vote (or sum) to
produce the final prediction. The data modifications at each so-called boosting
iteration consist of applying weights :math:`w_1`, :math:`w_2`, ..., :math:`w_N`
to each of the training samples. Initially, those weights are all set to
:math:`w_i = 1/N`, so that the first step simply trains a weak learner on the
original data. For each successive iteration, the sample weights are
individually modified and the learning algorithm is reapplied to the reweighted
data. At a given step, those training examples that were incorrectly predicted
by the boosted model induced at the previous step have their weights increased,
whereas the weights are decreased for those that were predicted correctly. As
iterations proceed, examples that are difficult to predict receive
ever-increasing influence. Each subsequent weak learner is thereby forced to
concentrate on the examples that are missed by the previous ones in the sequence
[HTF]_.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_adaboost_hastie_10_2_001.png
    :target: ../auto_examples/ensemble/plot_adaboost_hastie_10_2.html
    :align: center
    :scale: 75

AdaBoost can be used both for classification and regression problems:

- For multi-class classification, :class:`AdaBoostClassifier` implements
  AdaBoost-SAMME and AdaBoost-SAMME.R [ZZRH2009]_.

- For regression, :class:`AdaBoostRegressor` implements AdaBoost.R2 [D1997]_.

Usage
-----

The following example shows how to fit an AdaBoost classifier with 100 weak
learners::

    >>> from sklearn.model_selection import cross_val_score
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.ensemble import AdaBoostClassifier

    >>> X, y = load_iris(return_X_y=True)
    >>> clf = AdaBoostClassifier(n_estimators=100)
    >>> scores = cross_val_score(clf, X, y, cv=5)
    >>> scores.mean()
    0.9...

The number of weak learners is controlled by the parameter ``n_estimators``. The
``learning_rate`` parameter controls the contribution of the weak learners in
the final combination. By default, weak learners are decision stumps. Different
weak learners can be specified through the ``base_estimator`` parameter.
The main parameters to tune to obtain good results are ``n_estimators`` and
the complexity of the base estimators (e.g., their depth ``max_depth`` or the
minimum required number of samples to consider a split ``min_samples_split``).
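
For instance, a minimal sketch (hyper-parameter values purely illustrative) of
boosting slightly deeper trees instead of stumps; note that recent scikit-learn
releases rename ``base_estimator`` to ``estimator``::

    >>> from sklearn.ensemble import AdaBoostClassifier
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> clf = AdaBoostClassifier(
    ...     base_estimator=DecisionTreeClassifier(max_depth=2),
    ...     n_estimators=200, learning_rate=0.5)
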

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_hastie_10_2.py` compares the
      classification error of a decision stump, decision tree, and a boosted
      decision stump using AdaBoost-SAMME and AdaBoost-SAMME.R.

    * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_multiclass.py` shows the performance
      of AdaBoost-SAMME and AdaBoost-SAMME.R on a multi-class problem.

    * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_twoclass.py` shows the decision boundary
      and decision function values for a non-linearly separable two-class problem
      using AdaBoost-SAMME.

    * :ref:`sphx_glr_auto_examples_ensemble_plot_adaboost_regression.py` demonstrates regression
      with the AdaBoost.R2 algorithm.

.. topic:: References

    .. [FS1995] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of
       On-Line Learning and an Application to Boosting", 1997.

    .. [ZZRH2009] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost",
       2009.

    .. [D1997] H. Drucker, "Improving Regressors using Boosting Techniques", 1997.

    .. [HTF] T. Hastie, R. Tibshirani and J. Friedman, "Elements of
       Statistical Learning Ed. 2", Springer, 2009.


.. _gradient_boosting:

Gradient Tree Boosting
======================

`Gradient Tree Boosting <https://en.wikipedia.org/wiki/Gradient_boosting>`_
or Gradient Boosted Decision Trees (GBDT) is a generalization
of boosting to arbitrary differentiable loss functions. GBDT is an accurate
and effective off-the-shelf procedure that can be used for both regression and
classification problems in a variety of areas including Web search ranking and
ecology.

The module :mod:`sklearn.ensemble` provides methods
for both classification and regression via gradient boosted decision
trees.

.. note::

    Scikit-learn 0.21 introduced two new implementations of
    gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
    and :class:`HistGradientBoostingRegressor`, inspired by
    `LightGBM <https://github.com/Microsoft/LightGBM>`__ (see [LightGBM]_).

    These histogram-based estimators can be **orders of magnitude faster**
    than :class:`GradientBoostingClassifier` and
    :class:`GradientBoostingRegressor` when the number of samples is larger
    than tens of thousands of samples.

    They also have built-in support for missing values, which avoids the need
    for an imputer.

    These estimators are described in more detail below in
    :ref:`histogram_based_gradient_boosting`.

    The following guide focuses on :class:`GradientBoostingClassifier` and
    :class:`GradientBoostingRegressor`, which might be preferred for small
    sample sizes since binning may lead to split points that are too approximate
    in this setting.


The usage and the parameters of :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` are described below. The two most important
parameters of these estimators are `n_estimators` and `learning_rate`.

Classification
--------------

:class:`GradientBoostingClassifier` supports both binary and multi-class
classification.
The following example shows how to fit a gradient boosting classifier
with 100 decision stumps as weak learners::

    >>> from sklearn.datasets import make_hastie_10_2
    >>> from sklearn.ensemble import GradientBoostingClassifier

    >>> X, y = make_hastie_10_2(random_state=0)
    >>> X_train, X_test = X[:2000], X[2000:]
    >>> y_train, y_test = y[:2000], y[2000:]

    >>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    ...     max_depth=1, random_state=0).fit(X_train, y_train)
    >>> clf.score(X_test, y_test)
    0.913...

The number of weak learners (i.e. regression trees) is controlled by the
parameter ``n_estimators``; :ref:`the size of each tree
<gradient_boosting_tree_size>` can be controlled either by setting the tree
depth via ``max_depth`` or by setting the number of leaf nodes via
``max_leaf_nodes``. The ``learning_rate`` is a hyper-parameter in the range
(0.0, 1.0] that controls overfitting via :ref:`shrinkage
<gradient_boosting_shrinkage>`.
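
For instance, a minimal sketch (hyper-parameter values purely illustrative) of
an equivalent way of sizing the trees via ``max_leaf_nodes`` instead of
``max_depth``::

    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
    ...                                  max_leaf_nodes=4, random_state=0)
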

.. note::

    Classification with more than 2 classes requires the induction
    of ``n_classes`` regression trees at each iteration,
    thus, the total number of induced trees equals
    ``n_classes * n_estimators``. For datasets with a large number
    of classes we strongly recommend using
    :class:`HistGradientBoostingClassifier` as an alternative to
    :class:`GradientBoostingClassifier`.

Regression
----------

:class:`GradientBoostingRegressor` supports a number of
:ref:`different loss functions <gradient_boosting_loss>`
for regression which can be specified via the argument
``loss``; the default loss function for regression is squared error
(``'squared_error'``).

::

    >>> import numpy as np
    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.datasets import make_friedman1
    >>> from sklearn.ensemble import GradientBoostingRegressor

    >>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
    >>> X_train, X_test = X[:200], X[200:]
    >>> y_train, y_test = y[:200], y[200:]
    >>> est = GradientBoostingRegressor(
    ...     n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0,
    ...     loss='squared_error'
    ... ).fit(X_train, y_train)
    >>> mean_squared_error(y_test, est.predict(X_test))
    5.00...

The figure below shows the results of applying :class:`GradientBoostingRegressor`
with least squares loss and 500 base learners to the diabetes dataset
(:func:`sklearn.datasets.load_diabetes`).
The plot on the left shows the train and test error at each iteration.
The train error at each iteration is stored in the
:attr:`~GradientBoostingRegressor.train_score_` attribute
of the gradient boosting model. The test error at each iteration can be obtained
via the :meth:`~GradientBoostingRegressor.staged_predict` method which returns a
generator that yields the predictions at each stage. Plots like these can be used
to determine the optimal number of trees (i.e. ``n_estimators``) by early stopping.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_gradient_boosting_regression_001.png
    :target: ../auto_examples/ensemble/plot_gradient_boosting_regression.html
    :align: center
    :scale: 75

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`
    * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_oob.py`

.. _gradient_boosting_warm_start:

Fitting additional weak-learners
--------------------------------

Both :class:`GradientBoostingRegressor` and :class:`GradientBoostingClassifier`
support ``warm_start=True``, which allows you to add more estimators to an already
fitted model.

::

    >>> _ = est.set_params(n_estimators=200, warm_start=True)  # set warm_start and increase the number of trees
    >>> _ = est.fit(X_train, y_train)  # fit additional 100 trees to est
    >>> mean_squared_error(y_test, est.predict(X_test))
    3.84...
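
The per-iteration test error mentioned above can be computed directly with
:meth:`~GradientBoostingRegressor.staged_predict`; a minimal sketch, reusing
``est``, ``X_test`` and ``y_test`` from the snippets above::

    >>> test_errors = [mean_squared_error(y_test, y_pred)
    ...                for y_pred in est.staged_predict(X_test)]
    >>> len(test_errors) == est.n_estimators_  # one error value per boosting stage
    True
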

.. _gradient_boosting_tree_size:

Controlling the tree size
-------------------------

The size of the regression tree base learners defines the level of variable
interactions that can be captured by the gradient boosting model. In general,
a tree of depth ``h`` can capture interactions of order ``h``.
There are two ways in which the size of the individual regression trees can
be controlled.

If you specify ``max_depth=h`` then complete binary trees
of depth ``h`` will be grown. Such trees will have (at most) ``2**h`` leaf nodes
and ``2**h - 1`` split nodes.

Alternatively, you can control the tree size by specifying the number of
leaf nodes via the parameter ``max_leaf_nodes``. In this case,
trees will be grown using best-first search where nodes with the highest improvement
in impurity will be expanded first.
A tree with ``max_leaf_nodes=k`` has ``k - 1`` split nodes and thus can
model interactions of up to order ``max_leaf_nodes - 1``.

We found that ``max_leaf_nodes=k`` gives comparable results to ``max_depth=k-1``
but is significantly faster to train at the expense of a slightly higher
training error.
The parameter ``max_leaf_nodes`` corresponds to the variable ``J`` in the
chapter on gradient boosting in [F2001]_ and is related to the parameter
``interaction.depth`` in R's gbm package where ``max_leaf_nodes == interaction.depth + 1``.

Mathematical formulation
------------------------

We first present GBRT for regression, and then detail the classification
case.

Regression
^^^^^^^^^^

GBRT regressors are additive models whose prediction :math:`\hat{y}_i` for a
given input :math:`x_i` is of the following form:

.. math::

    \hat{y}_i = F_M(x_i) = \sum_{m=1}^{M} h_m(x_i)

where the :math:`h_m` are estimators called *weak learners* in the context
of boosting. Gradient Tree Boosting uses :ref:`decision tree regressors
<tree>` of fixed size as weak learners. The constant :math:`M` corresponds to the
`n_estimators` parameter.

Similar to other boosting algorithms, a GBRT is built in a greedy fashion:

.. math::

    F_m(x) = F_{m-1}(x) + h_m(x),

where the newly added tree :math:`h_m` is fitted in order to minimize a sum
of losses :math:`L_m`, given the previous ensemble :math:`F_{m-1}`:

.. math::

    h_m = \arg\min_{h} L_m = \arg\min_{h} \sum_{i=1}^{n}
    l(y_i, F_{m-1}(x_i) + h(x_i)),

where :math:`l(y_i, F(x_i))` is defined by the `loss` parameter, detailed
in the next section.

By default, the initial model :math:`F_{0}` is chosen as the constant that
minimizes the loss: for a least-squares loss, this is the empirical mean of
the target values. The initial model can also be specified via the ``init``
argument.

Using a first-order Taylor approximation, the value of :math:`l` can be
approximated as follows:

.. math::

    l(y_i, F_{m-1}(x_i) + h_m(x_i)) \approx
    l(y_i, F_{m-1}(x_i))
    + h_m(x_i)
    \left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m - 1}}.

.. note::

    Briefly, a first-order Taylor approximation says that
    :math:`l(z) \approx l(a) + (z - a) \frac{\partial l(a)}{\partial a}`.
    Here, :math:`z` corresponds to :math:`F_{m - 1}(x_i) + h_m(x_i)`, and
    :math:`a` corresponds to :math:`F_{m-1}(x_i)`.

The quantity :math:`\left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)}
\right]_{F=F_{m - 1}}` is the derivative of the loss with respect to its
second parameter, evaluated at :math:`F_{m-1}(x)`. It is easy to compute for
any given :math:`F_{m - 1}(x_i)` in a closed form since the loss is
differentiable. We will denote it by :math:`g_i`.

Removing the constant terms, we have:

.. math::

    h_m \approx \arg\min_{h} \sum_{i=1}^{n} h(x_i) g_i

This is minimized if :math:`h(x_i)` is fitted to predict a value that is
proportional to the negative gradient :math:`-g_i`. Therefore, at each
iteration, **the estimator** :math:`h_m` **is fitted to predict the negative
gradients of the samples**. The gradients are updated at each iteration.
This can be considered as some kind of gradient descent in a functional
space.

.. note::

    For some losses, e.g. the least absolute deviation (LAD) where the gradients
    are :math:`\pm 1`, the values predicted by a fitted :math:`h_m` are not
    accurate enough: the tree can only output integer values. As a result, the
    leaf values of the tree :math:`h_m` are modified once the tree is
    fitted, such that the leaf values minimize the loss :math:`L_m`. The
    update is loss-dependent: for the LAD loss, the value of a leaf is updated
    to the median of the samples in that leaf.

Classification
^^^^^^^^^^^^^^

Gradient boosting for classification is very similar to the regression case.
However, the sum of the trees :math:`F_M(x_i) = \sum_m h_m(x_i)` is not
homogeneous to a prediction: it cannot be a class, since the trees predict
continuous values.

The mapping from the value :math:`F_M(x_i)` to a class or a probability is
loss-dependent. For the deviance (or log-loss), the probability that
:math:`x_i` belongs to the positive class is modeled as :math:`p(y_i = 1 |
x_i) = \sigma(F_M(x_i))` where :math:`\sigma` is the sigmoid function.

For multiclass classification, K trees (for K classes) are built at each of
the :math:`M` iterations. The probability that :math:`x_i` belongs to class
k is modeled as a softmax of the :math:`F_{M,k}(x_i)` values.

Note that even for a classification task, the :math:`h_m` sub-estimator is
still a regressor, not a classifier. This is because the sub-estimators are
trained to predict (negative) *gradients*, which are always continuous
quantities.

.. _gradient_boosting_loss:

Loss Functions
--------------

The following loss functions are supported and can be specified using
the parameter ``loss``:

* Regression

  * Squared error (``'squared_error'``): The natural choice for regression
    due to its superior computational properties. The initial model is
    given by the mean of the target values.
  * Least absolute deviation (``'lad'``): A robust loss function for
    regression. The initial model is given by the median of the
    target values.
  * Huber (``'huber'``): Another robust loss function that combines
    least squares and least absolute deviation; use ``alpha`` to
    control the sensitivity with regards to outliers (see [F2001]_ for
    more details).
  * Quantile (``'quantile'``): A loss function for quantile regression.
    Use ``0 < alpha < 1`` to specify the quantile. This loss function
    can be used to create prediction intervals
    (see :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py`).

* Classification

  * Binomial deviance (``'deviance'``): The binomial
    negative log-likelihood loss function for binary classification (provides
    probability estimates). The initial model is given by the
    log odds-ratio.

  * Multinomial deviance (``'deviance'``): The multinomial
    negative log-likelihood loss function for multi-class classification with
    ``n_classes`` mutually exclusive classes. It provides
    probability estimates. The initial model is given by the
    prior probability of each class. At each iteration ``n_classes``
    regression trees have to be constructed, which makes GBRT rather
    inefficient for data sets with a large number of classes.
  * Exponential loss (``'exponential'``): The same loss function
    as :class:`AdaBoostClassifier`. Less robust to mislabeled
    examples than ``'deviance'``; can only be used for binary
    classification.

.. _gradient_boosting_shrinkage:

Shrinkage via learning rate
---------------------------

[F2001]_ proposed a simple regularization strategy that scales
the contribution of each weak learner by a constant factor :math:`\nu`:

.. math::

    F_m(x) = F_{m-1}(x) + \nu h_m(x)

The parameter :math:`\nu` is also called the **learning rate** because
it scales the step length of the gradient descent procedure; it can
be set via the ``learning_rate`` parameter.

The parameter ``learning_rate`` strongly interacts with the parameter
``n_estimators``, the number of weak learners to fit. Smaller values
of ``learning_rate`` require larger numbers of weak learners to maintain
a constant training error. Empirical evidence suggests that small
values of ``learning_rate`` favor better test error. [HTF]_
recommend setting the learning rate to a small constant
(e.g. ``learning_rate <= 0.1``) and choosing ``n_estimators`` by early
stopping. For a more detailed discussion of the interaction between
``learning_rate`` and ``n_estimators`` see [R2007]_.

Subsampling
-----------

[F1999]_ proposed stochastic gradient boosting, which combines gradient
boosting with bootstrap averaging (bagging). At each iteration
the base classifier is trained on a fraction ``subsample`` of
the available training data. The subsample is drawn without replacement.
A typical value of ``subsample`` is 0.5.

The figure below illustrates the effect of shrinkage and subsampling
on the goodness-of-fit of the model. We can clearly see that shrinkage
outperforms no-shrinkage. Subsampling with shrinkage can further increase
the accuracy of the model. Subsampling without shrinkage, on the other hand,
does poorly.

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_gradient_boosting_regularization_001.png
    :target: ../auto_examples/ensemble/plot_gradient_boosting_regularization.html
    :align: center
    :scale: 75

Another strategy to reduce the variance is by subsampling the features,
analogous to the random splits in :class:`RandomForestClassifier`.
The number of subsampled features can be controlled via the ``max_features``
parameter.

.. note:: Using a small ``max_features`` value can significantly decrease the runtime.

Stochastic gradient boosting makes it possible to compute out-of-bag estimates of the
test deviance by computing the improvement in deviance on the examples that are
not included in the bootstrap sample (i.e. the out-of-bag examples).
The improvements are stored in the attribute
:attr:`~GradientBoostingRegressor.oob_improvement_`. ``oob_improvement_[i]`` holds
the improvement in terms of the loss on the OOB samples if you add the i-th stage
to the current predictions.
Out-of-bag estimates can be used for model selection, for example to determine
the optimal number of iterations. OOB estimates are usually very pessimistic, thus
we recommend using cross-validation instead, and only using OOB estimates if
cross-validation is too time consuming.
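
A minimal sketch (the dataset and settings are purely illustrative) of reading
the per-stage OOB improvements from a subsampled model::

    >>> from sklearn.datasets import make_hastie_10_2
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> X, y = make_hastie_10_2(random_state=0)
    >>> est = GradientBoostingClassifier(n_estimators=50, subsample=0.5,
    ...                                  random_state=0).fit(X, y)
    >>> est.oob_improvement_.shape  # one improvement value per stage
    (50,)
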

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regularization.py`
    * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_oob.py`
    * :ref:`sphx_glr_auto_examples_ensemble_plot_ensemble_oob.py`

Interpretation with feature importance
--------------------------------------

Individual decision trees can be interpreted easily by simply
visualizing the tree structure. Gradient boosting models, however,
comprise hundreds of regression trees, thus they cannot be easily
interpreted by visual inspection of the individual trees. Fortunately,
a number of techniques have been proposed to summarize and interpret
gradient boosting models.

Often features do not contribute equally to predict the target
response; in many situations the majority of the features are in fact
irrelevant.
When interpreting a model, the first question usually is: what are
those important features and how do they contribute in predicting
the target response?

Individual decision trees intrinsically perform feature selection by selecting
appropriate split points. This information can be used to measure the
importance of each feature; the basic idea is: the more often a
feature is used in the split points of a tree the more important that
feature is. This notion of importance can be extended to decision tree
ensembles by simply averaging the impurity-based feature importance of each tree (see
:ref:`random_forest_feature_importance` for more details).

The feature importance scores of a fitted gradient boosting model can be
accessed via the ``feature_importances_`` property::

    >>> from sklearn.datasets import make_hastie_10_2
    >>> from sklearn.ensemble import GradientBoostingClassifier

    >>> X, y = make_hastie_10_2(random_state=0)
    >>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    ...     max_depth=1, random_state=0).fit(X, y)
    >>> clf.feature_importances_
    array([0.10..., 0.10..., 0.11..., ...

Note that this computation of feature importance is based on entropy, and it
is distinct from :func:`sklearn.inspection.permutation_importance` which is
based on permutation of the features.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`

.. _histogram_based_gradient_boosting:

Histogram-Based Gradient Boosting
=================================

Scikit-learn 0.21 introduced two new implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`__ (see [LightGBM]_).

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands of samples.

They also have built-in support for missing values, which avoids the need
for an imputer.

These fast estimators first bin the input samples ``X`` into
integer-valued bins (typically 256 bins) which tremendously reduces the
number of splitting points to consider, and allows the algorithm to
leverage integer-based data structures (histograms) instead of relying on
sorted continuous values when building the trees. The API of these
estimators is slightly different, and some of the features from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
are not yet supported, for instance some loss functions.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py`

Usage
-----

Most of the parameters are unchanged from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`.
One exception is the ``max_iter`` parameter that replaces ``n_estimators``, and
controls the number of iterations of the boosting process::

    >>> from sklearn.ensemble import HistGradientBoostingClassifier
    >>> from sklearn.datasets import make_hastie_10_2

    >>> X, y = make_hastie_10_2(random_state=0)
    >>> X_train, X_test = X[:2000], X[2000:]
    >>> y_train, y_test = y[:2000], y[2000:]

    >>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
    >>> clf.score(X_test, y_test)
    0.8965

Available losses for regression are 'squared_error',
'absolute_error', which is less sensitive to outliers, and
'poisson', which is well suited to model counts and frequencies. For
classification, 'binary_crossentropy' is used for binary classification and
'categorical_crossentropy' is used for multiclass classification. By default
the loss is 'auto' and will select the appropriate loss depending on
:term:`y` passed to :term:`fit`.

The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.

The number of bins used to bin the data is controlled with the ``max_bins``
parameter. Using fewer bins acts as a form of regularization. It is
generally recommended to use as many bins as possible, which is the default.

The ``l2_regularization`` parameter is a regularizer on the loss function and
corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.

Note that **early stopping is enabled by default if the number of samples is
larger than 10,000**. The early-stopping behaviour is controlled via the
``early_stopping``, ``scoring``, ``validation_fraction``,
``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop
using an arbitrary :term:`scorer`, or just the training or validation loss.
Note that for technical reasons, using a scorer is significantly slower than
using the loss. By default, early stopping is performed if there are at least
10,000 samples in the training set, using the validation loss.

Missing values support
----------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have built-in support for missing
values (NaNs).

During training, the tree grower learns at each split point whether samples
with missing values should go to the left or right child, based on the
potential gain.

When predicting, samples with missing values are assigned to
the left or right child accordingly::

    >>> from sklearn.ensemble import HistGradientBoostingClassifier
    >>> import numpy as np

    >>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
    >>> y = [0, 0, 1, 1]

    >>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
    >>> gbdt.predict(X)
    array([0, 0, 1, 1])

When the missingness pattern is predictive, the splits can be performed on
whether the feature value is missing or not::

    >>> X = np.array([0, np.nan, 1, 2, np.nan]).reshape(-1, 1)
    >>> y = [0, 1, 0, 0, 1]
    >>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1,
    ...                                       max_depth=2,
    ...                                       learning_rate=1,
    ...                                       max_iter=1).fit(X, y)
    >>> gbdt.predict(X)
    array([0, 1, 0, 0, 1])

If no missing values were encountered for a given feature during training,
then samples with missing values are mapped to whichever child has the most
samples.

.. _sw_hgbdt:

Sample weight support
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` support sample weights during
:term:`fit`.

The following toy example demonstrates how the model ignores the samples with
zero sample weights:

    >>> X = [[1, 0],
    ...      [1, 0],
    ...      [1, 0],
    ...      [0, 1]]
    >>> y = [0, 0, 1, 0]
    >>> # ignore the first 2 training samples by setting their weight to 0
    >>> sample_weight = [0, 0, 1, 1]
    >>> gb = HistGradientBoostingClassifier(min_samples_leaf=1)
    >>> gb.fit(X, y, sample_weight=sample_weight)
    HistGradientBoostingClassifier(...)
    >>> gb.predict([[1, 0]])
    array([1])
    >>> gb.predict_proba([[1, 0]])[0, 1]
    0.99...

As you can see, the `[1, 0]` sample is comfortably classified as `1` since the
first two samples are ignored due to their sample weights.

Implementation detail: taking sample weights into account amounts to
multiplying the gradients (and the hessians) by the sample weights. Note that
the binning stage (specifically the quantiles computation) does not take the
weights into account.

.. _categorical_support_gbdt:

Categorical Features Support
----------------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have native support for categorical
features: they can consider splits on non-ordered, categorical data.

For datasets with categorical features, using the native categorical support
is often better than relying on one-hot encoding
(:class:`~sklearn.preprocessing.OneHotEncoder`), because one-hot encoding
requires more tree depth to achieve equivalent splits. It is also usually
better to rely on the native categorical support rather than to treat
categorical features as continuous (ordinal), which happens for ordinal-encoded
categorical data, since categories are nominal quantities where order does not
matter.

To enable categorical support, a boolean mask can be passed to the
`categorical_features` parameter, indicating which feature is categorical.
In 1066the following, the first feature will be treated as categorical and the 1067second feature as numerical:: 1068 1069 >>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False]) 1070 1071Equivalently, one can pass a list of integers indicating the indices of the 1072categorical features:: 1073 1074 >>> gbdt = HistGradientBoostingClassifier(categorical_features=[0]) 1075 1076The cardinality of each categorical feature should be less than the `max_bins` 1077parameter, and each categorical feature is expected to be encoded in 1078`[0, max_bins - 1]`. To that end, it might be useful to pre-process the data 1079with an :class:`~sklearn.preprocessing.OrdinalEncoder` as done in 1080:ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py`. 1081 1082If there are missing values during training, the missing values will be 1083treated as a proper category. If there are no missing values during training, 1084then at prediction time, missing values are mapped to the child node that has 1085the most samples (just like for continuous features). When predicting, 1086categories that were not seen during fit time will be treated as missing 1087values. 1088 1089**Split finding with categorical features**: The canonical way of considering 1090categorical splits in a tree is to consider 1091all of the :math:`2^{K - 1} - 1` partitions, where :math:`K` is the number of 1092categories. This can quickly become prohibitive when :math:`K` is large. 1093Fortunately, since gradient boosting trees are always regression trees (even 1094for classification problems), there exist a faster strategy that can yield 1095equivalent splits. First, the categories of a feature are sorted according to 1096the variance of the target, for each category `k`. Once the categories are 1097sorted, one can consider *continuous partitions*, i.e. treat the categories 1098as if they were ordered continuous values (see Fisher [Fisher1958]_ for a 1099formal proof). As a result, only :math:`K - 1` splits need to be considered 1100instead of :math:`2^{K - 1} - 1`. The initial sorting is a 1101:math:`\mathcal{O}(K \log(K))` operation, leading to a total complexity of 1102:math:`\mathcal{O}(K \log(K) + K)`, instead of :math:`\mathcal{O}(2^K)`. 1103 1104.. topic:: Examples: 1105 1106 * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py` 1107 1108.. _monotonic_cst_gbdt: 1109 1110Monotonic Constraints 1111--------------------- 1112 1113Depending on the problem at hand, you may have prior knowledge indicating 1114that a given feature should in general have a positive (or negative) effect 1115on the target value. For example, all else being equal, a higher credit 1116score should increase the probability of getting approved for a loan. 1117Monotonic constraints allow you to incorporate such prior knowledge into the 1118model. 1119 1120A positive monotonic constraint is a constraint of the form: 1121 1122:math:`x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2)`, 1123where :math:`F` is the predictor with two features. 1124 1125Similarly, a negative monotonic constraint is of the form: 1126 1127:math:`x_1 \leq x_1' \implies F(x_1, x_2) \geq F(x_1', x_2)`. 1128 1129Note that monotonic constraints only constraint the output "all else being 1130equal". Indeed, the following relation **is not enforced** by a positive 1131constraint: :math:`x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2')`. 

You can specify a monotonic constraint on each feature using the
`monotonic_cst` parameter. For each feature, a value of 0 indicates no
constraint, while -1 and 1 indicate a negative and positive constraint,
respectively::

    >>> from sklearn.ensemble import HistGradientBoostingRegressor

    >>> # positive, negative, and no constraint on the 3 features
    >>> gbdt = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])

In a binary classification context, imposing a monotonic constraint means
that the feature is supposed to have a positive / negative effect on the
probability to belong to the positive class. Monotonic constraints are not
supported in a multiclass context.

.. note::
    Since categories are unordered quantities, it is not possible to enforce
    monotonic constraints on categorical features.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_ensemble_plot_monotonic_constraints.py`

Low-level parallelism
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have implementations that use OpenMP
for parallelization through Cython. For more details on how to control the
number of threads, please refer to our :ref:`parallelism` notes.

The following parts are parallelized:

- mapping samples from real values to integer-valued bins (finding the bin
  thresholds is however sequential)
- building histograms is parallelized over features
- finding the best split point at a node is parallelized over features
- during fit, mapping samples into the left and right children is
  parallelized over samples
- gradient and hessians computations are parallelized over samples
- predicting is parallelized over samples

Why it's faster
---------------

The bottleneck of a gradient boosting procedure is building the decision
trees. Building a traditional decision tree (as in the other GBDTs
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`)
requires sorting the samples at each node (for
each feature). Sorting is needed so that the potential gain of a split point
can be computed efficiently. Splitting a single node has thus a complexity
of :math:`\mathcal{O}(n_\text{features} \times n \log(n))` where :math:`n`
is the number of samples at the node.

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the
feature values and instead use a data structure called a histogram, where the
samples are implicitly ordered. Building a histogram has a
:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a
:math:`\mathcal{O}(n_\text{features} \times n)` complexity, much smaller
than the previous one. In addition, instead of considering :math:`n` split
points, we here consider only ``max_bins`` split points, which is much
smaller.

In order to build histograms, the input data `X` needs to be binned into
integer-valued bins. This binning procedure does require sorting the feature
values, but it only happens once at the very beginning of the boosting process
(not at each node, like in :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor`).

Finally, many parts of the implementation of
:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` are parallelized.

.. topic:: References

    .. [F2001] J. Friedman, "Greedy Function Approximation: A Gradient
       Boosting Machine", The Annals of Statistics, 29(5), 1189-1232, 2001.

    .. [F1999] J. Friedman, `"Stochastic Gradient Boosting"
       <https://statweb.stanford.edu/~jhf/ftp/stobst.pdf>`_, 1999

    .. [R2007] G. Ridgeway, "Generalized Boosted Models: A guide to the gbm
       package", 2007

    .. [XGBoost] Tianqi Chen, Carlos Guestrin, :arxiv:`"XGBoost: A Scalable Tree
       Boosting System" <1603.02754>`

    .. [LightGBM] Ke et. al. `"LightGBM: A Highly Efficient Gradient Boosting
       Decision Tree" <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree>`_

    .. [Fisher1958] Walter D. Fisher, `"On Grouping for Maximum Homogeneity"
       <http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf>`_

.. _voting_classifier:

Voting Classifier
=================

The idea behind the :class:`VotingClassifier` is to combine
conceptually different machine learning classifiers and use a majority vote
or the average predicted probabilities (soft vote) to predict the class labels.
Such a classifier can be useful for a set of equally well performing models
in order to balance out their individual weaknesses.


Majority Class Labels (Majority/Hard Voting)
--------------------------------------------

In majority voting, the predicted class label for a particular sample is
the class label that represents the majority (mode) of the class labels
predicted by each individual classifier.

E.g., if the prediction for a given sample is

- classifier 1 -> class 1
- classifier 2 -> class 1
- classifier 3 -> class 2

the VotingClassifier (with ``voting='hard'``) would classify the sample
as "class 1" based on the majority class label.

In the case of a tie, the :class:`VotingClassifier` will select the class
based on the ascending sort order. E.g., in the following scenario

- classifier 1 -> class 2
- classifier 2 -> class 1

the class label 1 will be assigned to the sample.

Usage
-----

The following example shows how to fit the majority rule classifier::

    >>> from sklearn import datasets
    >>> from sklearn.model_selection import cross_val_score
    >>> from sklearn.linear_model import LogisticRegression
    >>> from sklearn.naive_bayes import GaussianNB
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.ensemble import VotingClassifier

    >>> iris = datasets.load_iris()
    >>> X, y = iris.data[:, 1:3], iris.target

    >>> clf1 = LogisticRegression(random_state=1)
    >>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
    >>> clf3 = GaussianNB()

    >>> eclf = VotingClassifier(
    ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    ...     voting='hard')

    >>> for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    ...     scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
    ...     print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
    Accuracy: 0.95 (+/- 0.04) [Logistic Regression]
    Accuracy: 0.94 (+/- 0.04) [Random Forest]
    Accuracy: 0.91 (+/- 0.04) [naive Bayes]
    Accuracy: 0.95 (+/- 0.04) [Ensemble]
print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label)) 1283 Accuracy: 0.95 (+/- 0.04) [Logistic Regression] 1284 Accuracy: 0.94 (+/- 0.04) [Random Forest] 1285 Accuracy: 0.91 (+/- 0.04) [naive Bayes] 1286 Accuracy: 0.95 (+/- 0.04) [Ensemble] 1287 1288 1289Weighted Average Probabilities (Soft Voting) 1290-------------------------------------------- 1291 1292In contrast to majority voting (hard voting), soft voting 1293returns the class label as argmax of the sum of predicted probabilities. 1294 1295Specific weights can be assigned to each classifier via the ``weights`` 1296parameter. When weights are provided, the predicted class probabilities 1297for each classifier are collected, multiplied by the classifier weight, 1298and averaged. The final class label is then derived from the class label 1299with the highest average probability. 1300 1301To illustrate this with a simple example, let's assume we have 3 1302classifiers and a 3-class classification problems where we assign 1303equal weights to all classifiers: w1=1, w2=1, w3=1. 1304 1305The weighted average probabilities for a sample would then be 1306calculated as follows: 1307 1308================ ========== ========== ========== 1309classifier class 1 class 2 class 3 1310================ ========== ========== ========== 1311classifier 1 w1 * 0.2 w1 * 0.5 w1 * 0.3 1312classifier 2 w2 * 0.6 w2 * 0.3 w2 * 0.1 1313classifier 3 w3 * 0.3 w3 * 0.4 w3 * 0.3 1314weighted average 0.37 0.4 0.23 1315================ ========== ========== ========== 1316 1317Here, the predicted class label is 2, since it has the 1318highest average probability. 1319 1320The following example illustrates how the decision regions may change 1321when a soft :class:`VotingClassifier` is used based on an linear Support 1322Vector Machine, a Decision Tree, and a K-nearest neighbor classifier:: 1323 1324 >>> from sklearn import datasets 1325 >>> from sklearn.tree import DecisionTreeClassifier 1326 >>> from sklearn.neighbors import KNeighborsClassifier 1327 >>> from sklearn.svm import SVC 1328 >>> from itertools import product 1329 >>> from sklearn.ensemble import VotingClassifier 1330 1331 >>> # Loading some example data 1332 >>> iris = datasets.load_iris() 1333 >>> X = iris.data[:, [0, 2]] 1334 >>> y = iris.target 1335 1336 >>> # Training classifiers 1337 >>> clf1 = DecisionTreeClassifier(max_depth=4) 1338 >>> clf2 = KNeighborsClassifier(n_neighbors=7) 1339 >>> clf3 = SVC(kernel='rbf', probability=True) 1340 >>> eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)], 1341 ... voting='soft', weights=[2, 1, 2]) 1342 1343 >>> clf1 = clf1.fit(X, y) 1344 >>> clf2 = clf2.fit(X, y) 1345 >>> clf3 = clf3.fit(X, y) 1346 >>> eclf = eclf.fit(X, y) 1347 1348.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_voting_decision_regions_001.png 1349 :target: ../auto_examples/ensemble/plot_voting_decision_regions.html 1350 :align: center 1351 :scale: 75% 1352 1353Using the `VotingClassifier` with `GridSearchCV` 1354------------------------------------------------ 1355 1356The :class:`VotingClassifier` can also be used together with 1357:class:`~sklearn.model_selection.GridSearchCV` in order to tune the 1358hyperparameters of the individual estimators:: 1359 1360 >>> from sklearn.model_selection import GridSearchCV 1361 >>> clf1 = LogisticRegression(random_state=1) 1362 >>> clf2 = RandomForestClassifier(random_state=1) 1363 >>> clf3 = GaussianNB() 1364 >>> eclf = VotingClassifier( 1365 ... 

The following example illustrates how the decision regions may change
when a soft :class:`VotingClassifier` is used based on a Support
Vector Machine with an RBF kernel, a Decision Tree, and a K-nearest
neighbors classifier::

    >>> from sklearn import datasets
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> from sklearn.neighbors import KNeighborsClassifier
    >>> from sklearn.svm import SVC
    >>> from itertools import product
    >>> from sklearn.ensemble import VotingClassifier

    >>> # Loading some example data
    >>> iris = datasets.load_iris()
    >>> X = iris.data[:, [0, 2]]
    >>> y = iris.target

    >>> # Training classifiers
    >>> clf1 = DecisionTreeClassifier(max_depth=4)
    >>> clf2 = KNeighborsClassifier(n_neighbors=7)
    >>> clf3 = SVC(kernel='rbf', probability=True)
    >>> eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)],
    ...                         voting='soft', weights=[2, 1, 2])

    >>> clf1 = clf1.fit(X, y)
    >>> clf2 = clf2.fit(X, y)
    >>> clf3 = clf3.fit(X, y)
    >>> eclf = eclf.fit(X, y)

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_voting_decision_regions_001.png
    :target: ../auto_examples/ensemble/plot_voting_decision_regions.html
    :align: center
    :scale: 75%

Using the `VotingClassifier` with `GridSearchCV`
------------------------------------------------

The :class:`VotingClassifier` can also be used together with
:class:`~sklearn.model_selection.GridSearchCV` in order to tune the
hyperparameters of the individual estimators::

    >>> from sklearn.model_selection import GridSearchCV
    >>> clf1 = LogisticRegression(random_state=1)
    >>> clf2 = RandomForestClassifier(random_state=1)
    >>> clf3 = GaussianNB()
    >>> eclf = VotingClassifier(
    ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    ...     voting='soft'
    ... )

    >>> params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}

    >>> grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
    >>> grid = grid.fit(iris.data, iris.target)

Usage
-----

To predict the class labels based on the predicted class probabilities,
set ``voting='soft'`` (every scikit-learn estimator in the VotingClassifier
must then support the ``predict_proba`` method)::

    >>> eclf = VotingClassifier(
    ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    ...     voting='soft'
    ... )

Optionally, weights can be provided for the individual classifiers::

    >>> eclf = VotingClassifier(
    ...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    ...     voting='soft', weights=[2, 5, 1]
    ... )

.. _voting_regressor:

Voting Regressor
================

The idea behind the :class:`VotingRegressor` is to combine conceptually
different machine learning regressors and return the average predicted values.
Such a regressor can be useful for a set of equally well performing models
in order to balance out their individual weaknesses.

Usage
-----

The following example shows how to fit the VotingRegressor::

    >>> from sklearn.datasets import load_diabetes
    >>> from sklearn.ensemble import GradientBoostingRegressor
    >>> from sklearn.ensemble import RandomForestRegressor
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.ensemble import VotingRegressor

    >>> # Loading some example data
    >>> X, y = load_diabetes(return_X_y=True)

    >>> # Training regressors
    >>> reg1 = GradientBoostingRegressor(random_state=1)
    >>> reg2 = RandomForestRegressor(random_state=1)
    >>> reg3 = LinearRegression()
    >>> ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])
    >>> ereg = ereg.fit(X, y)

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_voting_regressor_001.png
    :target: ../auto_examples/ensemble/plot_voting_regressor.html
    :align: center
    :scale: 75%
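
The averaging can be checked directly on the fitted ensemble. This is a
minimal sketch (plain code, not a doctest) continuing the snippet above; since
no ``weights`` are given, the plain average applies::

    import numpy as np

    # One prediction per fitted base regressor, for the first sample.
    individual = np.asarray([est.predict(X[:1]) for est in ereg.estimators_])
    print(individual.ravel())   # predictions of the three base regressors
    print(individual.mean())    # their average ...
    print(ereg.predict(X[:1]))  # ... which is what VotingRegressor returns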
It 1459needs to be a classifier or a regressor when using :class:`StackingClassifier` 1460or :class:`StackingRegressor`, respectively:: 1461 1462 >>> from sklearn.ensemble import GradientBoostingRegressor 1463 >>> from sklearn.ensemble import StackingRegressor 1464 >>> final_estimator = GradientBoostingRegressor( 1465 ... n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1, 1466 ... random_state=42) 1467 >>> reg = StackingRegressor( 1468 ... estimators=estimators, 1469 ... final_estimator=final_estimator) 1470 1471To train the `estimators` and `final_estimator`, the `fit` method needs 1472to be called on the training data:: 1473 1474 >>> from sklearn.datasets import load_diabetes 1475 >>> X, y = load_diabetes(return_X_y=True) 1476 >>> from sklearn.model_selection import train_test_split 1477 >>> X_train, X_test, y_train, y_test = train_test_split(X, y, 1478 ... random_state=42) 1479 >>> reg.fit(X_train, y_train) 1480 StackingRegressor(...) 1481 1482During training, the `estimators` are fitted on the whole training data 1483`X_train`. They will be used when calling `predict` or `predict_proba`. To 1484generalize and avoid over-fitting, the `final_estimator` is trained on 1485out-samples using :func:`sklearn.model_selection.cross_val_predict` internally. 1486 1487For :class:`StackingClassifier`, note that the output of the ``estimators`` is 1488controlled by the parameter `stack_method` and it is called by each estimator. 1489This parameter is either a string, being estimator method names, or `'auto'` 1490which will automatically identify an available method depending on the 1491availability, tested in the order of preference: `predict_proba`, 1492`decision_function` and `predict`. 1493 1494A :class:`StackingRegressor` and :class:`StackingClassifier` can be used as 1495any other regressor or classifier, exposing a `predict`, `predict_proba`, and 1496`decision_function` methods, e.g.:: 1497 1498 >>> y_pred = reg.predict(X_test) 1499 >>> from sklearn.metrics import r2_score 1500 >>> print('R2 score: {:.2f}'.format(r2_score(y_test, y_pred))) 1501 R2 score: 0.53 1502 1503Note that it is also possible to get the output of the stacked 1504`estimators` using the `transform` method:: 1505 1506 >>> reg.transform(X_test[:5]) 1507 array([[142..., 138..., 146...], 1508 [179..., 182..., 151...], 1509 [139..., 132..., 158...], 1510 [286..., 292..., 225...], 1511 [126..., 124..., 164...]]) 1512 1513In practice, a stacking predictor predicts as good as the best predictor of the 1514base layer and even sometimes outperforms it by combining the different 1515strengths of the these predictors. However, training a stacking predictor is 1516computationally expensive. 1517 1518.. note:: 1519 For :class:`StackingClassifier`, when using `stack_method_='predict_proba'`, 1520 the first column is dropped when the problem is a binary classification 1521 problem. Indeed, both probability columns predicted by each estimator are 1522 perfectly collinear. 1523 1524.. note:: 1525 Multiple stacking layers can be achieved by assigning `final_estimator` to 1526 a :class:`StackingClassifier` or :class:`StackingRegressor`:: 1527 1528 >>> final_layer_rfr = RandomForestRegressor( 1529 ... n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42) 1530 >>> final_layer_gbr = GradientBoostingRegressor( 1531 ... n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42) 1532 >>> final_layer = StackingRegressor( 1533 ... estimators=[('rf', final_layer_rfr), 1534 ... ('gbrt', final_layer_gbr)], 1535 ... 

.. note::
    Multiple stacking layers can be achieved by assigning `final_estimator` to
    a :class:`StackingClassifier` or :class:`StackingRegressor`::

      >>> final_layer_rfr = RandomForestRegressor(
      ...     n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
      >>> final_layer_gbr = GradientBoostingRegressor(
      ...     n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
      >>> final_layer = StackingRegressor(
      ...     estimators=[('rf', final_layer_rfr),
      ...                 ('gbrt', final_layer_gbr)],
      ...     final_estimator=RidgeCV()
      ...     )
      >>> multi_layer_regressor = StackingRegressor(
      ...     estimators=[('ridge', RidgeCV()),
      ...                 ('lasso', LassoCV(random_state=42)),
      ...                 ('knr', KNeighborsRegressor(n_neighbors=20,
      ...                                             metric='euclidean'))],
      ...     final_estimator=final_layer
      ... )
      >>> multi_layer_regressor.fit(X_train, y_train)
      StackingRegressor(...)
      >>> print('R2 score: {:.2f}'
      ...       .format(multi_layer_regressor.score(X_test, y_test)))
      R2 score: 0.53

.. topic:: References

    .. [W1992] Wolpert, David H. "Stacked generalization." Neural networks 5.2
       (1992): 241-259.

    .. [HTF] Hastie, T., Tibshirani, R. and Friedman, J. "The Elements of
       Statistical Learning" (second edition), Springer, 2009.