1############################# 2Introduction to Boosted Trees 3############################# 4XGBoost stands for "Extreme Gradient Boosting", where the term "Gradient Boosting" originates from the paper *Greedy Function Approximation: A Gradient Boosting Machine*, by Friedman. 5 6The **gradient boosted trees** has been around for a while, and there are a lot of materials on the topic. 7This tutorial will explain boosted trees in a self-contained and principled way using the elements of supervised learning. 8We think this explanation is cleaner, more formal, and motivates the model formulation used in XGBoost. 9 10******************************* 11Elements of Supervised Learning 12******************************* 13XGBoost is used for supervised learning problems, where we use the training data (with multiple features) :math:`x_i` to predict a target variable :math:`y_i`. 14Before we learn about trees specifically, let us start by reviewing the basic elements in supervised learning. 15 16Model and Parameters 17==================== 18The **model** in supervised learning usually refers to the mathematical structure of by which the prediction :math:`y_i` is made from the input :math:`x_i`. 19A common example is a *linear model*, where the prediction is given as :math:`\hat{y}_i = \sum_j \theta_j x_{ij}`, a linear combination of weighted input features. 20The prediction value can have different interpretations, depending on the task, i.e., regression or classification. 21For example, it can be logistic transformed to get the probability of positive class in logistic regression, and it can also be used as a ranking score when we want to rank the outputs. 22 23The **parameters** are the undetermined part that we need to learn from data. In linear regression problems, the parameters are the coefficients :math:`\theta`. 24Usually we will use :math:`\theta` to denote the parameters (there are many parameters in a model, our definition here is sloppy). 25 26Objective Function: Training Loss + Regularization 27================================================== 28With judicious choices for :math:`y_i`, we may express a variety of tasks, such as regression, classification, and ranking. 29The task of **training** the model amounts to finding the best parameters :math:`\theta` that best fit the training data :math:`x_i` and labels :math:`y_i`. In order to train the model, we need to define the **objective function** 30to measure how well the model fit the training data. 31 32A salient characteristic of objective functions is that they consist two parts: **training loss** and **regularization term**: 33 34.. math:: 35 36 \text{obj}(\theta) = L(\theta) + \Omega(\theta) 37 38where :math:`L` is the training loss function, and :math:`\Omega` is the regularization term. The training loss measures how *predictive* our model is with respect to the training data. 39A common choice of :math:`L` is the *mean squared error*, which is given by 40 41.. math:: 42 43 L(\theta) = \sum_i (y_i-\hat{y}_i)^2 44 45Another commonly used loss function is logistic loss, to be used for logistic regression: 46 47.. math:: 48 49 L(\theta) = \sum_i[ y_i\ln (1+e^{-\hat{y}_i}) + (1-y_i)\ln (1+e^{\hat{y}_i})] 50 51The **regularization term** is what people usually forget to add. The regularization term controls the complexity of the model, which helps us to avoid overfitting. 52This sounds a bit abstract, so let us consider the following problem in the following picture. You are asked to *fit* visually a step function given the input data points 53on the upper left corner of the image. 54Which solution among the three do you think is the best fit? 55 56.. image:: https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/step_fit.png 57 :alt: step functions to fit data points, illustrating bias-variance tradeoff 58 59The correct answer is marked in red. Please consider if this visually seems a reasonable fit to you. The general principle is we want both a *simple* and *predictive* model. 60The tradeoff between the two is also referred as **bias-variance tradeoff** in machine learning. 61 62Why introduce the general principle? 63==================================== 64The elements introduced above form the basic elements of supervised learning, and they are natural building blocks of machine learning toolkits. 65For example, you should be able to describe the differences and commonalities between gradient boosted trees and random forests. 66Understanding the process in a formalized way also helps us to understand the objective that we are learning and the reason behind the heuristics such as 67pruning and smoothing. 68 69*********************** 70Decision Tree Ensembles 71*********************** 72Now that we have introduced the elements of supervised learning, let us get started with real trees. 73To begin with, let us first learn about the model choice of XGBoost: **decision tree ensembles**. 74The tree ensemble model consists of a set of classification and regression trees (CART). Here's a simple example of a CART that classifies whether someone will like a hypothetical computer game X. 75 76.. image:: https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/cart.png 77 :width: 100% 78 :alt: a toy example for CART 79 80We classify the members of a family into different leaves, and assign them the score on the corresponding leaf. 81A CART is a bit different from decision trees, in which the leaf only contains decision values. In CART, a real score 82is associated with each of the leaves, which gives us richer interpretations that go beyond classification. 83This also allows for a principled, unified approach to optimization, as we will see in a later part of this tutorial. 84 85Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, 86which sums the prediction of multiple trees together. 87 88.. image:: https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/twocart.png 89 :width: 100% 90 :alt: a toy example for tree ensemble, consisting of two CARTs 91 92Here is an example of a tree ensemble of two trees. The prediction scores of each individual tree are summed up to get the final score. 93If you look at the example, an important fact is that the two trees try to *complement* each other. 94Mathematically, we can write our model in the form 95 96.. math:: 97 98 \hat{y}_i = \sum_{k=1}^K f_k(x_i), f_k \in \mathcal{F} 99 100where :math:`K` is the number of trees, :math:`f` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by 101 102.. math:: 103 104 \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k) 105 106Now here comes a trick question: what is the *model* used in random forests? Tree ensembles! So random forests and boosted trees are really the same models; the 107difference arises from how we train them. This means that, if you write a predictive service for tree ensembles, you only need to write one and it should work 108for both random forests and gradient boosted trees. (See `Treelite <https://treelite.readthedocs.io/en/latest/index.html>`_ for an actual example.) One example of why elements of supervised learning rock. 109 110************* 111Tree Boosting 112************* 113Now that we introduced the model, let us turn to training: How should we learn the trees? 114The answer is, as is always for all supervised learning models: *define an objective function and optimize it*! 115 116Let the following be the objective function (remember it always needs to contain training loss and regularization): 117 118.. math:: 119 120 \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) 121 122Additive Training 123================= 124 125The first question we want to ask: what are the **parameters** of trees? 126You can find that what we need to learn are those functions :math:`f_i`, each containing the structure 127of the tree and the leaf scores. Learning tree structure is much harder than traditional optimization problem where you can simply take the gradient. 128It is intractable to learn all the trees at once. 129Instead, we use an additive strategy: fix what we have learned, and add one new tree at a time. 130We write the prediction value at step :math:`t` as :math:`\hat{y}_i^{(t)}`. Then we have 131 132.. math:: 133 134 \hat{y}_i^{(0)} &= 0\\ 135 \hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\ 136 \hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i)= \hat{y}_i^{(1)} + f_2(x_i)\\ 137 &\dots\\ 138 \hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i) 139 140It remains to ask: which tree do we want at each step? A natural thing is to add the one that optimizes our objective. 141 142.. math:: 143 144 \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\ 145 & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant} 146 147If we consider using mean squared error (MSE) as our loss function, the objective becomes 148 149.. math:: 150 151 \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\ 152 & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant} 153 154The form of MSE is friendly, with a first order term (usually called the residual) and a quadratic term. 155For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form. 156So in the general case, we take the *Taylor expansion of the loss function up to the second order*: 157 158.. math:: 159 160 \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant} 161 162where the :math:`g_i` and :math:`h_i` are defined as 163 164.. math:: 165 166 g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\ 167 h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)}) 168 169After we remove all the constants, the specific objective at step :math:`t` becomes 170 171.. math:: 172 173 \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) 174 175This becomes our optimization goal for the new tree. One important advantage of this definition is that 176the value of the objective function only depends on :math:`g_i` and :math:`h_i`. This is how XGBoost supports custom loss functions. 177We can optimize every loss function, including logistic regression and pairwise ranking, using exactly 178the same solver that takes :math:`g_i` and :math:`h_i` as input! 179 180Model Complexity 181================ 182We have introduced the training step, but wait, there is one important thing, the **regularization term**! 183We need to define the complexity of the tree :math:`\Omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as 184 185.. math:: 186 187 f_t(x) = w_{q(x)}, w \in R^T, q:R^d\rightarrow \{1,2,\cdots,T\} . 188 189Here :math:`w` is the vector of scores on leaves, :math:`q` is a function assigning each data point to the corresponding leaf, and :math:`T` is the number of leaves. 190In XGBoost, we define the complexity as 191 192.. math:: 193 194 \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 195 196Of course, there is more than one way to define the complexity, but this one works well in practice. The regularization is one part most tree packages treat 197less carefully, or simply ignore. This was because the traditional treatment of tree learning only emphasized improving impurity, while the complexity control was left to heuristics. 198By defining it formally, we can get a better idea of what we are learning and obtain models that perform well in the wild. 199 200The Structure Score 201=================== 202Here is the magical part of the derivation. After re-formulating the tree model, we can write the objective value with the :math:`t`-th tree as: 203 204.. math:: 205 206 \text{obj}^{(t)} &\approx \sum_{i=1}^n [g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\ 207 &= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T 208 209where :math:`I_j = \{i|q(x_i)=j\}` is the set of indices of data points assigned to the :math:`j`-th leaf. 210Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. 211We could further compress the expression by defining :math:`G_j = \sum_{i\in I_j} g_i` and :math:`H_j = \sum_{i\in I_j} h_i`: 212 213.. math:: 214 215 \text{obj}^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T 216 217In this equation, :math:`w_j` are independent with respect to each other, the form :math:`G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2` is quadratic and the best :math:`w_j` for a given structure :math:`q(x)` and the best objective reduction we can get is: 218 219.. math:: 220 221 w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\ 222 \text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T 223 224The last equation measures *how good* a tree structure :math:`q(x)` is. 225 226.. image:: https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/struct_score.png 227 :width: 100% 228 :alt: illustration of structure score (fitness) 229 230If all this sounds a bit complicated, let's take a look at the picture, and see how the scores can be calculated. 231Basically, for a given tree structure, we push the statistics :math:`g_i` and :math:`h_i` to the leaves they belong to, 232sum the statistics together, and use the formula to calculate how good the tree is. 233This score is like the impurity measure in a decision tree, except that it also takes the model complexity into account. 234 235Learn the tree structure 236======================== 237Now that we have a way to measure how good a tree is, ideally we would enumerate all possible trees and pick the best one. 238In practice this is intractable, so we will try to optimize one level of the tree at a time. 239Specifically we try to split a leaf into two leaves, and the score it gains is 240 241.. math:: 242 Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma 243 244This formula can be decomposed as 1) the score on the new left leaf 2) the score on the new right leaf 3) The score on the original leaf 4) regularization on the additional leaf. 245We can see an important fact here: if the gain is smaller than :math:`\gamma`, we would do better not to add that branch. This is exactly the **pruning** techniques in tree based 246models! By using the principles of supervised learning, we can naturally come up with the reason these techniques work :) 247 248For real valued data, we usually want to search for an optimal split. To efficiently do so, we place all the instances in sorted order, like the following picture. 249 250.. image:: https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/split_find.png 251 :width: 100% 252 :alt: Schematic of choosing the best split 253 254A left to right scan is sufficient to calculate the structure score of all possible split solutions, and we can find the best split efficiently. 255 256.. note:: Limitation of additive tree learning 257 258 Since it is intractable to enumerate all possible tree structures, we add one split at a time. This approach works well most of the time, but there are some edge cases that fail due to this approach. For those edge cases, training results in a degenerate model because we consider only one feature dimension at a time. See `Can Gradient Boosting Learn Simple Arithmetic? <http://mariofilho.com/can-gradient-boosting-learn-simple-arithmetic/>`_ for an example. 259 260********************** 261Final words on XGBoost 262********************** 263Now that you understand what boosted trees are, you may ask, where is the introduction for XGBoost? 264XGBoost is exactly a tool motivated by the formal principle introduced in this tutorial! 265More importantly, it is developed with both deep consideration in terms of **systems optimization** and **principles in machine learning**. 266The goal of this library is to push the extreme of the computation limits of machines to provide a **scalable**, **portable** and **accurate** library. 267Make sure you try it out, and most importantly, contribute your piece of wisdom (code, examples, tutorials) to the community! 268