---
title: "Understanding XGBoost Model on Otto Dataset"
author: "Michaël Benesty"
output:
  rmarkdown::html_vignette:
    css: ../../R-package/vignettes/vignette.css
    number_sections: yes
    toc: yes
---

Introduction
============

**XGBoost** is an implementation of the famous gradient boosting algorithm. This model is often described as a *blackbox*, meaning it works well but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how a human could possibly get a general view of such a model.

While **XGBoost** is known for its fast speed and accurate predictive power, it also comes with various functions to help you understand the model.
The purpose of this RMarkdown document is to demonstrate how easily we can leverage the functions already implemented in the **XGBoost R** package. Of course, everything shown below can be applied to the dataset you may have to manipulate at work or wherever!

First we will prepare the **Otto** dataset and train a model, then we will generate two visualisations to get a clue of what is important to the model, and finally we will see how we can leverage this information.

Preparation of the data
=======================

This part is based on the **R** tutorial example by [Tong He](https://github.com/dmlc/xgboost/blob/master/demo/kaggle-otto/otto_train_pred.R).

First, let's load the packages and the dataset.

```{r loading}
require(xgboost)
require(methods)
require(data.table)
require(magrittr)
train <- fread('data/train.csv', header = TRUE, stringsAsFactors = FALSE)
test <- fread('data/test.csv', header = TRUE, stringsAsFactors = FALSE)
```
> `magrittr` and `data.table` are here to make the code cleaner and much faster.

Let's explore the dataset.

```{r explore}
# Train dataset dimensions
dim(train)

# Training content
train[1:6, 1:5, with = FALSE]

# Test dataset dimensions
dim(test)

# Test content
test[1:6, 1:5, with = FALSE]
```
> We only display the first 6 rows and first 5 columns for convenience.

Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product.

Obviously the first column (`ID`) doesn't contain any useful information.

To let the algorithm focus on real stuff, we will delete it.

```{r clean, results='hide'}
# Delete ID column in training dataset
train[, id := NULL]

# Delete ID column in testing dataset
test[, id := NULL]
```

According to its description, the **Otto** challenge is a multiclass classification challenge. We need to extract the labels (here the names of the different classes) from the dataset. We only have two files (test and training); it seems logical that the training file contains the classes we are looking for. Usually the labels are in the first or the last column. We already know what is in the first column, so let's check the content of the last one.

```{r searchLabel}
# Check the content of the last column
train[1:6, ncol(train), with = FALSE]
# Save the name of the last column
nameLastCol <- names(train)[ncol(train)]
```

The classes are provided as character strings in the `r ncol(train)`th column, called `r nameLastCol`. As you may know, **XGBoost** doesn't support anything else than numbers, so we will convert the classes to `integer`. Moreover, according to the documentation, the labels should start at `0`.

For that purpose, we will:

* extract the target column
* remove `Class_` from each class name
* convert to `integer`
* subtract `1` from the new value

```{r classToIntegers}
# Convert from classes to numbers
y <- train[, nameLastCol, with = FALSE][[1]] %>% gsub('Class_', '', .) %>% {as.integer(.) - 1}

# Display the first 5 labels
y[1:5]
```

We remove the label column from the training dataset, otherwise **XGBoost** would use it to guess the labels!

```{r deleteCols, results='hide'}
train[, (nameLastCol) := NULL]
```

`data.table` is an awesome implementation of `data.frame`; unfortunately it is not a format supported natively by **XGBoost**. We need to convert both datasets (training and test) to `numeric` matrix format.

```{r convertToNumericMatrix}
trainMatrix <- train[, lapply(.SD, as.numeric)] %>% as.matrix
testMatrix <- test[, lapply(.SD, as.numeric)] %>% as.matrix
```

Model training
==============

Before the actual learning, we will use cross-validation to evaluate our error rate.

Basically **XGBoost** will divide the training data into `nfold` parts, then retain one part as test data and perform a training on the rest. It then puts the first part back, retains the second part, does another training, and so on...

You can look at the function documentation for more information.

```{r crossValidation}
numberOfClasses <- max(y) + 1

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)

cv.nrounds <- 5
cv.nfold <- 3

bst.cv <- xgb.cv(param = param, data = trainMatrix, label = y,
                 nfold = cv.nfold, nrounds = cv.nrounds)
```
> As we can see, the error rate is low on the test dataset (for a model trained in about 5 minutes).

Finally, we are ready to train the real model!

```{r modelTraining}
nrounds <- 50
bst <- xgboost(param = param, data = trainMatrix, label = y, nrounds = nrounds)
```

Model understanding
===================

Feature importance
------------------

So far, we have built a model made of **`r nrounds`** trees.

To build a tree, the dataset is divided recursively several times.
At the end of the process, you get groups of observations (here, these observations are **Otto** products).

Each division operation is called a *split*.

Each group at each division level is called a branch, and the deepest level is called a *leaf*.

In the final model, these *leaves* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should be made of one class of **Otto** product only (of course it is not entirely true, but that's what we try to achieve in a minimum of splits).

**Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity than, for instance, the deepest *split*. Intuitively, we understand that the first *split* does most of the work, and the following *splits* focus on smaller parts of the dataset which have been misclassified by the earlier *splits*.

In the same way, in boosting we try to optimize the misclassification at each round (measured by the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*.

The improvement brought by each *split* can be measured; it is called the *gain*.

Each *split* is done on one feature only, at one value.

Let's see what the model looks like.

```{r modelDump}
model <- xgb.dump(bst, with.stats = TRUE)
model[1:10]
```
> For convenience, we are displaying the first 10 lines of the model only.

Clearly, it is not easy to understand what it means.

Basically each line represents a *branch*: there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, and the one taken when the value for this feature is N/A).

Hopefully, **XGBoost** offers a better representation: **feature importance**.
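
As a small illustration of what can be done with this text dump, here is a sketch of how one could pull the per-split *gain* statistic out of it with base R. The dump lines below are made up to mimic the format and are not taken from the model above:

```r
# Hypothetical dump lines mimicking the xgb.dump(with.stats = TRUE) format
# (an assumption for illustration; real lines come from `model` above)
dump_lines <- c("0:[f16<2.5] yes=1,no=2,missing=1,gain=361.3,cover=12923",
                "1:[f29<1.5] yes=3,no=4,missing=3,gain=110.2,cover=5630",
                "3:leaf=-0.019,cover=1765")

# Leaves carry no gain, so keep only the split lines
splits <- grep("gain=", dump_lines, value = TRUE)

# Extract the gain value from each split line
gains <- as.numeric(sub(".*gain=([0-9eE.+-]+),.*", "\\1", splits))
gains
```

Applied to `model`, the same two lines would give you the raw material behind the importance measures shown next.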

Feature importance is about averaging the *gain* of each feature across all *splits* and all *trees*.

Then we can use the function `xgb.plot.importance`.

```{r importanceFeature, fig.align='center', fig.height=5, fig.width=10}
# Get the feature real names
names <- dimnames(trainMatrix)[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)

# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
```

> To make it understandable, we first extract the column names from the `Matrix`.

Interpretation
--------------

In the feature importance plot above, we can see the 10 most important features.

This function gives a color to each bar. These colors represent groups of features: basically a K-means clustering is applied to group the features by importance.

From here you can take several actions. For instance, you can remove the less important features (feature selection process), or go deeper into the interaction between the most important features and the labels.

Or you can just reason about why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information).

Tree graph
----------

Feature importance gives you feature weight information but not the interactions between features.

The **XGBoost R** package has another useful function for that.

Please scroll to the right to see the tree.

```{r treeGraph, dpi=1500, fig.align='left'}
xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)
```

We are just displaying the first two trees here.

On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
Besides, **XGBoost** generates `k` trees at each round for a `k`-class classification problem.
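
One way to see this is to count the trees in the text dump: each tree starts with a `booster[i]` marker. The lines below are a made-up miniature dump (an assumption for illustration, not output from `bst`); applied to the real model, the same count should equal `nrounds * numberOfClasses`:

```r
# Made-up miniature dump: two trees of one split each
# (hypothetical lines mimicking the xgb.dump format)
dump_lines <- c("booster[0]",
                "0:[f1<0.5] yes=1,no=2,missing=1",
                "1:leaf=0.1", "2:leaf=-0.1",
                "booster[1]",
                "0:[f7<2.5] yes=1,no=2,missing=1",
                "1:leaf=0.2", "2:leaf=-0.2")

# Each "booster[i]" line marks the start of a tree
n_trees <- sum(grepl("^booster", dump_lines))
n_trees
```

On the real model, `sum(grepl("^booster", xgb.dump(bst)))` gives the same count for the full forest.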
Therefore the two trees illustrated here are trying to classify data into different classes.

Going deeper
============

There are 4 documents you may also be interested in:

* [xgboostPresentation.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd): general presentation
* [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysis
* [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit): use case
* [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/): very good book to have a good understanding of the model