---
title: "XGBoost presentation"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
bibliography: xgboost.bib
author: Tianqi Chen, Tong He, Michaël Benesty
vignette: >
  %\VignetteIndexEntry{XGBoost presentation}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

XGBoost R Tutorial
==================

## Introduction

**XGBoost** is short for the e**X**treme **G**radient **Boost**ing package.

The purpose of this vignette is to show you how to use **XGBoost** to build a model and make predictions.

It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:

- *linear* model;
- *tree learning* algorithm.

It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users can also define their own objective functions easily.

It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.

It has several features:

* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
* Input Type: it takes several types of input data:
    * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix`;
    * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix`;
    * Data File: local data files;
    * `xgb.DMatrix`: its own class (recommended).
* Sparsity: it accepts *sparse* input for both the *tree booster* and the *linear booster*, and is optimized for *sparse* input;
* Customization: it supports customized objective functions and evaluation functions.

## Installation

### GitHub version

For the weekly updated version (highly recommended), install from *GitHub*:

```{r installGithub, eval=FALSE}
install.packages("drat", repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")
```

> *Windows* users will need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) first.

### CRAN version

Version 0.4-2 is on CRAN, and you can install it with:

```{r, eval=FALSE}
install.packages("xgboost")
```

Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost/).

## Learning

For the purpose of this tutorial, we will load the **XGBoost** package.

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
```

### Dataset presentation

In this example, we are aiming to predict whether a mushroom can be eaten or not (as in many tutorials, the example data are the same as what you will use in your everyday life :-).

The mushroom data is cited from the UCI Machine Learning Repository [@Bache+Lichman:2013].

### Dataset loading

We will load the `agaricus` datasets embedded with the package and link them to variables.

The datasets are already split into:

* `train`: will be used to build the model;
* `test`: will be used to assess the quality of our model.

Why *split* the dataset into two parts?

In the first part we will build our model. In the second part we will want to test it and assess its quality. Without dividing the dataset, we would test the model on data the algorithm has already seen.
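The `agaricus` datasets loaded below are already split for us; for your own data, as the note after the loading chunk mentions, the division is up to you. Below is a minimal, purely illustrative base-R sketch (the data frame `df` and the 80/20 ratio are hypothetical, and this code is not used anywhere else in this vignette):

```{r manualSplitSketch, eval=FALSE}
# hypothetical data frame `df`: draw a random 80/20 train/test split
set.seed(42)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train_df  <- df[train_idx, ]
test_df   <- df[-train_idx, ]
```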
```{r datasetLoading, results='hold', message=F, warning=F}
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
```

> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is beyond the scope of this article; however, the `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).

Each variable is a `list` containing two things, `label` and `data`:

```{r dataList, message=F, warning=F}
str(train)
```

`label` is the outcome of our dataset, meaning it is the binary *classification* we will try to predict.

Let's discover the dimensionality of our datasets.

```{r dataSize, message=F, warning=F}
dim(train$data)
dim(test$data)
```

This dataset is very small so as not to make the **R** package too heavy; however, **XGBoost** is built to manage huge datasets very efficiently.

As seen below, the `data` are stored in a `dgCMatrix`, which is a *sparse* matrix, and the `label` is a `numeric` vector (`{0,1}`):

```{r dataClass, message=F, warning=F}
class(train$data)[1]
class(train$label)
```

### Basic Training using XGBoost

This step is the most critical part of the process for the quality of our model.

#### Basic training

We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.

In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very common to have such a dataset.

We will train a decision tree model using the following parameters:

* `objective = "binary:logistic"`: we will train a binary classification model;
* `max_depth = 2`: the trees won't be deep, because our case is very simple;
* `nthread = 2`: the number of CPU threads we are going to use;
* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.

```{r trainingSparse, message=F, warning=F}
bstSparse <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```

> The more complex the relationship between your features and your `label` is, the more passes you need.

#### Parameter variations

##### Dense matrix

Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.

```{r trainingDense, message=F, warning=F}
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```

##### xgb.DMatrix

**XGBoost** offers a way to group them in an `xgb.DMatrix`. You can even add other metadata to it. This will be useful for the more advanced features we will discover later.

```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```
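As mentioned above, an `xgb.DMatrix` can carry extra metadata besides the label. As a small sketch (not evaluated here, and the uniform weights are purely hypothetical), per-instance weights can be attached with `setinfo`:

```{r setinfoSketch, eval=FALSE}
# attach hypothetical uniform instance weights to the training DMatrix
weights <- rep(1, nrow(train$data))
setinfo(dtrain, "weight", weights)
```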
##### Verbose option

**XGBoost** has several features to help you see how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model's quality.

One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).

```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 0)
```

```{r trainingVerbose1, message=T, warning=F}
# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 1)
```

```{r trainingVerbose2, message=T, warning=F}
# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 2)
```

## Basic prediction using XGBoost

### Perform the prediction

The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

# limit display of predictions to the first 6
print(head(pred))
```

These numbers don't look like a *binary classification* `{0,1}`. We need to perform a simple transformation before being able to use these results.

### Transform the regression into a binary classification

The only thing that **XGBoost** does is a *regression*. **XGBoost** uses the `label` vector to build its *regression* model.

How can we use a *regression* model to perform a binary classification?

If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as `1`. Therefore, we will set the rule that if this probability for a specific datum is `> 0.5` then the observation is classified as `1` (and `0` otherwise).

```{r predictingTest, message=F, warning=F}
prediction <- as.numeric(pred > 0.5)
print(head(prediction))
```

### Measuring model performance

To measure the model performance, we will compute a simple metric, the *average error*.

```{r predictingAverageError, message=F, warning=F}
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
```

> Note that the algorithm has not seen the `test` data during the model construction.

Steps explanation:

1. `as.numeric(pred > 0.5)` applies our rule that when the probability (<=> regression <=> prediction) is `> 0.5` the observation is classified as `1`, and `0` otherwise;
2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the predicted classes;
3. `mean(vectorOfErrors)` computes the *average error* itself.

The most important thing to remember is that **to do a classification, you just do a regression to the** `label` **and then apply a threshold**.

*Multiclass* classification works in a similar way.

This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
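If you want a slightly more detailed view than a single error rate, a plain base-R cross-tabulation of the predicted classes against the true labels (not evaluated here, and nothing XGBoost-specific) shows where the errors occur:

```{r confusionTableSketch, eval=FALSE}
# cross-tabulate predicted classes against true labels
table(predicted = prediction, actual = test$label)
```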
## Advanced features

Most of the features below have been implemented to help you improve your model by offering a better understanding of its content.

### Dataset preparation

For the following advanced features, we need to put the data in an `xgb.DMatrix`, as explained above.

```{r DMatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
```

### Measure learning progress with xgb.train

Both the `xgboost` (simple) and `xgb.train` (advanced) functions train models.

One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you avoid overfitting and optimize the learning time by stopping it as soon as possible (a minimal early-stopping sketch is shown at the end of this section).

One way to measure progress in the learning of a model is to provide **XGBoost** with a second dataset that is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

> In some way it is similar to what we have done above with the average error. The main difference is that above it was done after building the model, whereas now we measure errors during the construction.

For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```{r watchlist, message=F, warning=F}
watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, objective = "binary:logistic")
```

**XGBoost** has computed at each round the same average error metric as seen above (we set `nrounds` to 2, which is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. Maybe there is something to fix. Again, the `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).

For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

```{r watchlist2, message=F, warning=F}
bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
```

> `eval_metric` allows us to monitor two new metrics for each round, `logloss` and `error`.
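As mentioned at the beginning of this section, monitoring a watchlist also lets you stop training as soon as the model stops improving. Here is a minimal sketch (not evaluated; whether `early_stopping_rounds` is available, and the value `3`, are assumptions to check against your installed **XGBoost** version): training stops when the last metric on the last watchlist dataset has not improved for the given number of rounds.

```{r earlyStoppingSketch, eval=FALSE}
# sketch: stop if the test logloss has not improved for 3 consecutive rounds
bst <- xgb.train(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 10,
                 watchlist = watchlist, eval_metric = "logloss",
                 objective = "binary:logistic", early_stopping_rounds = 3)
```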
### Linear boosting

Until now, all the learning we have performed was based on boosted trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).

```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
```

In this specific case, *linear boosting* gets slightly better performance metrics than the decision-tree-based algorithm.

In simple cases, this will happen because there is nothing better than a linear algorithm at catching a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.

### Manipulating xgb.DMatrix

#### Save / Load

Like models, an `xgb.DMatrix` object (which groups both the dataset and the outcome) can also be saved, using the `xgb.DMatrix.save` function.

```{r DMatrixSave, message=F, warning=F}
xgb.DMatrix.save(dtrain, "dtrain.buffer")
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
bst <- xgb.train(data=dtrain2, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, objective = "binary:logistic")
```

```{r DMatrixDel, include=FALSE}
file.remove("dtrain.buffer")
```

#### Information extraction

Information can be extracted from an `xgb.DMatrix` using the `getinfo` function. Hereafter we will extract the `label` data.

```{r getinfo, message=F, warning=F}
label <- getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
print(paste("test-error=", err))
```

### View feature importance/influence from the learnt model

Feature importance is similar to the R gbm package's relative influence (rel.inf).

```
importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
```

#### View the trees from a model

You can dump the trees you learned using `xgb.dump` into a text file.

```{r dump, message=T, warning=F}
xgb.dump(bst, with_stats = TRUE)
```

You can plot the trees from your model using `xgb.plot.tree`:

```
xgb.plot.tree(model = bst)
```

> If you provide a path to the `fname` parameter, you can save the trees to your hard drive.

#### Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

Fortunately for you, **XGBoost** implements such functions.

```{r saveModel, message=F, warning=F}
# save model to binary local file
xgb.save(bst, "xgboost.model")
```

> The `xgb.save` function should return `r TRUE` if everything goes well, and crash otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.

```{r loadModel, message=F, warning=F}
# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
```

```{r clean, include=FALSE}
# delete the created model
file.remove("./xgboost.model")
```

> The result is `0`? We are good!
In some very specific cases, like when you want to pilot **XGBoost** from the `caret` package, you will want to save the model as an *R* binary vector. See below how to do it.

```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
rawVec <- xgb.serialize(bst)

# print class
print(class(rawVec))

# load binary model to R
bst3 <- xgb.load(rawVec)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
```

> Again `0`? It seems that **XGBoost** works pretty well!

## References