1---
2title: "XGBoost presentation"
3output:
4  rmarkdown::html_vignette:
5    css: vignette.css
6    number_sections: yes
7    toc: yes
8bibliography: xgboost.bib
9author: Tianqi Chen, Tong He, Michaël Benesty
10vignette: >
11  %\VignetteIndexEntry{XGBoost presentation}
12  %\VignetteEngine{knitr::rmarkdown}
13  \usepackage[utf8]{inputenc}
14---
15
16XGBoost R Tutorial
17==================
18
19## Introduction
20
21
**XGBoost** is short for e**X**treme **G**radient **Boost**ing.
23
The purpose of this vignette is to show you how to use **XGBoost** to build a model and make predictions.
25
It is an efficient and scalable implementation of the gradient boosting framework of @friedman2000additive and @friedman2001greedy. Two solvers are included:
27
- a *linear* model;
- a *tree learning* algorithm.
30
It supports various objective functions, including *regression*, *classification* and *ranking*. The package is designed to be extensible, so that users can easily define their own objective functions.
32
33It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.
34
35It has several features:
36
37* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
38* Input Type: it takes several types of input data:
    * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix`;
    * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix`;
    * Data File: local data files;
    * `xgb.DMatrix`: its own class (recommended).
* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input;
44* Customization: it supports customized objective functions and evaluation functions.
45
46## Installation
47
48
49### GitHub version
50
51
For the weekly updated version (highly recommended), install from *GitHub*:
53
54```{r installGithub, eval=FALSE}
55install.packages("drat", repos="https://cran.rstudio.com")
56drat:::addRepo("dmlc")
57install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")
58```
59
> *Windows* users will need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) first.
61
62### CRAN version
63
64
Version 0.4-2 is on CRAN, and you can install it with:
66
67```{r, eval=FALSE}
68install.packages("xgboost")
69```
70
Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost/).
72
73## Learning
74
75
For the purpose of this tutorial, we will load the **XGBoost** package.
77
78```{r libLoading, results='hold', message=F, warning=F}
79require(xgboost)
80```
81
82### Dataset presentation
83
84
In this example, we aim to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are exactly the kind you will use in your everyday life :-).
86
The mushroom data is taken from the UCI Machine Learning Repository [@Bache+Lichman:2013].
88
89### Dataset loading
90
91
We will load the `agaricus` datasets bundled with the package and assign them to variables.
93
The datasets are already split into:
95
* `train`: will be used to build the model;
97* `test`: will be used to assess the quality of our model.
98
Why *split* the dataset into two parts?
100
On the first part we will build our model. On the second part we will test it and assess its quality. Without dividing the dataset, we would test the model on data the algorithm has already seen.
102
103```{r datasetLoading, results='hold', message=F, warning=F}
104data(agaricus.train, package='xgboost')
105data(agaricus.test, package='xgboost')
106train <- agaricus.train
107test <- agaricus.test
108```
109
> In the real world, it is up to you to make this division between `train` and `test` data. How to do it is out of the scope of this article, however the `caret` package may [help](http://topepo.github.io/caret/data-splitting.html); a rough idea is sketched below.
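
Below is a sketch of a random split with base **R** (not run here). It assumes a hypothetical data frame or matrix `full_data` holding all your observations; the 80/20 proportion is arbitrary.

```{r splitSketch, eval=FALSE}
# `full_data` is a placeholder for your own dataset; 80/20 is an arbitrary choice
set.seed(12)
train_idx  <- sample(nrow(full_data), size = floor(0.8 * nrow(full_data)))
train_part <- full_data[train_idx, ]
test_part  <- full_data[-train_idx, ]
```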
111
112Each variable is a `list` containing two things, `label` and `data`:
113
114```{r dataList, message=F, warning=F}
115str(train)
116```
117
`label` is the outcome of our dataset, i.e. the binary target we will try to predict.
119
120Let's discover the dimensionality of our datasets.
121
122```{r dataSize, message=F, warning=F}
123dim(train$data)
124dim(test$data)
125```
126
This dataset is kept very small so as not to make the **R** package too heavy; however, **XGBoost** is built to manage huge datasets very efficiently.
128
As seen below, the `data` are stored in a `dgCMatrix`, which is a *sparse* matrix, and the `label` is a `numeric` vector (`{0,1}`):
130
131```{r dataClass, message=F, warning=F}
132class(train$data)[1]
133class(train$label)
134```
135
136### Basic Training using XGBoost
137
138
139This step is the most critical part of the process for the quality of our model.
140
141#### Basic training
142
143We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.
144
In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, the memory size is reduced. It is very common to have such a dataset.
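
To get an intuition of the memory saving, here is a rough sketch (not run here; exact sizes depend on your session) comparing the sparse storage with its dense equivalent using `object.size`:

```{r sparseSizeSketch, eval=FALSE}
# footprint of the sparse storage versus its dense equivalent
print(object.size(train$data), units = "Kb")             # sparse dgCMatrix
print(object.size(as.matrix(train$data)), units = "Kb")  # dense matrix
```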
146
We will train a decision tree model using the following parameters:
148
* `objective = "binary:logistic"`: we will train a binary classification model;
* `max_depth = 2`: the trees won't be deep, because our case is very simple;
151* `nthread = 2`: the number of CPU threads we are going to use;
152* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.
153
154```{r trainingSparse, message=F, warning=F}
155bstSparse <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
156```
157
> The more complex the relationship between your features and your `label` is, the more passes you need.
159
160#### Parameter variations
161
162##### Dense matrix
163
164Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.
165
166```{r trainingDense, message=F, warning=F}
167bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
168```
169
170##### xgb.DMatrix
171
**XGBoost** offers a way to group them in an `xgb.DMatrix`. You can even add other metadata to it (see the sketch after the next chunk). This will be useful for the more advanced features we will discover later.
173
174```{r trainingDmatrix, message=F, warning=F}
175dtrain <- xgb.DMatrix(data = train$data, label = train$label)
176bstDMatrix <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
177```
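
As an illustration of attaching extra metadata, here is a sketch (not run here) using the `setinfo` function; the uniform weight vector is made up purely for the example:

```{r setinfoSketch, eval=FALSE}
# attach observation weights to the xgb.DMatrix (all equal to 1, illustration only)
weights <- rep(1, nrow(train$data))
setinfo(dtrain, "weight", weights)
```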
178
179##### Verbose option
180
**XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to good model quality.
182
One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).
184
185```{r trainingVerbose0, message=T, warning=F}
186# verbose = 0, no message
187bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 0)
188```
189
190```{r trainingVerbose1, message=T, warning=F}
191# verbose = 1, print evaluation metric
192bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 1)
193```
194
195```{r trainingVerbose2, message=T, warning=F}
196# verbose = 2, also print information about tree
197bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 2)
198```
199
200## Basic prediction using XGBoost
201
202
203## Perform the prediction
204
205
206The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.
207
208```{r predicting, message=F, warning=F}
209pred <- predict(bst, test$data)
210
211# size of the prediction vector
212print(length(pred))
213
214# limit display of predictions to the first 10
print(head(pred, 10))
216```
217
These numbers don't look like *binary classification* `{0,1}`. We need to perform a simple transformation before we can use these results.
219
## Transform the regression into a binary classification
221
222
The only thing that **XGBoost** does is *regression*. **XGBoost** uses the `label` vector to build its *regression* model.
224
225How can we use a *regression* model to perform a binary classification?
226
227If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as `1`. Therefore, we will set the rule that if this probability for a specific datum is `> 0.5` then the observation is classified as `1` (or `0` otherwise).
228
229```{r predictingTest, message=F, warning=F}
230prediction <- as.numeric(pred > 0.5)
231print(head(prediction))
232```
233
234## Measuring model performance
235
236
237To measure the model performance, we will compute a simple metric, the *average error*.
238
239```{r predictingAverageError, message=F, warning=F}
240err <- mean(as.numeric(pred > 0.5) != test$label)
241print(paste("test-error=", err))
242```
243
244> Note that the algorithm has not seen the `test` data during the model construction.
245
Explanation of each step (see also the sketch after this list):
247
1. `as.numeric(pred > 0.5)` applies our rule that when the probability (<=> regression <=> prediction) is `> 0.5` the observation is classified as `1`, and as `0` otherwise;
2. `predictionVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the predicted classes;
3. `mean(vectorOfErrors)` computes the *average error* itself.
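
The same computation, spelled out step by step in a short sketch equivalent to the one-liner above (`predicted_class` and `errors` are illustrative names):

```{r stepByStepError, eval=FALSE}
predicted_class <- as.numeric(pred > 0.5)   # apply the 0.5 threshold
errors <- predicted_class != test$label     # TRUE where the prediction is wrong
mean(errors)                                # proportion of wrong predictions
```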
251
The most important thing to remember is that **to do a classification, you just do a regression on the** `label` **and then apply a threshold**.
253
254*Multiclass* classification works in a similar way.
255
256This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
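
For *multiclass* classification (mentioned above), here is a minimal sketch (not run here) using the classic `iris` dataset, which is not part of this tutorial; the key differences are the `multi:softprob` objective and the `num_class` parameter:

```{r multiclassSketch, eval=FALSE}
data(iris)
ir_label <- as.integer(iris$Species) - 1   # classes must be 0-based integers
ir_data  <- as.matrix(iris[, -5])
bstMulti <- xgboost(data = ir_data, label = ir_label, max_depth = 2, eta = 1,
                    nthread = 2, nrounds = 2,
                    objective = "multi:softprob", num_class = 3)
# predict() returns one probability per class; reshape and pick the most likely class
pred_prob  <- matrix(predict(bstMulti, ir_data), ncol = 3, byrow = TRUE)
pred_class <- max.col(pred_prob) - 1
```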
257
258## Advanced features
259
260
Most of the features below have been implemented to help you improve your model by offering a better understanding of its content.
262
263
264### Dataset preparation
265
266
For the following advanced features, we need to put the data in an `xgb.DMatrix`, as explained above.
268
269```{r DMatrix, message=F, warning=F}
270dtrain <- xgb.DMatrix(data = train$data, label=train$label)
271dtest <- xgb.DMatrix(data = test$data, label=test$label)
272```
273
274### Measure learning progress with xgb.train
275
276
277Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.
278
One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you avoid overfitting and optimize the learning time by stopping the training as soon as possible.
280
One way to measure the learning progress of a model is to provide **XGBoost** with a second dataset that is already classified. It can then learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
282
> In some ways this is similar to what we have done above with the average error. The main difference is that above we measured the error after building the model, whereas here we measure it during the construction.
284
For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.
286
287```{r watchlist, message=F, warning=F}
288watchlist <- list(train=dtrain, test=dtest)
289
290bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, objective = "binary:logistic")
291```
292
**XGBoost** has computed at each round the same average error metric as seen above (we set `nrounds` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
294
295Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
296
If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. Maybe there is something to fix. Again, the `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).
298
For a better understanding of the learning progression, you may want to monitor a specific metric or even use multiple evaluation metrics.
300
301```{r watchlist2, message=F, warning=F}
302bst <- xgb.train(data=dtrain, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
303```
304
305> `eval_metric` allows us to monitor two new metrics for each round, `logloss` and `error`.
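
To stop the learning as soon as the test metric stops improving (as discussed above), a sketch using the `early_stopping_rounds` parameter of `xgb.train` could look like this (the values of `nrounds` and `early_stopping_rounds` are arbitrary; this chunk is not run here):

```{r earlyStoppingSketch, eval=FALSE}
# training stops if the last metric on the last dataset of the watchlist
# has not improved for 3 consecutive rounds
bstEarly <- xgb.train(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 100,
                      watchlist = watchlist, eval_metric = "error",
                      objective = "binary:logistic", early_stopping_rounds = 3)
```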
306
307### Linear boosting
308
309
Until now, all the learning we have performed was based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).
311
312```{r linearBoosting, message=F, warning=F}
313bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
314```
315
In this specific case, *linear boosting* gets slightly better performance metrics than the decision-tree-based algorithm.
317
In simple cases this will happen because there is nothing better than a linear algorithm to capture a linear link. However, decision trees are much better at capturing a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of which one to use.
319
320### Manipulating xgb.DMatrix
321
322
323#### Save / Load
324
Like models, an `xgb.DMatrix` object (which groups both the dataset and the outcome) can be saved to a file, using the `xgb.DMatrix.save` function.
326
327```{r DMatrixSave, message=F, warning=F}
328xgb.DMatrix.save(dtrain, "dtrain.buffer")
329# to load it in, simply call xgb.DMatrix
330dtrain2 <- xgb.DMatrix("dtrain.buffer")
331bst <- xgb.train(data=dtrain2, max_depth=2, eta=1, nthread = 2, nrounds=2, watchlist=watchlist, objective = "binary:logistic")
332```
333
334```{r DMatrixDel, include=FALSE}
335file.remove("dtrain.buffer")
336```
337
338#### Information extraction
339
Information can be extracted from an `xgb.DMatrix` using the `getinfo` function. Hereafter we will extract the `label` data.
341
342```{r getinfo, message=F, warning=F}
label <- getinfo(dtest, "label")
344pred <- predict(bst, dtest)
345err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
346print(paste("test-error=", err))
347```
348
349### View feature importance/influence from the learnt model
350
351
Feature importance is similar to the relative influence (`rel.inf`) of the R `gbm` package.
353
```r
355importance_matrix <- xgb.importance(model = bst)
356print(importance_matrix)
357xgb.plot.importance(importance_matrix = importance_matrix)
358```
359
360#### View the trees from a model
361
362
You can dump the trees you have learned, using `xgb.dump`, either to the console (as below) or to a text file (see the sketch after the chunk).
364
365```{r dump, message=T, warning=F}
366xgb.dump(bst, with_stats = TRUE)
367```
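
If you would rather write the dump to an actual text file, as mentioned above, here is a sketch (not run here; the file name is arbitrary):

```{r dumpToFile, eval=FALSE}
# write the trees with their statistics to a local text file
xgb.dump(bst, fname = "xgb.model.dump", with_stats = TRUE)
```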
368
You can plot the trees from your model using `xgb.plot.tree`:
370
```r
372xgb.plot.tree(model = bst)
373```
374
> If you provide a path to the `fname` parameter, you can save the trees to your hard drive.
376
377#### Save and load models
378
379
Maybe your dataset is big and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
381
Fortunately for you, **XGBoost** implements such functions.
383
384```{r saveModel, message=F, warning=F}
385# save model to binary local file
386xgb.save(bst, "xgboost.model")
387```
388
> The `xgb.save` function should return `r TRUE` if everything goes well, and crashes otherwise.
390
391An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
392
393```{r loadModel, message=F, warning=F}
394# load binary model to R
395bst2 <- xgb.load("xgboost.model")
396pred2 <- predict(bst2, test$data)
397
398# And now the test
399print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
400```
401
402```{r clean, include=FALSE}
403# delete the created model
404file.remove("./xgboost.model")
405```
406
> The result is `0`? We are good!
408
In some very specific cases, like when you want to pilot **XGBoost** from the `caret` package, you will want to save the model as an *R* binary vector. See below how to do it.
410
411```{r saveLoadRBinVectorModel, message=F, warning=F}
412# save model to R's raw vector
413rawVec <- xgb.serialize(bst)
414
415# print class
416print(class(rawVec))
417
418# load binary model to R
419bst3 <- xgb.load(rawVec)
420pred3 <- predict(bst3, test$data)
421
# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
424```
425
426> Again `0`? It seems that `XGBoost` works pretty well!
427
428## References
429