---
title: "Selecting Variables"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Selecting Variables}
output:
  knitr:::html_vignette:
    toc: yes
---

```{r ex_setup, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  digits = 3,
  collapse = TRUE,
  comment = "#>"
  )
options(digits = 3)
```

When recipe steps are used, there are several ways to select which variables or features a step should act on.

Three main characteristics of a variable can be queried:

 * the name of the variable
 * the data type (e.g. numeric or nominal)
 * the role that was declared by the recipe

The manual pages `?selections` and `?has_role` describe the available selection methods.

To illustrate this, the credit data will be used:

```{r credit}
library(recipes)
library(modeldata)

data("credit_data")
str(credit_data)

rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)
rec
```

Before any steps are added, the information on the original variables is:

```{r var_info_orig}
summary(rec, original = TRUE)
```

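As a rough sketch of how name, type, and role translate into selectors, the chunk below centers the numeric columns three ways. The object names (`by_name`, `by_type`, `by_role`), the `selected()` helper, and the choice of `step_center()` are illustrative assumptions, not part of the example above. `tidy()` is used only to list the columns each selector resolved to once the recipe was prepared.

```{r selector_kinds}
library(recipes)
library(modeldata)

data("credit_data")
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)

# By name: list the columns explicitly
by_name <- rec %>% step_center(Seniority, Time, Age)

# By type: every numeric column, whatever its role
by_type <- rec %>% step_center(all_numeric())

# By role: every predictor, minus the nominal ones (centering needs numbers)
by_role <- rec %>% step_center(all_predictors(), -all_nominal())

# Hypothetical helper: which columns did the first step resolve to?
selected <- function(x) {
  tidy(prep(x, training = credit_data), number = 1)$terms
}

selected(by_name)
selected(by_type)
selected(by_role)
```
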
We can add a step to compute dummy variables for the non-numeric data (in practice, this would come after any imputation steps):

```{r dummy_1}
dummied <- rec %>% step_dummy(all_nominal())
```

This will capture _any_ variables that are either character strings or factors: `Status` and `Records`. However, since `Status` is our outcome, we might want to keep it as a factor, so we can _subtract_ that variable out either by name or by role:

```{r dummy_2}
dummied <- rec %>% step_dummy(Records) # or
dummied <- rec %>% step_dummy(all_nominal(), -Status) # or
dummied <- rec %>% step_dummy(all_nominal(), -all_outcomes())
```

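One way to check that these three definitions are interchangeable for this recipe is to prepare each one and compare the resulting column names. This is a minimal sketch: the object names and the `baked_names()` helper are hypothetical, introduced only for the comparison, and `dummied` itself is left untouched.

```{r dummy_compare}
library(recipes)
library(modeldata)

data("credit_data")
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)

by_name      <- rec %>% step_dummy(Records)
by_exclusion <- rec %>% step_dummy(all_nominal(), -Status)
by_role      <- rec %>% step_dummy(all_nominal(), -all_outcomes())

# Hypothetical helper: prepare a recipe and return the processed column names
baked_names <- function(x) {
  names(bake(prep(x, training = credit_data), new_data = credit_data))
}

# Both comparisons should be TRUE if the selectors pick the same column
identical(baked_names(by_name), baked_names(by_exclusion))
identical(baked_names(by_exclusion), baked_names(by_role))
```
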
Using the last definition of `dummied`:

```{r dummy_3}
dummied <- prep(dummied, training = credit_data)
with_dummy <- bake(dummied, new_data = credit_data)
with_dummy
```

`Status` is unaffected.

One important aspect of selecting variables in steps is that the variable names and types may change as steps are executed. In the above example, `Records` is a factor variable before the step is executed. Afterwards, `Records` is gone and the binary variable `Records_yes` is in its place. One reason to have general selection routines like `all_predictors()` or `contains()` is to be able to select variables that have not been created yet.

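To illustrate selecting a column that does not exist yet, the sketch below adds a second step that refers to the dummy variable by a name pattern. The object names (`rec_late`, `prepped_late`) and the use of `step_center()` as the downstream step are assumptions made for the example; any step that accepts numeric columns could stand in for it.

```{r late_selection}
library(recipes)
library(modeldata)

data("credit_data")
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)

rec_late <- rec %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # `Records_yes` is not in the data yet; the selector is resolved
  # when the recipe is prepared, after the dummy step has run
  step_center(contains("Records_"))

prepped_late <- prep(rec_late, training = credit_data)

# Columns that the second step acted on
tidy(prepped_late, number = 2)$terms

# Columns in the processed data that the original data did not have
setdiff(
  names(bake(prepped_late, new_data = credit_data)),
  names(credit_data)
)
```
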
76