---
title: "Programming with dplyr"
description: >
  Most dplyr verbs use "tidy evaluation", a special type of non-standard
  evaluation. In this vignette, you'll learn the two basic forms, data masking
  and tidy selection, and how you can program with them using either functions
  or for loops.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Programming with dplyr}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
set.seed(1014)
```

## Introduction

Most dplyr verbs use **tidy evaluation** in some way. Tidy evaluation is a special type of non-standard evaluation used throughout the tidyverse. There are two basic forms found in dplyr:

* `arrange()`, `count()`, `filter()`, `group_by()`, `mutate()`, and
  `summarise()` use **data masking** so that you can use data variables as if
  they were variables in the environment (i.e. you write `my_variable` not
  `df$my_variable`).

* `across()`, `relocate()`, `rename()`, `select()`, and `pull()` use
  **tidy selection** so you can easily choose variables based on their position,
  name, or type (e.g. `starts_with("x")` or `where(is.numeric)`).

To determine whether a function argument uses data masking or tidy selection, look at the documentation: in the arguments list, you'll see `<data-masking>` or `<tidy-select>`.

Data masking and tidy selection make interactive data exploration fast and fluid, but they add some new challenges when you attempt to use them indirectly, such as in a for loop or a function. This vignette shows you how to overcome those challenges. We'll first go over the basics of data masking and tidy selection, talk about how to use them indirectly, and then show you a number of recipes to solve common problems.
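
To make the two forms concrete, here is a minimal sketch that uses the built-in `mtcars` dataset. The object names `four_cyl` and `d_cols` are only illustrative:

```{r}
library(dplyr)

# Data masking: `cyl` and `mpg` are looked up as columns of `mtcars`,
# so they can be written as if they were ordinary variables.
four_cyl <- mtcars %>% filter(cyl == 4, mpg > 30)

# Tidy selection: columns are chosen by a name pattern rather than computed on.
d_cols <- mtcars %>% select(starts_with("d"))
names(d_cols)
```

In the first call the expressions are evaluated against the values in the columns; in the second, the helper only decides which columns to keep.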

This vignette will give you the minimum knowledge you need to be an effective programmer with tidy evaluation. If you'd like to learn more about the underlying theory, or precisely how it's different from non-standard evaluation, we recommend that you read the Metaprogramming chapters in [_Advanced R_](https://adv-r.hadley.nz).

```{r setup, message = FALSE}
library(dplyr)
```

## Data masking

Data masking makes data manipulation faster because it requires less typing. In most (but not all[^subset]) base R functions you need to refer to variables with `$`, leading to code that repeats the name of the data frame many times:

```{r, results = FALSE}
starwars[starwars$homeworld == "Naboo" & starwars$species == "Human", ]
```

[^subset]: dplyr's `filter()` is inspired by base R's `subset()`. `subset()` provides data masking, but not with tidy evaluation, so the techniques described in this chapter don't apply to it.

The dplyr equivalent of this code is more concise because data masking means you only need to type `starwars` once:

```{r, results = FALSE}
starwars %>% filter(homeworld == "Naboo", species == "Human")
```

### Data- and env-variables

The key idea behind data masking is that it blurs the line between the two different meanings of the word "variable":

* **env-variables** are "programming" variables that live in an environment.
  They are usually created with `<-`.

* **data-variables** are "statistical" variables that live in a data frame.
  They usually come from data files (e.g. `.csv`, `.xls`), or are created
  by manipulating existing variables.

To make those definitions a little more concrete, take this piece of code:

```{r}
df <- data.frame(x = runif(3), y = runif(3))
df$x
```

It creates an env-variable, `df`, that contains two data-variables, `x` and `y`. Then it extracts the data-variable `x` out of the env-variable `df` using `$`.
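
The same distinction shows up inside a data-masking verb. The sketch below uses hypothetical names (`toy`, `threshold`, `above`) and mixes both kinds of variable in one expression:

```{r}
library(dplyr)

toy <- data.frame(x = c(1, 5, 10))
threshold <- 4  # an env-variable, created with `<-`

# `x` is found in the data frame; `threshold` is found in the environment.
above <- toy %>% filter(x > threshold)
above$x
```

Nothing in the call marks which name is which: `filter()` looks in the data first and falls back to the environment.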

I think this blurring of the meaning of "variable" is a really nice feature for interactive data analysis because it allows you to refer to data-vars as is, without any prefix. And this seems to be fairly intuitive since many newer R users will attempt to write `diamonds[x == 0 | y == 0, ]`.

Unfortunately, this benefit does not come for free. When you start to program with these tools, you're going to have to grapple with the distinction. This will be hard because you've never had to think about it before, so it'll take a while for your brain to learn these new concepts and categories. However, once you've teased apart the idea of "variable" into data-variable and env-variable, I think you'll find it fairly straightforward to use.

### Indirection

The main challenge of programming with functions that use data masking arises when you introduce some indirection, i.e. when you want to get the data-variable from an env-variable instead of directly typing the data-variable's name. There are two main cases:

* When you have the data-variable in a function argument (i.e. an env-variable
  that holds a promise[^promise]), you need to **embrace** the argument by
  surrounding it in doubled braces, like `filter(df, {{ var }})`.

  The following function uses embracing to create a wrapper around
  `summarise()` that computes the minimum and maximum values of a variable,
  as well as the number of observations that were summarised:

  ```{r, results = FALSE}
  var_summary <- function(data, var) {
    data %>%
      summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
  }
  mtcars %>%
    group_by(cyl) %>%
    var_summary(mpg)
  ```

* When you have an env-variable that is a character vector, you need to index
  into the `.data` pronoun with `[[`, like
  `summarise(df, mean = mean(.data[[var]]))`.

  The following example uses `.data` to count the number of unique values in
  each variable of `mtcars`:

  ```{r, results = FALSE}
  for (var in names(mtcars)) {
    mtcars %>% count(.data[[var]]) %>% print()
  }
  ```

  Note that `.data` is not a data frame; it's a special construct, a pronoun,
  that allows you to access the current variables either directly, with
  `.data$x`, or indirectly, with `.data[[var]]`. Don't expect other functions
  to work with it.

[^promise]: In R, arguments are lazily evaluated, which means that until you attempt to use them, they don't hold a value, just a __promise__ that describes how to compute the value. You can learn more at <https://adv-r.hadley.nz/functions.html#lazy-evaluation>

## Tidy selection

Data masking makes it easy to compute on values within a dataset. Tidy selection is a complementary tool that makes it easy to work with the columns of a dataset.

### The tidyselect DSL

Underneath all functions that use tidy selection is the [tidyselect](https://tidyselect.r-lib.org/) package. It provides a miniature domain specific language that makes it easy to select columns by name, position, or type. For example:

* `select(df, 1)` selects the first column;
  `select(df, last_col())` selects the last column.

* `select(df, c(a, b, c))` selects columns `a`, `b`, and `c`.

* `select(df, starts_with("a"))` selects all columns whose name starts with "a";
  `select(df, ends_with("z"))` selects all columns whose name ends with "z".

* `select(df, where(is.numeric))` selects all numeric columns.

You can see more details in `?dplyr_tidy_select`.

### Indirection

As with data masking, tidy selection makes a common task easier at the cost of making a less common task harder. When you want to use tidy select indirectly with the column specification stored in an intermediate variable, you'll need to learn some new tools.
Again, there are two forms of indirection:

* When you have the data-variable in an env-variable that is a function
  argument, you use the same technique as data masking: you **embrace** the
  argument by surrounding it in doubled braces.

  The following function summarises a data frame by computing
  the mean of all variables selected by the user:

  ```{r, results = FALSE}
  summarise_mean <- function(data, vars) {
    data %>% summarise(n = n(), across({{ vars }}, mean))
  }
  mtcars %>%
    group_by(cyl) %>%
    summarise_mean(where(is.numeric))
  ```

* When you have an env-variable that is a character vector, you need to use
  `all_of()` or `any_of()` depending on whether you want the
  function to error if a variable is not found.

  The following code uses `all_of()` to select all of the variables found
  in a character vector; then `!` plus `all_of()` to select all of the
  variables *not* found in a character vector:

  ```{r, results = FALSE}
  vars <- c("mpg", "vs")
  mtcars %>% select(all_of(vars))
  mtcars %>% select(!all_of(vars))
  ```

## How tos

The following examples solve a grab bag of common problems. We show you the minimum amount of code so that you can get the basic idea; most real problems will require more code or combining multiple techniques.

### User-supplied data

If you check the documentation, you'll see that the `.data` argument of dplyr verbs never uses data masking or tidy select.
That means you don't need to do anything special in your function:

```{r}
mutate_y <- function(data) {
  mutate(data, y = a + x)
}
```

### Eliminating `R CMD check` `NOTE`s

If you're writing a package and you have a function that uses data-variables:

```{r}
my_summary_function <- function(data) {
  data %>%
    filter(x > 0) %>%
    group_by(grp) %>%
    summarise(y = mean(y), n = n())
}
```

You'll get an `R CMD check` `NOTE`:

```
N  checking R code for possible problems
   my_summary_function: no visible binding for global variable ‘x’, ‘grp’, ‘y’
   Undefined global functions or variables:
     x grp y
```

You can eliminate this by using `.data$var` and importing `.data` from its source in the [rlang](https://rlang.r-lib.org/) package (the underlying package that implements tidy evaluation):

```{r}
#' @importFrom rlang .data
my_summary_function <- function(data) {
  data %>%
    filter(.data$x > 0) %>%
    group_by(.data$grp) %>%
    summarise(y = mean(.data$y), n = n())
}
```

### One or more user-supplied expressions

If you want the user to supply an expression that's passed onto an argument which uses data masking or tidy select, embrace the argument:

```{r}
my_summarise <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean(mass))
}
```

This generalises in a straightforward way if you want to use one user-supplied expression in multiple places:

```{r}
my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}
```

If you want the user to provide multiple expressions, embrace each of them:

```{r}
my_summarise3 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(mean = mean({{ mean_var }}), sd = sd({{ sd_var }}))
}
```

If you want to use the
names of variables in the output, you can use glue syntax in conjunction with `:=`:

```{r}
my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}
my_summarise5 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
```

### Any number of user-supplied expressions

If you want to take an arbitrary number of user supplied expressions, use `...`. This is most often useful when you want to give the user full control over a single part of the pipeline, like a `group_by()` or a `mutate()`.

```{r}
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE), height = mean(height, na.rm = TRUE))
}

starwars %>% my_summarise(homeworld)
starwars %>% my_summarise(sex, gender)
```

When you use `...` in this way, make sure that any other arguments start with `.` to reduce the chances of argument clashes; see <https://design.tidyverse.org/dots-prefix.html> for more details.

### Transforming user-supplied variables

If you want the user to provide a set of data-variables that are then transformed, use `across()`:

```{r}
my_summarise <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(., na.rm = TRUE)))
}
starwars %>%
  group_by(species) %>%
  my_summarise(c(mass, height))
```

You can use this same idea for multiple sets of input data-variables:

```{r}
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean))
}
```

Use the `.names` argument to `across()` to control the names of the output.

```{r}
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{.col}"))
}
```

### Loop over multiple variables

If you have a character vector of variable names, and want to operate on them with a for loop, index into the special `.data` pronoun:

```{r, results = FALSE}
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
```

This same technique works with for loop alternatives like the base R `apply()` family and the purrr `map()` family:

```{r, results = FALSE}
mtcars %>%
  names() %>%
  purrr::map(~ count(mtcars, .data[[.x]]))
```

### Use a variable from a Shiny input

Many Shiny input controls return character vectors, so you can use the same approach as above: `.data[[input$var]]`.

```{r, eval = FALSE}
library(shiny)
library(dplyr)
library(ggplot2)  # provides the `diamonds` dataset

ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  tableOutput("output")
)
server <- function(input, output, session) {
  # input$var is a string, so look the column up through the .data pronoun
  data <- reactive(filter(diamonds, .data[[input$var]] > 0))
  output$output <- renderTable(head(data()))
}
```

See <https://mastering-shiny.org/action-tidy.html> for more details and case studies.
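
The `.data[[input$var]]` pattern can be tried without running an app. The sketch below assumes a hypothetical helper, `filter_positive()`, that takes the column name as a string, in the way a Shiny input would supply it:

```{r}
library(dplyr)

# Hypothetical helper: `var` is a string, as a Shiny input would supply it.
filter_positive <- function(data, var) {
  data %>% filter(.data[[var]] > 0)
}

am_cars <- filter_positive(mtcars, "am")
nrow(am_cars)
```

Keeping this logic in a plain function may also make it easier to test separately from the reactive code.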