1--- 2title: "Reference semantics" 3date: "`r Sys.Date()`" 4output: 5 rmarkdown::html_vignette 6vignette: > 7 %\VignetteIndexEntry{Reference semantics} 8 %\VignetteEngine{knitr::rmarkdown} 9 \usepackage[utf8]{inputenc} 10--- 11 12```{r, echo = FALSE, message = FALSE} 13require(data.table) 14knitr::opts_chunk$set( 15 comment = "#", 16 error = FALSE, 17 tidy = FALSE, 18 cache = FALSE, 19 collapse = TRUE) 20``` 21This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the *"Introduction to data.table"* vignette first. 22 23*** 24 25## Data {#data} 26 27We will use the same `flights` data as in the *"Introduction to data.table"* vignette. 28 29```{r echo = FALSE} 30options(width = 100L) 31``` 32 33```{r} 34flights <- fread("flights14.csv") 35flights 36dim(flights) 37``` 38 39## Introduction 40 41In this vignette, we will 42 431. first discuss reference semantics briefly and look at the two different forms in which the `:=` operator can be used 44 452. then see how we can *add/update/delete* columns *by reference* in `j` using the `:=` operator and how to combine with `i` and `by`. 46 473. and finally we will look at using `:=` for its *side-effect* and how we can avoid the side effects using `copy()`. 48 49## 1. Reference semantics 50 51All the operations we have seen so far in the previous vignette resulted in a new data set. We will see how to *add* new column(s), *update* or *delete* existing column(s) on the original data. 52 53### a) Background 54 55Before we look at *reference semantics*, consider the *data.frame* shown below: 56 57```{r} 58DF = data.frame(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18) 59DF 60``` 61 62When we did: 63 64```{r eval = FALSE} 65DF$c <- 18:13 # (1) -- replace entire column 66# or 67DF$c[DF$ID == "b"] <- 15:13 # (2) -- subassign in column 'c' 68``` 69 70both (1) and (2) resulted in deep copy of the entire data.frame in versions of `R` versions `< 3.1`. [It copied more than once](https://stackoverflow.com/q/23898969/559784). To improve performance by avoiding these redundant copies, *data.table* utilised the [available but unused `:=` operator in R](https://stackoverflow.com/q/7033106/559784). 71 72Great performance improvements were made in `R v3.1` as a result of which only a *shallow* copy is made for (1) and not *deep* copy. However, for (2) still, the entire column is *deep* copied even in `R v3.1+`. This means the more columns one subassigns to in the *same query*, the more *deep* copies R does. 73 74#### *shallow* vs *deep* copy {.bs-callout .bs-callout-info} 75 76A *shallow* copy is just a copy of the vector of column pointers (corresponding to the columns in a *data.frame* or *data.table*). The actual data is not physically copied in memory. 77 78A *deep* copy on the other hand copies the entire data to another location in memory. 79 80# 81With *data.table's* `:=` operator, absolutely no copies are made in *both* (1) and (2), irrespective of R version you are using. This is because `:=` operator updates *data.table* columns *in-place* (by reference). 82 83### b) The `:=` operator 84 85It can be used in `j` in two ways: 86 87(a) The `LHS := RHS` form 88 89 ```{r eval = FALSE} 90 DT[, c("colA", "colB", ...) := list(valA, valB, ...)] 91 92 # when you have only one column to assign to you 93 # can drop the quotes and list(), for convenience 94 DT[, colA := valA] 95 ``` 96 97(b) The functional form 98 99 ```{r eval = FALSE} 100 DT[, `:=`(colA = valA, # valA is assigned to colA 101 colB = valB, # valB is assigned to colB 102 ... 103 )] 104 ``` 105 106#### {.bs-callout .bs-callout-warning} 107 108Note that the code above explains how `:=` can be used. They are not working examples. We will start using them on `flights` *data.table* from the next section. 109 110# 111 112#### {.bs-callout .bs-callout-info} 113 114* In (a), `LHS` takes a character vector of column names and `RHS` a *list of values*. `RHS` just needs to be a `list`, irrespective of how its generated (e.g., using `lapply()`, `list()`, `mget()`, `mapply()` etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance. 115 116* On the other hand, (b) is handy if you would like to jot some comments down for later. 117 118* The result is returned *invisibly*. 119 120* Since `:=` is available in `j`, we can combine it with `i` and `by` operations just like the aggregation operations we saw in the previous vignette. 121 122# 123 124In the two forms of `:=` shown above, note that we don't assign the result back to a variable. Because we don't need to. The input *data.table* is modified by reference. Let's go through examples to understand what we mean by this. 125 126For the rest of the vignette, we will work with `flights` *data.table*. 127 128## 2. Add/update/delete columns *by reference* 129 130### a) Add columns by reference {#ref-j} 131 132#### -- How can we add columns *speed* and *total delay* of each flight to `flights` *data.table*? 133 134```{r} 135flights[, `:=`(speed = distance / (air_time/60), # speed in mph (mi/h) 136 delay = arr_delay + dep_delay)] # delay in minutes 137head(flights) 138 139## alternatively, using the 'LHS := RHS' form 140# flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)] 141``` 142 143#### Note that {.bs-callout .bs-callout-info} 144 145* We did not have to assign the result back to `flights`. 146 147* The `flights` *data.table* now contains the two newly added columns. This is what we mean by *added by reference*. 148 149* We used the functional form so that we could add comments on the side to explain what the computation does. You can also see the `LHS := RHS` form (commented). 150 151### b) Update some rows of columns by reference - *sub-assign* by reference {#ref-i-j} 152 153Let's take a look at all the `hours` available in the `flights` *data.table*: 154 155```{r} 156# get all 'hours' in flights 157flights[, sort(unique(hour))] 158``` 159 160We see that there are totally `25` unique values in the data. Both *0* and *24* hours seem to be present. Let's go ahead and replace *24* with *0*. 161 162#### -- Replace those rows where `hour == 24` with the value `0` 163 164```{r} 165# subassign by reference 166flights[hour == 24L, hour := 0L] 167``` 168 169#### {.bs-callout .bs-callout-info} 170 171* We can use `i` along with `:=` in `j` the very same way as we have already seen in the *"Introduction to data.table"* vignette. 172 173* Column `hour` is replaced with `0` only on those *row indices* where the condition `hour == 24L` specified in `i` evaluates to `TRUE`. 174 175* `:=` returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty `[]` at the end of the query as shown below: 176 177 ```{r} 178 flights[hour == 24L, hour := 0L][] 179 ``` 180 181# 182Let's look at all the `hours` to verify. 183 184```{r} 185# check again for '24' 186flights[, sort(unique(hour))] 187``` 188 189#### Exercise: {.bs-callout .bs-callout-warning #update-by-reference-question} 190 191What is the difference between `flights[hour == 24L, hour := 0L]` and `flights[hour == 24L][, hour := 0L]`? Hint: The latter needs an assignment (`<-`) if you would want to use the result later. 192 193If you can't figure it out, have a look at the `Note` section of `?":="`. 194 195### c) Delete column by reference 196 197#### -- Remove `delay` column 198 199```{r} 200flights[, c("delay") := NULL] 201head(flights) 202 203## or using the functional form 204# flights[, `:=`(delay = NULL)] 205``` 206 207#### {.bs-callout .bs-callout-info #delete-convenience} 208 209* Assigning `NULL` to a column *deletes* that column. And it happens *instantly*. 210 211* We can also pass column numbers instead of names in the `LHS`, although it is good programming practice to use column names. 212 213* When there is just one column to delete, we can drop the `c()` and double quotes and just use the column name *unquoted*, for convenience. That is: 214 215 ```{r eval = FALSE} 216 flights[, delay := NULL] 217 ``` 218 219 is equivalent to the code above. 220 221### d) `:=` along with grouping using `by` {#ref-j-by} 222 223We have already seen the use of `i` along with `:=` in [Section 2b](#ref-i-j). Let's now see how we can use `:=` along with `by`. 224 225#### -- How can we add a new column which contains for each `orig,dest` pair the maximum speed? 226 227```{r} 228flights[, max_speed := max(speed), by = .(origin, dest)] 229head(flights) 230``` 231 232#### {.bs-callout .bs-callout-info} 233 234* We add a new column `max_speed` using the `:=` operator by reference. 235 236* We provide the columns to group by the same way as shown in the *Introduction to data.table* vignette. For each group, `max(speed)` is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. `flights` *data.table* is modified *in-place*. 237 238* We could have also provided `by` with a *character vector* as we saw in the *Introduction to data.table* vignette, e.g., `by = c("origin", "dest")`. 239 240# 241 242### e) Multiple columns and `:=` 243 244#### -- How can we add two more columns computing `max()` of `dep_delay` and `arr_delay` for each month, using `.SD`? 245 246```{r} 247in_cols = c("dep_delay", "arr_delay") 248out_cols = c("max_dep_delay", "max_arr_delay") 249flights[, c(out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols] 250head(flights) 251``` 252#### {.bs-callout .bs-callout-info} 253 254* We use the `LHS := RHS` form. We store the input column names and the new columns to add in separate variables and provide them to `.SDcols` and for `LHS` (for better readability). 255 256* Note that since we allow assignment by reference without quoting column names when there is only one column as explained in [Section 2c](#delete-convenience), we can not do `out_cols := lapply(.SD, max)`. That would result in adding one new column named `out_col`. Instead we should do either `c(out_cols)` or simply `(out_cols)`. Wrapping the variable name with `(` is enough to differentiate between the two cases. 257 258* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the *"Introduction to data.table"* vignette. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group. 259 260# 261Before moving on to the next section, let's clean up the newly created columns `speed`, `max_speed`, `max_dep_delay` and `max_arr_delay`. 262 263```{r} 264# RHS gets automatically recycled to length of LHS 265flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL] 266head(flights) 267``` 268 269## 3) `:=` and `copy()` 270 271`:=` modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. And at other times it may not be desirable to modify the original object, in which case we can use `copy()` function, as we will see in a moment. 272 273### a) `:=` for its side effect 274 275Let's say we would like to create a function that would return the *maximum speed* for each month. But at the same time, we would also like to add the column `speed` to *flights*. We could write a simple function as follows: 276 277```{r} 278foo <- function(DT) { 279 DT[, speed := distance / (air_time/60)] 280 DT[, .(max_speed = max(speed)), by = month] 281} 282ans = foo(flights) 283head(flights) 284head(ans) 285``` 286#### {.bs-callout .bs-callout-info} 287 288* Note that the new column `speed` has been added to `flights` *data.table*. This is because `:=` performs operations by reference. Since `DT` (the function argument) and `flights` refer to the same object in memory, modifying `DT` also reflects on `flights`. 289 290* And `ans` contains the maximum speed for each month. 291 292### b) The `copy()` function 293 294In the previous section, we used `:=` for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a *data.table* object to a function, and might want to use the `:=` operator, but *wouldn't* want to update the original object. We can accomplish this using the function `copy()`. 295 296#### {.bs-callout .bs-callout-info} 297 298The `copy()` function *deep* copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object. 299 300# 301 302There are two particular places where `copy()` function is essential: 303 3041. Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified *by reference*. As an example, let's consider the task in the previous section, except we don't want to modify `flights` by reference. 305 306 Let's first delete the `speed` column we generated in the previous section. 307 308 ```{r} 309 flights[, speed := NULL] 310 ``` 311 Now, we could accomplish the task as follows: 312 313 ```{r} 314 foo <- function(DT) { 315 DT <- copy(DT) ## deep copy 316 DT[, speed := distance / (air_time/60)] ## doesn't affect 'flights' 317 DT[, .(max_speed = max(speed)), by = month] 318 } 319 ans <- foo(flights) 320 head(flights) 321 head(ans) 322 ``` 323 324#### {.bs-callout .bs-callout-info} 325 326* Using `copy()` function did not update `flights` *data.table* by reference. It doesn't contain the column `speed`. 327 328* And `ans` contains the maximum speed corresponding to each month. 329 330However we could improve this functionality further by *shallow* copying instead of *deep* copying. In fact, we would very much like to [provide this functionality for `v1.9.8`](https://github.com/Rdatatable/data.table/issues/617). We will touch up on this again in the *data.table design* vignette. 331 332# 333 3342. When we store the column names on to a variable, e.g., `DT_n = names(DT)`, and then *add/update/delete* column(s) *by reference*. It would also modify `DT_n`, unless we do `copy(names(DT))`. 335 336 ```{r} 337 DT = data.table(x = 1L, y = 2L) 338 DT_n = names(DT) 339 DT_n 340 341 ## add a new column by reference 342 DT[, z := 3L] 343 344 ## DT_n also gets updated 345 DT_n 346 347 ## use `copy()` 348 DT_n = copy(names(DT)) 349 DT[, w := 4L] 350 351 ## DT_n doesn't get updated 352 DT_n 353 ``` 354 355## Summary 356 357#### The `:=` operator {.bs-callout .bs-callout-info} 358 359* It is used to *add/update/delete* columns by reference. 360 361* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the *Introduction to data.table* vignette. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*. 362 363* We can use `:=` for its side effect or use `copy()` to not modify the original object while updating by reference. 364 365# 366 367So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the next vignette *"Keys and fast binary search based subset"* to perform *blazing fast subsets* by *keying data.tables*. 368 369*** 370 371