1---
2title: "Introduction to data.table"
3date: "`r Sys.Date()`"
4output:
5  rmarkdown::html_vignette
6vignette: >
7  %\VignetteIndexEntry{Introduction to data.table}
8  %\VignetteEngine{knitr::rmarkdown}
9  \usepackage[utf8]{inputenc}
10---
11
12```{r, echo = FALSE, message = FALSE}
13require(data.table)
14knitr::opts_chunk$set(
15  comment = "#",
16    error = FALSE,
17     tidy = FALSE,
18    cache = FALSE,
19 collapse = TRUE
20)
21```
22
23This vignette introduces the `data.table` syntax, its general form, how to *subset* rows, *select and compute* on columns, and perform aggregations *by group*. Familiarity with `data.frame` data structure from base R is useful, but not essential to follow this vignette.
24
25***
26
27## Data analysis using `data.table`
28
29Data manipulation operations such as *subset*, *group*, *update*, *join* etc., are all inherently related. Keeping these *related operations together* allows for:
30
31* *concise* and *consistent* syntax irrespective of the set of operations you would like to perform to achieve your end goal.
32
33* performing analysis *fluidly* without the cognitive burden of having to map each operation to a particular function from a potentially huge set of functions available before performing the analysis.
34
35* *automatically* optimising operations internally, and very effectively, by knowing precisely the data required for each operation, leading to very fast and memory efficient code.
36
37Briefly, if you are interested in reducing *programming* and *compute* time tremendously, then this package is for you. The philosophy that `data.table` adheres to makes this possible. Our goal is to illustrate it through this series of vignettes.
38
39## Data {#data}
40
41In this vignette, we will use [NYC-flights14](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv) data obtained by [flights](https://github.com/arunsrinivasan/flights) package (available on GitHub only). It contains On-Time flights data from the Bureau of Transporation Statistics for all the flights that departed from New York City airports in 2014 (inspired by [nycflights13](https://github.com/hadley/nycflights13)). The data is available only for Jan-Oct'14.
42
43We can use `data.table`'s fast-and-friendly file reader `fread` to load `flights` directly as follows:
44
45```{r echo = FALSE}
46options(width = 100L)
47```
48
49```{r}
50input <- if (file.exists("flights14.csv")) {
51   "flights14.csv"
52} else {
53  "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
54}
55flights <- fread(input)
56flights
57dim(flights)
58```
59
60Aside: `fread` accepts `http` and `https` URLs directly as well as operating system commands such as `sed` and `awk` output. See `?fread` for examples.
61
62## Introduction
63
64In this vignette, we will
65
661. Start with basics - what is a `data.table`, its general form, how to *subset* rows, how to *select and compute* on columns;
67
682. Then we will look at performing data aggregations by group
69
70## 1. Basics {#basics-1}
71
72### a) What is `data.table`? {#what-is-datatable-1a}
73
74`data.table` is an R package that provides **an enhanced version** of `data.frame`s, which are the standard data structure for storing data in `base` R. In the [Data](#data) section above, we already created a `data.table` using `fread()`. We can also create one using the `data.table()` function. Here is an example:
75
76```{r}
77DT = data.table(
78  ID = c("b","b","b","a","a","c"),
79  a = 1:6,
80  b = 7:12,
81  c = 13:18
82)
83DT
84class(DT$ID)
85```
86
87You can also convert existing objects to a `data.table` using `setDT()` (for `data.frame`s and `list`s) and `as.data.table()` (for other structures); the difference is beyond the scope of this vignette, see `?setDT` and `?as.data.table` for more details.
88
89#### Note that: {.bs-callout .bs-callout-info}
90
91* Unlike `data.frame`s, columns of `character` type are *never* converted to `factors` by default.
92
93* Row numbers are printed with a `:` in order to visually separate the row number from the first column.
94
95* When the number of rows to print exceeds the global option `datatable.print.nrows` (default = `r getOption("datatable.print.nrows")`), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the [Data](#data) section). If you've had a lot of experience with `data.frame`s, you may have found yourself waiting around while larger tables print-and-page, sometimes seemingly endlessly. You can query the default number like so:
96
97    ```{.r}
98    getOption("datatable.print.nrows")
99    ```
100
101* `data.table` doesn't set or use *row names*, ever. We will see why in the *"Keys and fast binary search based subset"* vignette.
102
103### b) General form - in what way is a `data.table` *enhanced*? {#enhanced-1b}
104
105In contrast to a `data.frame`, you can do *a lot more* than just subsetting rows and selecting columns within the frame of a `data.table`, i.e., within `[ ... ]` (NB: we might also refer to writing things inside `DT[...]` as "querying `DT`", in analogy to SQL). To understand it we will have to first look at the *general form* of `data.table` syntax, as shown below:
106
107```{r eval = FALSE}
108DT[i, j, by]
109
110##   R:                 i                 j        by
111## SQL:  where | order by   select | update  group by
112```
113
114Users who have an SQL background might perhaps immediately relate to this syntax.
115
116#### The way to read it (out loud) is: {.bs-callout .bs-callout-info}
117
118Take `DT`, subset/reorder rows using `i`, then calculate `j`, grouped by `by`.
119
120Let's begin by looking at `i` and `j` first - subsetting rows and operating on columns.
121
122### c) Subset rows in `i` {#subset-i-1c}
123
124#### -- Get all the flights with "JFK" as the origin airport in the month of June.
125
126```{r}
127ans <- flights[origin == "JFK" & month == 6L]
128head(ans)
129```
130
131#### {.bs-callout .bs-callout-info}
132
133* Within the frame of a `data.table`, columns can be referred to *as if they are variables*, much like in SQL or Stata. Therefore, we simply refer to `origin` and `month` as if they are variables. We do not need to add the prefix `flights$` each time. Nevertheless, using `flights$origin` and `flights$month` would work just fine.
134
135* The *row indices* that satisfy the condition `origin == "JFK" & month == 6L` are computed, and since there is nothing else left to do, all columns from `flights` at rows corresponding to those *row indices* are simply returned as a `data.table`.
136
137* A comma after the condition in `i` is not required. But `flights[origin == "JFK" & month == 6L, ]` would work just fine. In `data.frame`s, however, the comma is necessary.
138
139#### -- Get the first two rows from `flights`. {#subset-rows-integer}
140
141```{r}
142ans <- flights[1:2]
143ans
144```
145#### {.bs-callout .bs-callout-info}
146
147* In this case, there is no condition. The row indices are already provided in `i`. We therefore return a `data.table` with all columns from `flights` at rows for those *row indices*.
148
149#### -- Sort `flights` first by column `origin` in *ascending* order, and then by `dest` in *descending* order:
150
151We can use the R function `order()` to accomplish this.
152
153```{r}
154ans <- flights[order(origin, -dest)]
155head(ans)
156```
157
158#### `order()` is internally optimised {.bs-callout .bs-callout-info}
159
160* We can use "-" on a `character` columns within the frame of a `data.table` to sort in decreasing order.
161
162* In addition, `order(...)` within the frame of a `data.table` uses `data.table`'s internal fast radix order `forder()`. This sort provided such a compelling improvement over R's `base::order` that the R project adopted the `data.table` algorithm as its default sort in 2016 for R 3.3.0, see `?sort` and the [R Release NEWS](https://cran.r-project.org/doc/manuals/r-release/NEWS.pdf).
163
164We will discuss `data.table`'s fast order in more detail in the *`data.table` internals* vignette.
165
166### d) Select column(s) in `j` {#select-j-1d}
167
168#### -- Select `arr_delay` column, but return it as a *vector*.
169
170```{r}
171ans <- flights[, arr_delay]
172head(ans)
173```
174
175#### {.bs-callout .bs-callout-info}
176
177* Since columns can be referred to as if they are variables within the frame of `data.table`s, we directly refer to the *variable* we want to subset. Since we want *all the rows*, we simply skip `i`.
178
179* It returns *all* the rows for the column `arr_delay`.
180
181#### -- Select `arr_delay` column, but return as a `data.table` instead.
182
183```{r}
184ans <- flights[, list(arr_delay)]
185head(ans)
186```
187
188#### {.bs-callout .bs-callout-info}
189
190* We wrap the *variables* (column names) within `list()`, which ensures that a `data.table` is  returned. In case of a single column name, not wrapping with `list()` returns a vector instead, as seen in the [previous example](#select-j-1d).
191
192* `data.table` also allows wrapping columns with `.()` instead of `list()`. It is an *alias* to `list()`; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer `.()` for conciseness, so we will continue to use `.()` hereafter.
193
194`data.table`s (and `data.frame`s) are internally `list`s as well, with the stipulation that each element has the same length and the `list` has a `class` attribute. Allowing `j` to return a `list` enables converting and returning `data.table` very efficiently.
195
196#### Tip: {.bs-callout .bs-callout-warning #tip-1}
197
198As long as `j-expression` returns a `list`, each element of the list will be converted to a column in the resulting `data.table`. This makes `j` quite powerful, as we will see shortly. It is also very important to understand this for when you'd like to make more complicated queries!!
199
200#### -- Select both `arr_delay` and `dep_delay` columns.
201
202```{r}
203ans <- flights[, .(arr_delay, dep_delay)]
204head(ans)
205
206## alternatively
207# ans <- flights[, list(arr_delay, dep_delay)]
208```
209
210#### {.bs-callout .bs-callout-info}
211
212* Wrap both columns within `.()`, or `list()`. That's it.
213
214#### -- Select both `arr_delay` and `dep_delay` columns *and* rename them to `delay_arr` and `delay_dep`.
215
216Since `.()` is just an alias for `list()`, we can name columns as we would while creating a `list`.
217
218```{r}
219ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
220head(ans)
221```
222
223That's it.
224
225### e) Compute or *do* in `j`
226
227#### -- How many trips have had total delay < 0?
228
229```{r}
230ans <- flights[, sum( (arr_delay + dep_delay) < 0 )]
231ans
232```
233
234#### What's happening here? {.bs-callout .bs-callout-info}
235
236* `data.table`'s `j` can handle more than just *selecting columns* - it can handle *expressions*, i.e., *computing on columns*. This shouldn't be surprising, as columns can be referred to as if they are variables. Then we should be able to *compute* by calling functions on those variables. And that's what precisely happens here.
237
238### f) Subset in `i` *and* do in `j`
239
240#### -- Calculate the average arrival and departure delay for all flights with "JFK" as the origin airport in the month of June.
241
242```{r}
243ans <- flights[origin == "JFK" & month == 6L,
244               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
245ans
246```
247
248#### {.bs-callout .bs-callout-info}
249
250* We first subset in `i` to find matching *row indices* where `origin` airport equals `"JFK"`, and `month` equals `6L`. We *do not* subset the _entire_ `data.table` corresponding to those rows _yet_.
251
252* Now, we look at `j` and find that it uses only *two columns*. And what we have to do is to compute their `mean()`. Therefore we subset just those columns corresponding to the matching rows, and compute their `mean()`.
253
254Because the three main components of the query (`i`, `j` and `by`) are *together* inside `[...]`, `data.table` can see all three and optimise the query altogether *before evaluation*, not each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns _besides_ `arr_delay` and `dep_delay`), for both speed and memory efficiency.
255
256#### -- How many trips have been made in 2014 from "JFK" airport in the month of June?
257
258```{r}
259ans <- flights[origin == "JFK" & month == 6L, length(dest)]
260ans
261```
262
263The function `length()` requires an input argument. We just needed to compute the number of rows in the subset. We could have used any other column as input argument to `length()` really. This approach is reminiscent of `SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6` in SQL.
264
265This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where `data.table` provides a *special symbol* `.N` for it.
266
267#### Special symbol `.N`: {.bs-callout .bs-callout-info #special-N}
268
269`.N` is a special built-in variable that holds the number of observations _in the current group_. It is particularly useful when combined with `by` as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
270
271So we can now accomplish the same task by using `.N` as follows:
272
273```{r}
274ans <- flights[origin == "JFK" & month == 6L, .N]
275ans
276```
277
278#### {.bs-callout .bs-callout-info}
279
280* Once again, we subset in `i` to get the *row indices* where `origin` airport equals *"JFK"*, and `month` equals *6*.
281
282* We see that `j` uses only `.N` and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices).
283
284* Note that we did not wrap `.N` with `list()` or `.()`. Therefore a vector is returned.
285
286We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, it would have to subset the entire `data.table` first corresponding to the *row indices* in `i` *and then* return the rows using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette.
287
288### g) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
289
290If you're writing out the column names explicitly, there's no difference vis-a-vis `data.frame` (since v1.9.8).
291
292#### -- Select both `arr_delay` and `dep_delay` columns the `data.frame` way.
293
294```{r j_cols_no_with}
295ans <- flights[, c("arr_delay", "dep_delay")]
296head(ans)
297```
298
299If you've stored the desired columns in a character vector, there are two options: Using the `..` prefix, or using the `with` argument.
300
301#### -- Select columns named in a variable using the `..` prefix
302
303```{r j_cols_dot_prefix}
304select_cols = c("arr_delay", "dep_delay")
305flights[ , ..select_cols]
306```
307
308For those familiar with the Unix terminal, the `..` prefix should be reminiscent of the "up-one-level" command, which is analogous to what's happening here -- the `..` signals to `data.table` to look for the `select_cols` variable "up-one-level", i.e., in the global environment in this case.
309
310#### -- Select columns named in a variable using `with = FALSE`
311
312```{r j_cols_with}
313flights[ , select_cols, with = FALSE]
314```
315
316The argument is named `with` after the R function `with()` because of similar functionality. Suppose you have a `data.frame` `DF` and you'd like to subset all rows where `x > 1`. In `base` R you can do the following:
317
318```{r}
319DF = data.frame(x = c(1,1,1,2,2,3,3,3), y = 1:8)
320
321## (1) normal way
322DF[DF$x > 1, ] # data.frame needs that ',' as well
323
324## (2) using with
325DF[with(DF, x > 1), ]
326```
327
328* Using `with()` in (2) allows using `DF`'s column `x` as if it were a variable.
329
330    Hence the argument name `with` in `data.table`. Setting `with = FALSE` disables the ability to refer to columns as if they are variables, thereby restoring the "`data.frame` mode".
331
332* We can also *deselect* columns using `-` or `!`. For example:
333
334    ```{r eval = FALSE}
335    ## not run
336
337    # returns all columns except arr_delay and dep_delay
338    ans <- flights[, !c("arr_delay", "dep_delay")]
339    # or
340    ans <- flights[, -c("arr_delay", "dep_delay")]
341    ```
342
343* From `v1.9.5+`, we can also select by specifying start and end column names, e.g., `year:day` to select the first three columns.
344
345    ```{r eval = FALSE}
346    ## not run
347
348    # returns year,month and day
349    ans <- flights[, year:day]
350    # returns day, month and year
351    ans <- flights[, day:year]
352    # returns all columns except year, month and day
353    ans <- flights[, -(year:day)]
354    ans <- flights[, !(year:day)]
355    ```
356
357    This is particularly handy while working interactively.
358
359`with = TRUE` is the default in `data.table` because we can do much more by allowing `j` to handle expressions - especially when combined with `by`, as we'll see in a moment.
360
361## 2. Aggregations
362
363We've already seen `i` and `j` from `data.table`'s general form in the previous section. In this section, we'll see how they can be combined together with `by` to perform operations *by group*. Let's look at some examples.
364
365### a) Grouping using `by`
366
367#### -- How can we get the number of trips corresponding to each origin airport?
368
369```{r}
370ans <- flights[, .(.N), by = .(origin)]
371ans
372
373## or equivalently using a character vector in 'by'
374# ans <- flights[, .(.N), by = "origin"]
375```
376
377#### {.bs-callout .bs-callout-info}
378
379* We know `.N` [is a special variable](#special-N) that holds the number of rows in the current group. Grouping by `origin` obtains the number of rows, `.N`, for each group.
380
381* By doing `head(flights)` you can see that the origin airports occur in the order *"JFK"*, *"LGA"* and *"EWR"*. The original order of grouping variables is preserved in the result. _This is important to keep in mind!_
382
383* Since we did not provide a name for the column returned in `j`, it was named `N`  automatically by recognising the special symbol `.N`.
384
385* `by` also accepts a character vector of column names. This is particularly useful for  coding programmatically, e.g., designing a function with the grouping columns as a (`character` vector) function argument.
386
387* When there's only one column or expression to refer to in `j` and `by`, we can drop the `.()` notation. This is purely for convenience. We could instead do:
388
389    ```{r}
390    ans <- flights[, .N, by = origin]
391    ans
392    ```
393
394    We'll use this convenient form wherever applicable hereafter.
395
396#### -- How can we calculate the number of trips for each origin airport for carrier code `"AA"`? {#origin-.N}
397
398The unique carrier code `"AA"` corresponds to *American Airlines Inc.*
399
400```{r}
401ans <- flights[carrier == "AA", .N, by = origin]
402ans
403```
404
405#### {.bs-callout .bs-callout-info}
406
407* We first obtain the row indices for the expression `carrier == "AA"` from `i`.
408
409* Using those *row indices*, we obtain the number of rows while grouped by `origin`. Once again no columns are actually materialised here, because the `j-expression` does not require any columns to be actually subsetted and is therefore fast and memory efficient.
410
411#### -- How can we get the total number of trips for each `origin, dest` pair for carrier code `"AA"`? {#origin-dest-.N}
412
413```{r}
414ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
415head(ans)
416
417## or equivalently using a character vector in 'by'
418# ans <- flights[carrier == "AA", .N, by = c("origin", "dest")]
419```
420
421#### {.bs-callout .bs-callout-info}
422
423* `by` accepts multiple columns. We just provide all the columns by which to group by. Note the use of `.()` again in `by` -- again, this is just shorthand for `list()`, and `list()` can be used here as well. Again, we'll stick with `.()` in this vignette.
424
425#### -- How can we get the average arrival and departure delay for each `orig,dest` pair for each month for carrier code `"AA"`? {#origin-dest-month}
426
427```{r}
428ans <- flights[carrier == "AA",
429        .(mean(arr_delay), mean(dep_delay)),
430        by = .(origin, dest, month)]
431ans
432```
433
434#### {.bs-callout .bs-callout-info}
435
436* Since we did not provide column names for the expressions in `j`, they were automatically generated as `V1` and `V2`.
437
438* Once again, note that the input order of grouping columns is preserved in the result.
439
440Now what if we would like to order the result by those grouping columns `origin`, `dest` and `month`?
441
442### b) Sorted `by`: `keyby`
443
444`data.table` retaining the original order of groups is intentional and by design. There are cases when preserving the original order is essential. But at times we would like to automatically sort by the variables in our grouping.
445
446#### -- So how can we directly order by all the grouping variables?
447
448```{r}
449ans <- flights[carrier == "AA",
450        .(mean(arr_delay), mean(dep_delay)),
451        keyby = .(origin, dest, month)]
452ans
453```
454
455#### {.bs-callout .bs-callout-info}
456
457* All we did was to change `by` to `keyby`. This automatically orders the result by the grouping variables in increasing order. In fact, due to the internal implementation of `by` first requiring a sort before recovering the original table's order, `keyby` is typically faster than `by` because it doesn't require this second step.
458
459**Keys:** Actually `keyby` does a little more than *just ordering*. It also *sets a key* after ordering by setting an `attribute` called `sorted`.
460
461We'll learn more about `keys` in the *Keys and fast binary search based subset* vignette; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.
462
463### c) Chaining
464
465Let's reconsider the task of [getting the total number of trips for each `origin, dest` pair for carrier *"AA"*](#origin-dest-.N).
466
467```{r}
468ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
469```
470
471#### -- How can we order `ans` using the columns `origin` in ascending order, and `dest` in descending order?
472
473We can store the intermediate result in a variable, and then use `order(origin, -dest)` on that variable. It seems fairly straightforward.
474
475```{r}
476ans <- ans[order(origin, -dest)]
477head(ans)
478```
479
480#### {.bs-callout .bs-callout-info}
481
482* Recall that we can use `-` on a `character` column in `order()` within the frame of a `data.table`. This is possible to due `data.table`'s internal query optimisation.
483
484* Also recall that `order(...)` with the frame of a `data.table` is *automatically optimised* to use `data.table`'s internal fast radix order `forder()` for speed.
485
486But this requires having to assign the intermediate result and then overwriting that result. We can do one better and avoid this intermediate assignment to a temporary variable altogether by *chaining* expressions.
487
488```{r}
489ans <- flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
490head(ans, 10)
491```
492
493#### {.bs-callout .bs-callout-info}
494
495* We can tack expressions one after another, *forming a chain* of operations, i.e., `DT[ ... ][ ... ][ ... ]`.
496
497* Or you can also chain them vertically:
498
499    ```{r eval = FALSE}
500    DT[ ...
501       ][ ...
502         ][ ...
503           ]
504    ```
505
506### d) Expressions in `by`
507
508#### -- Can `by` accept *expressions* as well or does it just take columns?
509
510Yes it does. As an example, if we would like to find out how many flights started late but arrived early (or on time), started and arrived late etc...
511
512```{r}
513ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]
514ans
515```
516
517#### {.bs-callout .bs-callout-info}
518
519* The last row corresponds to `dep_delay > 0 = TRUE` and `arr_delay > 0 = FALSE`. We can see that `r flights[!is.na(arr_delay) & !is.na(dep_delay), .N, .(dep_delay>0, arr_delay>0)][, N[4L]]` flights started late but arrived early (or on time).
520
521* Note that we did not provide any names to `by-expression`. Therefore, names have been automatically assigned in the result. As with `j`, you can name these expressions as you would elements of any `list`, e.g. `DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]`.
522
523* You can provide other columns along with expressions, for example: `DT[, .N, by = .(a, b>0)]`.
524
525### e) Multiple columns in `j` - `.SD`
526
527#### -- Do we have to compute `mean()` for each column individually?
528
529It is of course not practical to have to type `mean(myCol)` for every column one by one. What if you had 100 columns to average `mean()`?
530
531How can we do this efficiently, concisely? To get there, refresh on [this tip](#tip-1) - *"As long as the `j`-expression returns a `list`, each element of the `list` will be converted to a column in the resulting `data.table`"*. Suppose we can refer to the *data subset for each group* as a variable *while grouping*, then we can loop through all the columns of that variable using the already- or soon-to-be-familiar base function `lapply()`. No new names to learn specific to `data.table`.
532
533#### Special symbol `.SD`: {.bs-callout .bs-callout-info #special-SD}
534
535`data.table` provides a *special* symbol, called `.SD`. It stands for **S**ubset of **D**ata. It by itself is a `data.table` that holds the data for *the current group* defined using `by`.
536
537Recall that a `data.table` is internally a `list` as well with all its columns of equal length.
538
539Let's use the [`data.table` `DT` from before](#what-is-datatable-1a) to get a glimpse of what `.SD` looks like.
540
541```{r}
542DT
543
544DT[, print(.SD), by = ID]
545```
546
547#### {.bs-callout .bs-callout-info}
548
549* `.SD` contains all the columns *except the grouping columns* by default.
550
551* It is also generated by preserving the original order - data corresponding to `ID = "b"`, then `ID = "a"`, and then `ID = "c"`.
552
553To compute on (multiple) columns, we can then simply use the base R function `lapply()`.
554
555```{r}
556DT[, lapply(.SD, mean), by = ID]
557```
558
559#### {.bs-callout .bs-callout-info}
560
561* `.SD` holds the rows corresponding to columns `a`, `b` and `c` for that group. We compute the `mean()` on each of these columns using the already-familiar base function `lapply()`.
562
563* Each group returns a list of three elements containing the mean value which will become the columns of the resulting `data.table`.
564
565* Since `lapply()` returns a `list`, so there is no need to wrap it with an additional `.()` (if necessary, refer to [this tip](#tip-1)).
566
567We are almost there. There is one little thing left to address. In our `flights` `data.table`, we only wanted to calculate the `mean()` of two columns `arr_delay` and `dep_delay`. But `.SD` would contain all the columns other than the grouping variables by default.
568
569#### -- How can we specify just the columns we would like to compute the `mean()` on?
570
571#### .SDcols {.bs-callout .bs-callout-info}
572
573Using the argument `.SDcols`. It accepts either column names or column indices. For example, `.SDcols = c("arr_delay", "dep_delay")` ensures that `.SD` contains only these two columns for each group.
574
575Similar to [part g)](#refer_j), you can also provide the columns to remove instead of columns to keep using `-` or `!` sign as well as select consecutive columns as `colA:colB` and deselect consecutive columns as `!(colA:colB)` or `-(colA:colB)`.
576
577Now let us try to use `.SD` along with `.SDcols` to get the `mean()` of `arr_delay` and `dep_delay` columns grouped by `origin`, `dest` and `month`.
578
579```{r}
580flights[carrier == "AA",                       ## Only on trips with carrier "AA"
581        lapply(.SD, mean),                     ## compute the mean
582        by = .(origin, dest, month),           ## for every 'origin,dest,month'
583        .SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols
584```
585
586### f) Subset `.SD` for each group:
587
588#### -- How can we return the first two rows for each `month`?
589
590```{r}
591ans <- flights[, head(.SD, 2), by = month]
592head(ans)
593```
594
595#### {.bs-callout .bs-callout-info}
596
597* `.SD` is a `data.table` that holds all the rows for *that group*. We simply subset the first two rows as we have seen [here](#subset-rows-integer) already.
598
599* For each group, `head(.SD, 2)` returns the first two rows as a `data.table`, which is also a `list`, so we do not have to wrap it with `.()`.
600
601### g) Why keep `j` so flexible?
602
603So that we have a consistent syntax and keep using already existing (and familiar) base functions instead of learning new functions. To illustrate, let us use the `data.table` `DT` that we created at the very beginning under [What is a data.table?](#what-is-datatable-1a) section.
604
605#### -- How can we concatenate columns `a` and `b` for each group in `ID`?
606
607```{r}
608DT[, .(val = c(a,b)), by = ID]
609```
610
611#### {.bs-callout .bs-callout-info}
612
613* That's it. There is no special syntax required. All we need to know is the base function `c()` which concatenates vectors and [the tip from before](#tip-1).
614
615#### -- What if we would like to have all the values of column `a` and `b` concatenated, but returned as a list column?
616
617```{r}
618DT[, .(val = list(c(a,b))), by = ID]
619```
620
621#### {.bs-callout .bs-callout-info}
622
623* Here, we first concatenate the values with `c(a,b)` for each group, and wrap that with `list()`. So for each group, we return a list of all concatenated values.
624
625* Note those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others.
626
627Once you start internalising usage in `j`, you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of `print()`.
628
629For example:
630
631```{r}
632## (1) look at the difference between
633DT[, print(c(a,b)), by = ID]
634
635## (2) and
636DT[, print(list(c(a,b))), by = ID]
637```
638
639In (1), for each group, a vector is returned, with length = 6,4,2 here. However (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore (1) results in a length of ` 6+4+2 = `r 6+4+2``, whereas (2) returns `1+1+1=`r 1+1+1``.
640
641## Summary
642
643The general form of `data.table` syntax is:
644
645```{r eval = FALSE}
646DT[i, j, by]
647```
648
649We have seen so far that,
650
651#### Using `i`: {.bs-callout .bs-callout-info}
652
653* We can subset rows similar to a `data.frame`- except you don't have to use `DT$` repetitively since columns within the frame of a `data.table` are seen as if they are *variables*.
654
655* We can also sort a `data.table` using `order()`, which internally uses `data.table`'s fast order for performance.
656
657We can do much more in `i` by keying a `data.table`, which allows blazing fast subsets and joins. We will see this in the *"Keys and fast binary search based subsets"* and *"Joins and rolling joins"* vignette.
658
659#### Using `j`: {.bs-callout .bs-callout-info}
660
6611. Select columns the `data.table` way: `DT[, .(colA, colB)]`.
662
6632. Select columns the `data.frame` way: `DT[, c("colA", "colB")]`.
664
6653. Compute on columns: `DT[, .(sum(colA), mean(colB))]`.
666
6674. Provide names if necessary: `DT[, .(sA =sum(colA), mB = mean(colB))]`.
668
6695. Combine with `i`: `DT[colA > value, sum(colB)]`.
670
671#### Using `by`: {.bs-callout .bs-callout-info}
672
673* Using `by`, we can group by columns by specifying a *list of columns* or a *character vector of column names* or even *expressions*. The flexibility of `j`, combined with `by` and `i` makes for a very powerful syntax.
674
675* `by` can handle multiple columns and also *expressions*.
676
677* We can `keyby` grouping columns to automatically sort the grouped result.
678
679* We can use `.SD` and `.SDcols` in `j` to operate on multiple columns using already familiar base functions. Here are some examples:
680
681    1. `DT[, lapply(.SD, fun), by = ..., .SDcols = ...]` - applies `fun` to all columns specified in `.SDcols` while grouping by the columns specified in `by`.
682
683    2. `DT[, head(.SD, 2), by = ...]` - return the first two rows for each group.
684
685    3. `DT[col > val, head(.SD, 1), by = ...]` - combine `i` along with `j` and `by`.
686
687#### And remember the tip: {.bs-callout .bs-callout-warning}
688
689As long as `j` returns a `list`, each element of the list will become a column in the resulting `data.table`.
690
691We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the next vignette.
692
693***
694
695