---
title: "Conversion semantics"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Conversion semantics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(haven)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

There are some differences between the way that R, SAS, SPSS, and Stata represent labelled data and missing values. While SAS, SPSS, and Stata share some obvious similarities, R is a little different. This vignette explores the differences, and shows you how haven bridges the gap.

## Value labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

* SPSS and SAS can label numeric and character values, not just
  integer values.

* The value labels do not need to be exhaustive. It is common to label the
  special missing values (e.g. `.D` = did not respond, `.N` =
  not applicable), while leaving other values as is.
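
The second difference is easy to see in R itself. The following is a minimal sketch, assuming a hypothetical set of survey responses in which only the code `99` has a label; the name `responses` and the code `99` are illustrative and do not come from any particular data set.

```{r}
library(haven)

# Hypothetical responses: 1-5 are ratings, 99 is a "refused" code.
responses <- c(1, 3, 5, 99, 2)

# A factor needs a level for every value; anything not listed becomes NA.
f <- factor(responses, levels = 99, labels = "Refused")
f

# A labelled vector can label just the one code and keep the rest as is.
l <- labelled(responses, c(Refused = 99))
l
```

The factor discards the four unlabelled ratings, while the labelled vector keeps all five values and attaches a label to one of them.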

Value labels in SAS are a little different again. In SAS, labels are just a special case of general formats. Formats include currencies and dates, but user-defined formats just assign labels to individual values (including special missing values). Formats have names and exist independently of the variables they are associated with. You create a named format with `PROC FORMAT` and then associate it with variables in a `DATA` step (the names of character formats always start with `$`).

### `labelled()`

To allow you to import labelled vectors into R, haven provides the S3 labelled class, created with `labelled()`. This class allows you to associate arbitrary labels with numeric or character vectors:

```{r}
x1 <- labelled(
  sample(1:5),
  c(Good = 1, Bad = 5)
)
x1

x2 <- labelled(
  c("M", "F", "F", "F", "M"),
  c(Male = "M", Female = "F")
)
x2
```

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate data structure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:

```{r}
as_factor(x1)
zap_labels(x1)

as_factor(x2)
zap_labels(x2)
```

See the documentation for `as_factor()` for more options to control exactly what the factor uses for levels.
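
As a brief illustration of those options, the sketch below uses the `levels` argument of `as_factor()` on a small hypothetical vector (`rating` is an illustrative name). It is not an exhaustive tour; the help page remains the reference for the exact behaviour of each choice.

```{r}
library(haven)

# Hypothetical ratings where only the extremes carry labels.
rating <- labelled(c(1, 5, 3), c(Good = 1, Bad = 5))

# Default: use the label where one exists, otherwise the value itself.
as_factor(rating)

# Use only the underlying values as levels.
as_factor(rating, levels = "values")

# Combine values and labels in each level.
as_factor(rating, levels = "both")
```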

Both `as_factor()` and `zap_labels()` have data frame methods if you want to apply the same strategy to every column in a data frame:

```{r}
# tibble() replaces the deprecated data_frame() constructor
df <- tibble::tibble(x1, x2, z = 1:5)
df

zap_labels(df)
as_factor(df)
```

## Missing values

All three tools provide a global "system missing value" which is displayed as `.`. This is roughly equivalent to R's `NA`, although neither Stata nor SAS propagates missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. `-inf`), and Stata treats it as the largest possible number (i.e. `inf`).
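
For contrast, here is how base R behaves in the same situation. This is a small sketch using only base R; the vector `v` is a made-up example.

```{r}
# In R, comparing with NA yields NA rather than TRUE or FALSE.
NA > 1
NA < 1

# A filter such as v > 1 therefore never silently selects a missing value.
v <- c(0, 2, NA)
v > 1
which(v > 1)
```

A condition like `v > 1` would include the missing value in Stata and `v < 1` would include it in SAS, so filters ported from either tool may need an explicit `is.na()` check to reproduce the same rows in R.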

Each tool also provides a mechanism for recording multiple types of missingness:

* Stata has "extended" missing values, `.A` through `.Z`.

* SAS has "special" missing values, `.A` through `.Z` plus `._`.

* SPSS has per-column "user" missing values. Each column can declare
  up to three distinct values or a range of values (plus one distinct
  value) that should be treated as missing.

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

* For SAS and Stata, haven provides "tagged" missing values which extend R's
  regular `NA` to add a single character label.

* For SPSS, haven provides a subclass of `labelled` that also provides
  user-defined values and ranges.

### Tagged missing values

To support Stata's extended and SAS's special missing values, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating tagged NAs is a little clunky because generally they'll be created by haven for you. But you can create your own with `tagged_na()`:

```{r}
x <- c(1:3, tagged_na("a", "z"), 3:1)
x
```

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use `print_tagged_na()`:

```{r}
print_tagged_na(x)
```

To test if a value is a tagged NA, use `is_tagged_na()`, and to extract the value of the tag, use `na_tag()`:

```{r}
is_tagged_na(x)
is_tagged_na(x, "a")

na_tag(x)
```
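
Because the tag is stored inside a floating point NA, a tagged NA is always a double. The sketch below checks a few consequences of that; `na_a` is just an illustrative name.

```{r}
library(haven)

na_a <- tagged_na("a")

# The tag lives in a double NA, so the value is still missing to R.
typeof(na_a)
is.na(na_a)
is.na(na_a + 1)

# A regular double NA simply has no tag.
is_tagged_na(NA_real_)
na_tag(NA_real_)
```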

My expectation is that tagged missings are most often used in conjunction with labels (described above), so labelled vectors print the tags for you, and `as_factor()` knows how to relabel:

```{r}
y <- labelled(x, c("Not home" = tagged_na("a"), "Refused" = tagged_na("z")))
y

as_factor(y)
```
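
If you only need to know how often each kind of missingness occurs, one possible approach is to tabulate the tags directly. This is a sketch on a hypothetical vector (`visits` is an illustrative name), and it assumes that `na_tag()` can read the tags through the labelled class because the underlying data is still a double vector.

```{r}
library(haven)

# Hypothetical data with two "Not home" and one "Refused" response.
visits <- labelled(
  c(1:3, tagged_na("a", "z", "a")),
  c("Not home" = tagged_na("a"), "Refused" = tagged_na("z"))
)

# Non-missing values have a tag of NA, which table() drops by default.
tags <- na_tag(visits)
table(tags)
```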

### User defined missing values

SPSS's user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides `labelled_spss()` as a subclass of `labelled()` to model these additional user-defined missings.

```{r}
# The argument is spelled `na_values`; `na_value` only works via partial matching
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

x1
x2
```

These objects are somewhat dangerous to work with in R because most R functions don't know those values are missing:

```{r}
mean(x1)
```

Because of that danger, the default behaviour of `read_spss()` is to return regular labelled objects where user-defined missing values have been converted to `NA`s. To get `read_spss()` to return `labelled_spss()` objects, you'll need to set `user_na = TRUE`.
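
The difference between the two settings can be sketched with a round trip through a temporary file. The data frame `survey` and the code `99` are hypothetical, and the example assumes that `write_sav()` stores the user-defined missing values declared on a `labelled_spss()` column.

```{r}
library(haven)

# Hypothetical data: 99 is declared as a user-defined missing value.
survey <- tibble::tibble(
  score = labelled_spss(c(1, 2, 99), c(Missing = 99), na_values = 99)
)

path <- tempfile(fileext = ".sav")
write_sav(survey, path)

# Default: user-defined missings arrive as NA.
default_read <- read_spss(path)
default_read$score

# user_na = TRUE: the 99 is kept and flagged as missing.
user_na_read <- read_spss(path, user_na = TRUE)
user_na_read$score
```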

I've defined an `is.na()` method so you can find them yourself:

```{r}
is.na(x1)
```

And the presence of that method does mean many functions with an `na.rm` argument will work correctly:

```{r}
mean(x1, na.rm = TRUE)
```

But generally you should either convert to a factor, convert to regular missing values, or strip all the labels:

```{r}
as_factor(x1)
zap_missing(x1)
zap_labels(x1)
```