1\name{data.frame.create.modify.check}
2\alias{data.frame.create.modify.check}
3\title{
4  Tips for Creating, Modifying, and Checking Data Frames
5}
6\description{
7  This help file contains a template for importing data to create an R
8  data frame, correcting some problems resulting from the import and
9  making the data frame be stored more efficiently, modifying the data
10  frame (including better annotating it and changing the names of some
11  of its variables), and checking and inspecting the data frame for
12  reasonableness of the values of its variables and to describe patterns
13  of missing data.  Various built-in functions and functions in the
14  Hmisc library are used.  At the end some methods for creating data
15  frames \dQuote{from scratch} within \R are presented.
16
17
18  The examples below attempt to clarify the separation of operations
19  that are done on a data frame as a whole, operations that are done on
20  a small subset of its variables without attaching the whole data
21  frame, and operations that are done on many variables after attaching
22  the data frame in search position one.  It also tries to clarify that
23  for analyzing several separate variables using \R commands that do not
24  support a \code{data} argument, it is helpful to attach the data frame
25  in a search position later than position one.
26
27  It is often useful to create, modify, and process datasets in the
28  following order.
29  \enumerate{
30    \item{
31      Import external data into a data frame (if the raw data do not
32      contain column names, provide these during the import if possible)
33    }
34    \item{
35      Make global changes to a data frame (e.g., changing variable
36      names)
37    }
38    \item{
39      Change attributes or values of variables within a data frame
40    }
41    \item{
42      Do analyses involving the whole data frame (without attaching it)\cr
43      (Data frame still in .Data)
44    }
45    \item{
46      Do analyses of individual variables (after attaching the data
47      frame in search position two or later)
48    }
49  }
50}
51\details{
52  The examples below use the \code{FEV} dataset from
53  \cite{Rosner 1995}. Almost any dataset would do.  The jcetable data
54  are taken from \cite{Galobardes, etal.}
55
56  Presently, giving a variable the \code{"units"} attribute (using the
57  \pkg{Hmisc} \code{\link{units}} function) only benefits the
58  \pkg{Hmisc} \code{\link{describe}} function and the \pkg{rms}
59  library's version of the \code{link[rms]{Surv}} function.  Variables
60  labels defined with the Hmisc \code{\link{label}} function are used by
61  \code{\link{describe}}, \code{\link{summary.formula}},  and many of
62  the plotting functions in \pkg{Hmisc} and \pkg{rms}.
63}
64\references{
65  Alzola CF, Harrell FE (2001):
66  \emph{An Introduction to S and the Hmisc and Design Libraries.}
67  Chapters 3 and 4,
68  \url{http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf}.
69
70  Galobardes, et al. (1998), \emph{J Clin Epi} 51:875-881.
71
72  Rosner B (1995): \emph{Fundamentals of Biostatistics, 4th Edition.  }
73  New York: Duxbury Press.
74}
75\seealso{
76  \code{\link{scan}}, \code{\link{read.table}},
77  \code{\link{cleanup.import}}, \code{\link{sas.get}},
78  \code{\link{data.frame}}, \code{\link{attach}}, \code{\link{detach}},
79  \code{\link{describe}}, \code{\link{datadensity}},
80  \code{\link{plot.data.frame}}, \code{\link{hist.data.frame}},
81  \code{\link{naclus}}, \code{\link{factor}}, \code{\link{label}},
82  \code{\link{units}}, \code{\link{names}}, \code{\link{expand.grid}},
83  \code{\link{summary.formula}}, \code{\link{summary.data.frame}},
84  \code{\link{casefold}}, \code{\link{edit}}, \code{\link{page}},
85  \code{\link{plot.data.frame}}, \code{\link{Cs}},
86  \code{\link{combine.levels}},\code{\link{upData}}
87}
88\examples{
89\dontrun{
90# First, we do steps that create or manipulate the data
91# frame in its entirety.  For S-Plus, these are done with
92# .Data in search position one (the default at the
93# start of the session).
94#
95# -----------------------------------------------------------------------
96# Step 1: Create initial draft of data frame
97#
98# We usually begin by importing a dataset from
99# # another application.  ASCII files may be imported
100# using the scan and read.table functions.  SAS
101# datasets may be imported using the Hmisc sas.get
102# function (which will carry more attributes from
103# SAS than using File \dots  Import) from the GUI
104# menus.  But for most applications (especially
105# Excel), File \dots Import will suffice.  If using
106# the GUI, it is often best to provide variable
107# names during the import process, using the Options
108# tab, rather than renaming all fields later Of
109# course, if the data to be imported already have
110# field names (e.g., in Excel), let S use those
111# automatically.  If using S-Plus, you can use a
112# command to execute File \dots  Import, e.g.:
113
114
115import.data(FileName = "/windows/temp/fev.asc",
116            FileType = "ASCII", DataFrame = "FEV")
117
118
119# Here we name the new data frame FEV rather than
120# fev, because we wanted to distinguish a variable
121# in the data frame named fev from the data frame
122# name.  For S-Plus the command will look
123# instead like the following:
124
125
126FEV <- importData("/tmp/fev.asc")
127
128
129
130
131# -----------------------------------------------------------------------
132# Step 2: Clean up data frame / make it be more
133# efficiently stored
134#
135# Unless using sas.get to import your dataset
136# (sas.get already stores data efficiently), it is
137# usually a good idea to run the data frame through
138# the Hmisc cleanup.import function to change
139# numeric variables that are always whole numbers to
140# be stored as integers, the remaining numerics to
141# single precision, strange values from Excel to
142# NAs, and character variables that always contain
143# legal numeric values to numeric variables.
144# cleanup.import typically halves the size of the
145# data frame.  If you do not specify any parameters
146# to cleanup.import, the function assumes that no
147# numeric variable needs more than 7 significant
148# digits of precision, so all non-integer-valued
149# variables will be converted to single precision.
150
151
152FEV <- cleanup.import(FEV)
153
154
155
156
157# -----------------------------------------------------------------------
158# Step 3: Make global changes to the data frame
159#
160# A data frame has attributes that are "external" to
161# its variables.  There are the vector of its
162# variable names ("names" attribute), the
163# observation identifiers ("row.names"), and the
164# "class" (whose value is "data.frame").  The
165# "names" attribute is the one most commonly in need
166# of modification.  If we had wanted to change all
167# the variable names to lower case, we could have
168# specified lowernames=TRUE to the cleanup.import
169# invocation above, or type
170
171
172names(FEV) <- casefold(names(FEV))
173
174
175# The upData function can also be used to change
176# variable names in two ways (see below).
177# To change names in a non-systematic way we use
178# other options.  Under Windows/NT the most
179# straigtforward approach is to change the names
180# interactively.  Click on the data frame in the
181# left panel of the Object Browser, then in the
182# right pane click twice (slowly) on a variable.
183# Use the left arrow and other keys to edit the
184# name.  Click outside that name field to commit the
185# change.  You can also rename columns while in a
186# Data Sheet.  To instead use programming commands
187# to change names, use something like:
188
189
190names(FEV)[6] <- 'smoke'   # assumes you know the positions!
191names(FEV)[names(FEV)=='smoking'] <- 'smoke'
192names(FEV) <- edit(names(FEV))
193
194
195# The last example is useful if you are changing
196# many names.  But none of the interactive
197# approaches such as edit() are handy if you will be
198# re-importing the dataset after it is updated in
199# its original application.  This problem can be
200# addressed by saving the new names in a permanent
201# vector in .Data:
202
203
204new.names <- names(FEV)
205
206
207# Then if the data are re-imported, you can type
208
209
210names(FEV) <- new.names
211
212
213# to rename the variables.
214
215
216
217
218# -----------------------------------------------------------------------
219# Step 4: Delete unneeded variables
220#
221# To delete some of the variables, you can
222# right-click on variable names in the Object
223# Browser's right pane, then select Delete.  You can
224# also set variables to have NULL values, which
225# causes the system to delete them.  We don't need
226# to delete any variables from FEV but suppose we
227# did need to delete some from mydframe.
228
229
230mydframe$x1 <- NULL
231mydframe$x2 <- NULL
232mydframe[c('age','sex')] <- NULL   # delete 2 variables
233mydframe[Cs(age,sex)]    <- NULL   # same thing
234
235
236# The last example uses the Hmisc short-cut quoting
237# function Cs.  See also the drop parameter to upData.
238
239
240
241
242# -----------------------------------------------------------------------
243# Step 5: Make changes to individual variables
244#         within the data frame
245#
246# After importing data, the resulting variables are
247# seldom self - documenting, so we commonly need to
248# change or enhance attributes of individual
249# variables within the data frame.
250#
251# If you are only changing a few variables, it is
252# efficient to change them directly without
253# attaching the entire data frame.
254
255
256FEV$sex   <- factor(FEV$sex,   0:1, c('female','male'))
257FEV$smoke <- factor(FEV$smoke, 0:1,
258                    c('non-current smoker','current smoker'))
259units(FEV$age)    <- 'years'
260units(FEV$fev)    <- 'L'
261label(FEV$fev)    <- 'Forced Expiratory Volume'
262units(FEV$height) <- 'inches'
263
264
265# When changing more than one or two variables it is
266# more convenient change the data frame using the
267# Hmisc upData function.
268
269
270FEV2 <- upData(FEV,
271  rename=c(smoking='smoke'),
272  # omit if renamed above
273  drop=c('var1','var2'),
274  levels=list(sex  =list(female=0,male=1),
275              smoke=list('non-current smoker'=0,
276                         'current smoker'=1)),
277  units=list(age='years', fev='L', height='inches'),
278  labels=list(fev='Forced Expiratory Volume'))
279
280
281# An alternative to levels=list(\dots) is for example
282# upData(FEV, sex=factor(sex,0:1,c('female','male'))).
283#
284# Note that we saved the changed data frame into a
285# new data frame FEV2.  If we were confident of the
286# correctness of our changes we could have stored
287# the new data frame on top of the old one, under
288# the original name FEV.
289
290
291# -----------------------------------------------------------------------
292# Step 6:  Check the data frame
293#
294# The Hmisc describe function is perhaps the first
295# function that should be used on the new data
296# frame.  It provides documentation of all the
297# variables and the frequency tabulation, counts of
298# NAs,  and 5 largest and smallest values are
299# helpful in detecting data errors.  Typing
300# describe(FEV) will write the results to the
301# current output window.  To put the results in a
302# new window that can persist, even upon exiting
303# S, we use the page function.  The describe
304# output can be minimized to an icon but kept ready
305# for guiding later steps of the analysis.
306
307
308page(describe(FEV2), multi=TRUE)
309# multi=TRUE allows that window to persist while
310# control is returned to other windows
311
312
313# The new data frame is OK.  Store it on top of the
314# old FEV and then use the graphical user interface
315# to delete FEV2 (click on it and hit the Delete
316# key) or type rm(FEV2) after the next statement.
317
318
319FEV <- FEV2
320
321
322# Next, we can use a variety of other functions to
323# check and describe all of the variables.  As we
324# are analyzing all or almost all of the variables,
325# this is best done without attaching the data
326# frame.  Note that plot.data.frame plots inverted
327# CDFs for continuous variables and dot plots
328# showing frequency distributions of categorical
329# ones.
330
331
332summary(FEV)
333# basic summary function (summary.data.frame)
334
335
336plot(FEV)                # plot.data.frame
337datadensity(FEV)
338# rug plots and freq. bar charts for all var.
339
340
341hist.data.frame(FEV)
342# for variables having > 2 values
343
344
345by(FEV, FEV$smoke, summary)
346# use basic summary function with stratification
347
348
349
350
351# -----------------------------------------------------------------------
352# Step 7:  Do detailed analyses involving individual
353#          variables
354#
355# Analyses based on the formula language can use
356# data= so attaching the data frame may not be
357# required.  This saves memory.  Here we use the
358# Hmisc summary.formula function to compute 5
359# statistics on height, stratified separately by age
360# quartile and by sex.
361
362
363options(width=80)
364summary(height ~ age + sex, data=FEV,
365        fun=function(y)c(smean.sd(y),
366                         smedian.hilow(y,conf.int=.5)))
367# This computes mean height, S.D., median, outer quartiles
368
369
370fit <- lm(height ~ age*sex, data=FEV)
371summary(fit)
372
373
374# For this analysis we could also have attached the
375# data frame in search position 2.  For other
376# analyses, it is mandatory to attach the data frame
377# unless FEV$ prefixes each variable name.
378# Important: DO NOT USE attach(FEV, 1) or
379# attach(FEV, pos=1, \dots) if you are only analyzing
380# and not changing the variables, unless you really
381# need to avoid conflicts with variables in search
382# position 1 that have the same names as the
383# variables in FEV.  Attaching into search position
384# 1 will cause S-Plus to be more of a memory hog.
385
386
387attach(FEV)
388# Use e.g. attach(FEV[,Cs(age,sex)]) if you only
389# want to analyze a small subset of the variables
390# Use e.g. attach(FEV[FEV$sex=='male',]) to
391# analyze a subset of the observations
392
393
394summary(height ~ age + sex,
395        fun=function(y)c(smean.sd(y),
396          smedian.hilow(y,conf.int=.5)))
397fit <- lm(height ~ age*sex)
398
399
400# Run generic summary function on height and fev,
401# stratified by sex
402by(data.frame(height,fev), sex, summary)
403
404
405# Cross-classify into 4 sex x smoke groups
406by(FEV, list(sex,smoke), summary)
407
408
409# Plot 5 quantiles
410s <- summary(fev ~ age + sex + height,
411              fun=function(y)quantile(y,c(.1,.25,.5,.75,.9)))
412
413
414plot(s, which=1:5, pch=c(1,2,15,2,1), #pch=c('=','[','o',']','='),
415     main='A Discovery', xlab='FEV')
416
417
418# Use the nonparametric bootstrap to compute a
419# 0.95 confidence interval for the population mean fev
420smean.cl.boot(fev)    # in Hmisc
421
422
423# Use the Statistics \dots Compare Samples \dots One Sample
424# keys to get a normal-theory-based C.I.  Then do it
425# more manually.  The following method assumes that
426# there are no NAs in fev
427
428
429sd <- sqrt(var(fev))
430xbar <- mean(fev)
431xbar
432sd
433n <- length(fev)
434qt(.975,n-1)
435# prints 0.975 critical value of t dist. with n-1 d.f.
436
437
438xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1)
439# prints confidence limits
440
441
442# Fit a linear model
443# fit <- lm(fev ~ other variables \dots)
444
445
446detach()
447
448
449# The last command is only needed if you want to
450# start operating on another data frame and you want
451# to get FEV out of the way.
452
453
454
455
456# -----------------------------------------------------------------------
457# Creating data frames from scratch
458#
459# Data frames can be created from within S.  To
460# create a small data frame containing ordinary
461# data, you can use something like
462
463
464dframe <- data.frame(age=c(10,20,30),
465                     sex=c('male','female','male'))
466
467
468# You can also create a data frame using the Data
469# Sheet.  Create an empty data frame with the
470# correct variable names and types, then edit in the
471# data.
472
473
474dd <- data.frame(age=numeric(0),sex=character(0))
475
476
477# The sex variable will be stored as a factor, and
478# levels will be automatically added to it as you
479# define new values for sex in the Data Sheet's sex
480# column.
481#
482# When the data frame you need to create is defined
483# by systematically varying variables (e.g., all
484# possible combinations of values of each variable),
485# the expand.grid function is useful for quickly
486# creating the data.  Then you can add
487# non-systematically-varying variables to the object
488# created by expand.grid, using programming
489# statements or editing the Data Sheet.  This
490# process is useful for creating a data frame
491# representing all the values in a printed table.
492# In what follows we create a data frame
493# representing the combinations of values from an 8
494# x 2 x 2 x 2 (event x method x sex x what) table,
495# and add a non-systematic variable percent to the
496# data.
497
498
499jcetable <- expand.grid(
500 event=c('Wheezing at any time',
501         'Wheezing and breathless',
502         'Wheezing without a cold',
503         'Waking with tightness in the chest',
504         'Waking with shortness of breath',
505         'Waking with an attack of cough',
506         'Attack of asthma',
507         'Use of medication'),
508 method=c('Mail','Telephone'),
509 sex=c('Male','Female'),
510 what=c('Sensitivity','Specificity'))
511
512
513jcetable$percent <-
514c(756,618,706,422,356,578,289,333,
515  576,421,789,273,273,212,212,212,
516  613,763,713,403,377,541,290,226,
517  613,684,632,290,387,613,258,129,
518  656,597,438,780,732,679,938,919,
519  714,600,494,877,850,703,963,987,
520  755,420,480,794,779,647,956,941,
521  766,423,500,833,833,604,955,986) / 10
522
523
524# In jcetable, event varies most rapidly, then
525# method, then sex, and what.
526}
527}
528\keyword{data}
529\keyword{manip}
530\keyword{programming}
531\keyword{interface}
532\keyword{htest}
533\concept{overview}
534