1\name{data.frame.create.modify.check} 2\alias{data.frame.create.modify.check} 3\title{ 4 Tips for Creating, Modifying, and Checking Data Frames 5} 6\description{ 7 This help file contains a template for importing data to create an R 8 data frame, correcting some problems resulting from the import and 9 making the data frame be stored more efficiently, modifying the data 10 frame (including better annotating it and changing the names of some 11 of its variables), and checking and inspecting the data frame for 12 reasonableness of the values of its variables and to describe patterns 13 of missing data. Various built-in functions and functions in the 14 Hmisc library are used. At the end some methods for creating data 15 frames \dQuote{from scratch} within \R are presented. 16 17 18 The examples below attempt to clarify the separation of operations 19 that are done on a data frame as a whole, operations that are done on 20 a small subset of its variables without attaching the whole data 21 frame, and operations that are done on many variables after attaching 22 the data frame in search position one. It also tries to clarify that 23 for analyzing several separate variables using \R commands that do not 24 support a \code{data} argument, it is helpful to attach the data frame 25 in a search position later than position one. 26 27 It is often useful to create, modify, and process datasets in the 28 following order. 29 \enumerate{ 30 \item{ 31 Import external data into a data frame (if the raw data do not 32 contain column names, provide these during the import if possible) 33 } 34 \item{ 35 Make global changes to a data frame (e.g., changing variable 36 names) 37 } 38 \item{ 39 Change attributes or values of variables within a data frame 40 } 41 \item{ 42 Do analyses involving the whole data frame (without attaching it)\cr 43 (Data frame still in .Data) 44 } 45 \item{ 46 Do analyses of individual variables (after attaching the data 47 frame in search position two or later) 48 } 49 } 50} 51\details{ 52 The examples below use the \code{FEV} dataset from 53 \cite{Rosner 1995}. Almost any dataset would do. The jcetable data 54 are taken from \cite{Galobardes, etal.} 55 56 Presently, giving a variable the \code{"units"} attribute (using the 57 \pkg{Hmisc} \code{\link{units}} function) only benefits the 58 \pkg{Hmisc} \code{\link{describe}} function and the \pkg{rms} 59 library's version of the \code{link[rms]{Surv}} function. Variables 60 labels defined with the Hmisc \code{\link{label}} function are used by 61 \code{\link{describe}}, \code{\link{summary.formula}}, and many of 62 the plotting functions in \pkg{Hmisc} and \pkg{rms}. 63} 64\references{ 65 Alzola CF, Harrell FE (2001): 66 \emph{An Introduction to S and the Hmisc and Design Libraries.} 67 Chapters 3 and 4, 68 \url{http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf}. 69 70 Galobardes, et al. (1998), \emph{J Clin Epi} 51:875-881. 71 72 Rosner B (1995): \emph{Fundamentals of Biostatistics, 4th Edition. } 73 New York: Duxbury Press. 74} 75\seealso{ 76 \code{\link{scan}}, \code{\link{read.table}}, 77 \code{\link{cleanup.import}}, \code{\link{sas.get}}, 78 \code{\link{data.frame}}, \code{\link{attach}}, \code{\link{detach}}, 79 \code{\link{describe}}, \code{\link{datadensity}}, 80 \code{\link{plot.data.frame}}, \code{\link{hist.data.frame}}, 81 \code{\link{naclus}}, \code{\link{factor}}, \code{\link{label}}, 82 \code{\link{units}}, \code{\link{names}}, \code{\link{expand.grid}}, 83 \code{\link{summary.formula}}, \code{\link{summary.data.frame}}, 84 \code{\link{casefold}}, \code{\link{edit}}, \code{\link{page}}, 85 \code{\link{plot.data.frame}}, \code{\link{Cs}}, 86 \code{\link{combine.levels}},\code{\link{upData}} 87} 88\examples{ 89\dontrun{ 90# First, we do steps that create or manipulate the data 91# frame in its entirety. For S-Plus, these are done with 92# .Data in search position one (the default at the 93# start of the session). 94# 95# ----------------------------------------------------------------------- 96# Step 1: Create initial draft of data frame 97# 98# We usually begin by importing a dataset from 99# # another application. ASCII files may be imported 100# using the scan and read.table functions. SAS 101# datasets may be imported using the Hmisc sas.get 102# function (which will carry more attributes from 103# SAS than using File \dots Import) from the GUI 104# menus. But for most applications (especially 105# Excel), File \dots Import will suffice. If using 106# the GUI, it is often best to provide variable 107# names during the import process, using the Options 108# tab, rather than renaming all fields later Of 109# course, if the data to be imported already have 110# field names (e.g., in Excel), let S use those 111# automatically. If using S-Plus, you can use a 112# command to execute File \dots Import, e.g.: 113 114 115import.data(FileName = "/windows/temp/fev.asc", 116 FileType = "ASCII", DataFrame = "FEV") 117 118 119# Here we name the new data frame FEV rather than 120# fev, because we wanted to distinguish a variable 121# in the data frame named fev from the data frame 122# name. For S-Plus the command will look 123# instead like the following: 124 125 126FEV <- importData("/tmp/fev.asc") 127 128 129 130 131# ----------------------------------------------------------------------- 132# Step 2: Clean up data frame / make it be more 133# efficiently stored 134# 135# Unless using sas.get to import your dataset 136# (sas.get already stores data efficiently), it is 137# usually a good idea to run the data frame through 138# the Hmisc cleanup.import function to change 139# numeric variables that are always whole numbers to 140# be stored as integers, the remaining numerics to 141# single precision, strange values from Excel to 142# NAs, and character variables that always contain 143# legal numeric values to numeric variables. 144# cleanup.import typically halves the size of the 145# data frame. If you do not specify any parameters 146# to cleanup.import, the function assumes that no 147# numeric variable needs more than 7 significant 148# digits of precision, so all non-integer-valued 149# variables will be converted to single precision. 150 151 152FEV <- cleanup.import(FEV) 153 154 155 156 157# ----------------------------------------------------------------------- 158# Step 3: Make global changes to the data frame 159# 160# A data frame has attributes that are "external" to 161# its variables. There are the vector of its 162# variable names ("names" attribute), the 163# observation identifiers ("row.names"), and the 164# "class" (whose value is "data.frame"). The 165# "names" attribute is the one most commonly in need 166# of modification. If we had wanted to change all 167# the variable names to lower case, we could have 168# specified lowernames=TRUE to the cleanup.import 169# invocation above, or type 170 171 172names(FEV) <- casefold(names(FEV)) 173 174 175# The upData function can also be used to change 176# variable names in two ways (see below). 177# To change names in a non-systematic way we use 178# other options. Under Windows/NT the most 179# straigtforward approach is to change the names 180# interactively. Click on the data frame in the 181# left panel of the Object Browser, then in the 182# right pane click twice (slowly) on a variable. 183# Use the left arrow and other keys to edit the 184# name. Click outside that name field to commit the 185# change. You can also rename columns while in a 186# Data Sheet. To instead use programming commands 187# to change names, use something like: 188 189 190names(FEV)[6] <- 'smoke' # assumes you know the positions! 191names(FEV)[names(FEV)=='smoking'] <- 'smoke' 192names(FEV) <- edit(names(FEV)) 193 194 195# The last example is useful if you are changing 196# many names. But none of the interactive 197# approaches such as edit() are handy if you will be 198# re-importing the dataset after it is updated in 199# its original application. This problem can be 200# addressed by saving the new names in a permanent 201# vector in .Data: 202 203 204new.names <- names(FEV) 205 206 207# Then if the data are re-imported, you can type 208 209 210names(FEV) <- new.names 211 212 213# to rename the variables. 214 215 216 217 218# ----------------------------------------------------------------------- 219# Step 4: Delete unneeded variables 220# 221# To delete some of the variables, you can 222# right-click on variable names in the Object 223# Browser's right pane, then select Delete. You can 224# also set variables to have NULL values, which 225# causes the system to delete them. We don't need 226# to delete any variables from FEV but suppose we 227# did need to delete some from mydframe. 228 229 230mydframe$x1 <- NULL 231mydframe$x2 <- NULL 232mydframe[c('age','sex')] <- NULL # delete 2 variables 233mydframe[Cs(age,sex)] <- NULL # same thing 234 235 236# The last example uses the Hmisc short-cut quoting 237# function Cs. See also the drop parameter to upData. 238 239 240 241 242# ----------------------------------------------------------------------- 243# Step 5: Make changes to individual variables 244# within the data frame 245# 246# After importing data, the resulting variables are 247# seldom self - documenting, so we commonly need to 248# change or enhance attributes of individual 249# variables within the data frame. 250# 251# If you are only changing a few variables, it is 252# efficient to change them directly without 253# attaching the entire data frame. 254 255 256FEV$sex <- factor(FEV$sex, 0:1, c('female','male')) 257FEV$smoke <- factor(FEV$smoke, 0:1, 258 c('non-current smoker','current smoker')) 259units(FEV$age) <- 'years' 260units(FEV$fev) <- 'L' 261label(FEV$fev) <- 'Forced Expiratory Volume' 262units(FEV$height) <- 'inches' 263 264 265# When changing more than one or two variables it is 266# more convenient change the data frame using the 267# Hmisc upData function. 268 269 270FEV2 <- upData(FEV, 271 rename=c(smoking='smoke'), 272 # omit if renamed above 273 drop=c('var1','var2'), 274 levels=list(sex =list(female=0,male=1), 275 smoke=list('non-current smoker'=0, 276 'current smoker'=1)), 277 units=list(age='years', fev='L', height='inches'), 278 labels=list(fev='Forced Expiratory Volume')) 279 280 281# An alternative to levels=list(\dots) is for example 282# upData(FEV, sex=factor(sex,0:1,c('female','male'))). 283# 284# Note that we saved the changed data frame into a 285# new data frame FEV2. If we were confident of the 286# correctness of our changes we could have stored 287# the new data frame on top of the old one, under 288# the original name FEV. 289 290 291# ----------------------------------------------------------------------- 292# Step 6: Check the data frame 293# 294# The Hmisc describe function is perhaps the first 295# function that should be used on the new data 296# frame. It provides documentation of all the 297# variables and the frequency tabulation, counts of 298# NAs, and 5 largest and smallest values are 299# helpful in detecting data errors. Typing 300# describe(FEV) will write the results to the 301# current output window. To put the results in a 302# new window that can persist, even upon exiting 303# S, we use the page function. The describe 304# output can be minimized to an icon but kept ready 305# for guiding later steps of the analysis. 306 307 308page(describe(FEV2), multi=TRUE) 309# multi=TRUE allows that window to persist while 310# control is returned to other windows 311 312 313# The new data frame is OK. Store it on top of the 314# old FEV and then use the graphical user interface 315# to delete FEV2 (click on it and hit the Delete 316# key) or type rm(FEV2) after the next statement. 317 318 319FEV <- FEV2 320 321 322# Next, we can use a variety of other functions to 323# check and describe all of the variables. As we 324# are analyzing all or almost all of the variables, 325# this is best done without attaching the data 326# frame. Note that plot.data.frame plots inverted 327# CDFs for continuous variables and dot plots 328# showing frequency distributions of categorical 329# ones. 330 331 332summary(FEV) 333# basic summary function (summary.data.frame) 334 335 336plot(FEV) # plot.data.frame 337datadensity(FEV) 338# rug plots and freq. bar charts for all var. 339 340 341hist.data.frame(FEV) 342# for variables having > 2 values 343 344 345by(FEV, FEV$smoke, summary) 346# use basic summary function with stratification 347 348 349 350 351# ----------------------------------------------------------------------- 352# Step 7: Do detailed analyses involving individual 353# variables 354# 355# Analyses based on the formula language can use 356# data= so attaching the data frame may not be 357# required. This saves memory. Here we use the 358# Hmisc summary.formula function to compute 5 359# statistics on height, stratified separately by age 360# quartile and by sex. 361 362 363options(width=80) 364summary(height ~ age + sex, data=FEV, 365 fun=function(y)c(smean.sd(y), 366 smedian.hilow(y,conf.int=.5))) 367# This computes mean height, S.D., median, outer quartiles 368 369 370fit <- lm(height ~ age*sex, data=FEV) 371summary(fit) 372 373 374# For this analysis we could also have attached the 375# data frame in search position 2. For other 376# analyses, it is mandatory to attach the data frame 377# unless FEV$ prefixes each variable name. 378# Important: DO NOT USE attach(FEV, 1) or 379# attach(FEV, pos=1, \dots) if you are only analyzing 380# and not changing the variables, unless you really 381# need to avoid conflicts with variables in search 382# position 1 that have the same names as the 383# variables in FEV. Attaching into search position 384# 1 will cause S-Plus to be more of a memory hog. 385 386 387attach(FEV) 388# Use e.g. attach(FEV[,Cs(age,sex)]) if you only 389# want to analyze a small subset of the variables 390# Use e.g. attach(FEV[FEV$sex=='male',]) to 391# analyze a subset of the observations 392 393 394summary(height ~ age + sex, 395 fun=function(y)c(smean.sd(y), 396 smedian.hilow(y,conf.int=.5))) 397fit <- lm(height ~ age*sex) 398 399 400# Run generic summary function on height and fev, 401# stratified by sex 402by(data.frame(height,fev), sex, summary) 403 404 405# Cross-classify into 4 sex x smoke groups 406by(FEV, list(sex,smoke), summary) 407 408 409# Plot 5 quantiles 410s <- summary(fev ~ age + sex + height, 411 fun=function(y)quantile(y,c(.1,.25,.5,.75,.9))) 412 413 414plot(s, which=1:5, pch=c(1,2,15,2,1), #pch=c('=','[','o',']','='), 415 main='A Discovery', xlab='FEV') 416 417 418# Use the nonparametric bootstrap to compute a 419# 0.95 confidence interval for the population mean fev 420smean.cl.boot(fev) # in Hmisc 421 422 423# Use the Statistics \dots Compare Samples \dots One Sample 424# keys to get a normal-theory-based C.I. Then do it 425# more manually. The following method assumes that 426# there are no NAs in fev 427 428 429sd <- sqrt(var(fev)) 430xbar <- mean(fev) 431xbar 432sd 433n <- length(fev) 434qt(.975,n-1) 435# prints 0.975 critical value of t dist. with n-1 d.f. 436 437 438xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1) 439# prints confidence limits 440 441 442# Fit a linear model 443# fit <- lm(fev ~ other variables \dots) 444 445 446detach() 447 448 449# The last command is only needed if you want to 450# start operating on another data frame and you want 451# to get FEV out of the way. 452 453 454 455 456# ----------------------------------------------------------------------- 457# Creating data frames from scratch 458# 459# Data frames can be created from within S. To 460# create a small data frame containing ordinary 461# data, you can use something like 462 463 464dframe <- data.frame(age=c(10,20,30), 465 sex=c('male','female','male')) 466 467 468# You can also create a data frame using the Data 469# Sheet. Create an empty data frame with the 470# correct variable names and types, then edit in the 471# data. 472 473 474dd <- data.frame(age=numeric(0),sex=character(0)) 475 476 477# The sex variable will be stored as a factor, and 478# levels will be automatically added to it as you 479# define new values for sex in the Data Sheet's sex 480# column. 481# 482# When the data frame you need to create is defined 483# by systematically varying variables (e.g., all 484# possible combinations of values of each variable), 485# the expand.grid function is useful for quickly 486# creating the data. Then you can add 487# non-systematically-varying variables to the object 488# created by expand.grid, using programming 489# statements or editing the Data Sheet. This 490# process is useful for creating a data frame 491# representing all the values in a printed table. 492# In what follows we create a data frame 493# representing the combinations of values from an 8 494# x 2 x 2 x 2 (event x method x sex x what) table, 495# and add a non-systematic variable percent to the 496# data. 497 498 499jcetable <- expand.grid( 500 event=c('Wheezing at any time', 501 'Wheezing and breathless', 502 'Wheezing without a cold', 503 'Waking with tightness in the chest', 504 'Waking with shortness of breath', 505 'Waking with an attack of cough', 506 'Attack of asthma', 507 'Use of medication'), 508 method=c('Mail','Telephone'), 509 sex=c('Male','Female'), 510 what=c('Sensitivity','Specificity')) 511 512 513jcetable$percent <- 514c(756,618,706,422,356,578,289,333, 515 576,421,789,273,273,212,212,212, 516 613,763,713,403,377,541,290,226, 517 613,684,632,290,387,613,258,129, 518 656,597,438,780,732,679,938,919, 519 714,600,494,877,850,703,963,987, 520 755,420,480,794,779,647,956,941, 521 766,423,500,833,833,604,955,986) / 10 522 523 524# In jcetable, event varies most rapidly, then 525# method, then sex, and what. 526} 527} 528\keyword{data} 529\keyword{manip} 530\keyword{programming} 531\keyword{interface} 532\keyword{htest} 533\concept{overview} 534