\documentclass{article}
\usepackage{url}
%\VignetteIndexEntry{Estimates in subpopulations}
\usepackage{Sweave}
\author{Thomas Lumley}
\title{Estimates in subpopulations.}

\begin{document}
\maketitle

Estimating a mean or total in a subpopulation (domain) from a survey, e.g., the
mean blood pressure in women, is not done simply by taking the subset
of data in that subpopulation and pretending it is a new survey. This
approach would give correct point estimates but incorrect standard
errors.

The standard way to derive domain means is as ratio estimators. I
think it is easier to derive them as regression coefficients. These
derivations are not important for R users, since subset operations on
survey design objects automatically do the necessary adjustments, but
they may be of interest. The various ways of constructing domain mean
estimators are useful in quality control for the survey package, and
some of the examples here are taken from
\texttt{survey/tests/domain.R}.


Suppose that in the artificial \texttt{fpc} data set we want to
estimate the mean of \texttt{x} when \texttt{x>4}.
<<>>=
library(survey)
data(fpc)
# Stratified design; nest=TRUE treats PSU ids as nested within strata
dfpc<-svydesign(id=~psuid,strata=~stratid,weights=~weight,data=fpc,nest=TRUE)
# Subset the design object, not the data frame
dsub<-subset(dfpc,x>4)
svymean(~x,design=dsub)
@

The \texttt{subset} function constructs a survey design object with
information about this subpopulation and \texttt{svymean} computes the
mean. The same operation can be done for a set of subpopulations with
\texttt{svyby}.
<<>>=
svyby(~x,~I(x>4),design=dfpc, svymean)
@

In a regression model with a binary covariate $Z$ and no intercept,
there are two coefficients that estimate the mean of the outcome
variable in the subpopulations with $Z=0$ and $Z=1$, so we can
construct the domain mean estimator by regression.
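
As a sketch of why this works, write $s$ for the sample, $w_i$ for the
sampling weight, and $x_i$ and $z_i$ for the observed values of the
outcome and of $Z$ (this notation is introduced here for illustration
only). The weighted least-squares coefficients solve
\[
(\hat\beta_0,\hat\beta_1)=\arg\min_{\beta_0,\beta_1}
\sum_{i\in s} w_i\left\{x_i-\beta_0(1-z_i)-\beta_1 z_i\right\}^2 .
\]
Because $z_i(1-z_i)=0$ for every observation, the two normal equations
separate, giving
\[
\hat\beta_1=\frac{\sum_{i\in s} w_i z_i x_i}{\sum_{i\in s} w_i z_i},
\qquad
\hat\beta_0=\frac{\sum_{i\in s} w_i (1-z_i) x_i}{\sum_{i\in s} w_i (1-z_i)} .
\]
Each coefficient is therefore a weighted mean over one domain, that
is, a ratio of two estimated totals, so the regression and ratio
derivations give the same point estimate.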
<<>>=
summary(svyglm(x~I(x>4)+0,design=dfpc))
@

Finally, the classical derivation of the domain mean estimator is as a
ratio of two estimated totals: the numerator variable is $X$ for
observations in the domain and 0 otherwise, and the denominator
variable is 1 for observations in the domain and 0 otherwise.
<<>>=
svyratio(~I(x*(x>4)),~as.numeric(x>4), dfpc)
@

The estimator is implemented by setting the sampling weight to zero
for observations not in the domain. For most survey design objects
this allows a reduction in memory use, since only the number of zero
weights in each sampling unit needs to be kept. For more complicated
survey designs, such as post-stratified designs, all the data are kept
and there is no reduction in memory use.


\subsection*{More complex examples}
Verifying that \texttt{svymean} agrees with the ratio and regression
derivations is particularly useful for more complicated designs where
published examples are less readily available.

This example shows calibration (GREG) estimators of domain means for
the California Academic Performance Index (API).
<<>>=
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
# Calibrate to population totals for school type and 1999 API
pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018)
gclus1 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069))

# The same domain mean three ways: subset, ratio, regression
svymean(~api00, subset(gclus1, comp.imp=="Yes"))
svyratio(~I(api00*(comp.imp=="Yes")), ~as.numeric(comp.imp=="Yes"), gclus1)
summary(svyglm(api00~comp.imp-1, gclus1))
@

Two-stage samples with full finite-population corrections:
<<>>=
data(mu284)
dmu284<-svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284)

svymean(~y1, subset(dmu284,y1>40))
svyratio(~I(y1*(y1>40)),~as.numeric(y1>40),dmu284)
summary(svyglm(y1~I(y1>40)+0,dmu284))
@

Stratified two-phase sampling of children with Wilms' tumor,
estimating relapse probability for those older than 3 years (36
months) at diagnosis:
<<>>=
library("survival")
data(nwtco)
# Phase two: all relapses, all of institution 2, and a random 10% of the rest
# (no seed is set, so results vary between runs)
nwtco$incc2<-as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1))))
dccs8<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~interaction(rel,stage,instit)),
                data=nwtco, subset=~incc2)
svymean(~rel, subset(dccs8,age>36))
svyratio(~I(rel*as.numeric(age>36)), ~as.numeric(age>36), dccs8)
summary(svyglm(rel~I(age>36)+0, dccs8))
@

\end{document}