\documentclass{article}
\usepackage{url}
%\VignetteIndexEntry{Estimates in subpopulations}
\usepackage{Sweave}
\author{Thomas Lumley}
\title{Estimates in subpopulations.}

\begin{document}
\maketitle

Estimating a mean or total in a subpopulation (domain) from a survey, e.g.\ the
mean blood pressure in women, is not done simply by taking the subset
of data in that subpopulation and pretending it is a new survey.  This
approach would give correct point estimates but incorrect standard
errors, because it treats the number of sampled units in the domain as
fixed rather than random and can discard sampling units that happen to
contain no domain members.

Domain means are usually derived as ratio estimators. I
think it is easier to derive them as regression coefficients.  These
derivations are not important for R users, since subset operations on
survey design objects automatically do the necessary adjustments, but
they may be of interest.  The various ways of constructing domain mean
estimators are useful in quality control for the survey package, and
some of the examples here are taken from
\texttt{survey/tests/domain.R}.


Suppose that in the artificial \texttt{fpc} data set we want to
estimate the mean of \texttt{x} among observations with \texttt{x>4}.
<<>>=
library(survey)
data(fpc)
# stratified design; PSU ids are nested within strata
dfpc<-svydesign(id=~psuid,strata=~stratid,weights=~weight,data=fpc,nest=TRUE)
# restrict the design object, not the data frame, to the domain
dsub<-subset(dfpc,x>4)
svymean(~x,design=dsub)
@

The \texttt{subset} function constructs a survey design object with
information about this subpopulation, and \texttt{svymean} computes the
mean. The same operation can be done for a set of subpopulations with
\texttt{svyby}.
<<>>=
svyby(~x,~I(x>4),design=dfpc, svymean)
@

In a regression model with no intercept and a binary covariate $Z$
coded as a factor (one indicator column per level),
the two coefficients estimate the mean of the outcome
variable in the subpopulations with $Z=0$ and $Z=1$, so we can
construct the domain mean estimator by regression.
<<>>=
summary(svyglm(x~I(x>4)+0,design=dfpc))
@
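
To see why the no-intercept regression reproduces the domain means, here
is a short sketch of the weighted least-squares calculation.  The
notation is introduced here for illustration only: $w_i$ is the sampling
weight, $x_i$ the outcome, and $z_i$ the indicator that is 1 for
observations in the domain and 0 otherwise.  The fit minimizes
\[
  \sum_i w_i \bigl(x_i - \beta_0 (1-z_i) - \beta_1 z_i\bigr)^2 ,
\]
with all sums taken over the whole sample.  Because $z_i(1-z_i)=0$, the
two normal equations separate,
\[
  \sum_i w_i (1-z_i)(x_i - \beta_0) = 0, \qquad
  \sum_i w_i z_i (x_i - \beta_1) = 0,
\]
and so
\[
  \hat\beta_0 = \frac{\sum_i w_i (1-z_i) x_i}{\sum_i w_i (1-z_i)}, \qquad
  \hat\beta_1 = \frac{\sum_i w_i z_i x_i}{\sum_i w_i z_i}.
\]
Each coefficient is the weighted mean of the outcome in one of the two
subpopulations, so $\hat\beta_1$ should agree with the \texttt{svymean}
estimate for the domain.
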

Finally, the classical derivation of the domain mean estimator is as a
ratio of two estimated totals: the numerator variable is $X$ for
observations in the domain and 0 otherwise, and the denominator
variable is 1 for observations in the domain and 0 otherwise.
<<>>=
svyratio(~I(x*(x>4)),~as.numeric(x>4), dfpc)
@
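
The ratio form also indicates where the standard error comes from.  The
following is a sketch of the usual Taylor linearization argument, not a
transcription of the package internals.  The notation is introduced here
for illustration: $w_i$ is the sampling weight, $z_i$ the domain
indicator, and $N_D$ and $\mu_D$ the domain size and domain mean.  The
estimator is
\[
  \hat\mu_D = \frac{\sum_i w_i z_i x_i}{\sum_i w_i z_i}
            = \frac{\sum_i w_i z_i x_i}{\hat N_D},
\]
and a first-order expansion gives
\[
  \hat\mu_D - \mu_D \approx \sum_i w_i e_i, \qquad
  e_i = \frac{z_i (x_i - \mu_D)}{N_D}.
\]
The variance of $\hat\mu_D$ is therefore approximated by the variance of
the estimated total of $e_i$ over the whole sample, with $\hat\mu_D$ and
$\hat N_D$ substituted for the unknown $\mu_D$ and $N_D$.  Observations
outside the domain stay in the calculation with $e_i=0$ instead of being
dropped, which is what separates this estimator from treating the subset
as a new survey.
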

The estimator is implemented by setting the sampling weight to zero
for observations not in the domain.  For most survey design objects
this allows a reduction in memory use, since only the number of zero
weights in each sampling unit needs to be kept. For more complicated
survey designs, such as post-stratified designs, all the data are kept
and there is no reduction in memory use.
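
To make the role of the zero weights concrete, consider a stratified
design whose variance is estimated by the usual with-replacement
(ultimate cluster) formula, ignoring finite-population corrections.
This is a sketch under those assumptions, with notation introduced here
for illustration; other designs use different formulas.  Write
$w_i^* = w_i z_i$ for the weight after zeroing observations outside the
domain, and for any variable $e_i$ let $t_{hj}$ be the sum of
$w_i^* e_i$ over the observations in PSU $j$ of stratum $h$.  Then
\[
  \widehat{\mathrm{var}}\Bigl(\sum_i w_i^* e_i\Bigr)
    = \sum_h \frac{n_h}{n_h-1} \sum_{j=1}^{n_h} (t_{hj} - \bar t_h)^2,
  \qquad
  \bar t_h = \frac{1}{n_h} \sum_{j=1}^{n_h} t_{hj},
\]
where $n_h$ is the number of sampled PSUs in stratum $h$.  A sampled PSU
with no observations in the domain has $t_{hj}=0$, but it still counts
towards $n_h$ and $\bar t_h$.  Dropping the out-of-domain rows from the
data would remove such PSUs and change $n_h$, which is one way the
standard errors from a subsetted data set go wrong.  For this formula
only the existence of those PSUs matters, not the data values of their
out-of-domain observations.
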


\subsection*{More complex examples}
Verifying that \texttt{svymean} agrees with the ratio and regression
derivations is particularly useful for more complicated designs where
published examples are less readily available.

This example shows calibration (GREG) estimators of domain means for
the California Academic Performance Index (API).
<<>>=
data(api)
# one-stage cluster sample of school districts
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
# population totals for the model-matrix columns of ~stype+api99
pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018)
gclus1 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069))

# the same domain mean three ways: subset, ratio, regression
svymean(~api00, subset(gclus1, comp.imp=="Yes"))
svyratio(~I(api00*(comp.imp=="Yes")), ~as.numeric(comp.imp=="Yes"), gclus1)
summary(svyglm(api00~comp.imp-1, gclus1))
@

The next example is a two-stage sample with finite-population
corrections at both stages:
<<>>=
data(mu284)
dmu284<-svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284)

svymean(~y1, subset(dmu284,y1>40))
svyratio(~I(y1*(y1>40)),~as.numeric(y1>40),dmu284)
summary(svyglm(y1~I(y1>40)+0,dmu284))
@

The last example is a stratified two-phase sample of children with
Wilms' tumor, estimating the relapse probability for those older than
3 years (36 months) at diagnosis:
<<>>=
library("survival")
data(nwtco)
# phase two: all relapses, everyone with instit==2, and a random 10% of the rest
nwtco$incc2<-as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1))))
dccs8<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~interaction(rel,stage,instit)),
                data=nwtco, subset=~incc2)
svymean(~rel, subset(dccs8,age>36))
svyratio(~I(rel*as.numeric(age>36)), ~as.numeric(age>36), dccs8)
summary(svyglm(rel~I(age>36)+0, dccs8))
@

\end{document}