Weighting and Variance Estimation
The complex
sample design of the California Health Interview Survey (CHIS) requires proper
weighting and variance calculation using specialized code when analyzing
data. CHIS employs a two‐stage geographically stratified random‐digit‐dial
(RDD) sample design. In the first stage, telephone numbers are randomly sampled within
counties. In the second stage, one adult is selected from all adult members
of a sampled household; if eligible, teens and children (by adult proxy) may
also be selected for interview. This page describes how to appropriately account
for this complex sample design when generating weighted estimates and calculating
variance. All of the estimates provided by AskCHIS are generated using these methods.
How is analyzing CHIS data different from analyzing other data?
Most
statistical software packages calculate variance with the assumption that the
data were produced from a simple random sample (SRS). Because CHIS does not use an
SRS, this approach would underestimate the variance of estimates produced from
CHIS data. To estimate variance accurately
in analyses of CHIS data, either replicate weights or the Taylor
series linearization method should be used. For
an overview of the CHIS sample design and important analysis considerations,
view this CHIS webinar. For
detailed descriptions of CHIS sampling and weighting methodology, please refer
to CHIS Methodology
Reports.
Which CHIS variables do I use to account for the complex sample design?
To obtain
correct point estimates, the final weight variable must be used. To obtain
correct variance estimates, it must be used together with either the strata and
cluster indicators or the replicate weights. The final
weight (rakedw0) accounts for the
sample selection probabilities and adjusts for other known potential sources of
bias. When the weight variable is applied, it ensures that point estimates from
the CHIS sample represent the California population. However, using only rakedw0 may produce incorrect
variance estimates despite providing unbiased point estimates, because it does
not account for the complex sample design employed in CHIS. Incorrect variance
calculation may lead to errors in confidence intervals and hypothesis testing.
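As a rough illustration of how applying the final weight changes a point estimate, the following minimal Python sketch compares an unweighted and a weighted proportion. The data values are hypothetical toy numbers, not CHIS data, and the helper name weighted_proportion is an assumption made for this example; only the variable name rakedw0 comes from CHIS.

```python
import numpy as np

def weighted_proportion(indicator, weights):
    """Weighted share of cases where indicator == 1."""
    indicator = np.asarray(indicator, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * indicator) / np.sum(weights))

# Hypothetical toy data (not CHIS values): 1 = has the characteristic.
has_condition = np.array([1, 0, 0, 1, 0])
# Hypothetical final weights; each is the number of people a case represents.
rakedw0 = np.array([1200.0, 300.0, 450.0, 2500.0, 550.0])

unweighted = float(has_condition.mean())                  # ignores the design
weighted = weighted_proportion(has_condition, rakedw0)    # population estimate

print(f"unweighted={unweighted:.2f}, weighted={weighted:.2f}")
```

The weighted figure is the population estimate. A sketch like this yields no valid variance, which still requires the design variables or replicate weights described on this page.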
The variable tsvarstr accounts
for sample stratification, and the variable tsvrunit describes the clustering in the sample design. Both variables are
required when estimating variances with the Taylor series linearization method
(sample code). These two variables are only available in confidential data that
are accessible via our Data Access Center (DAC).
The variables rakedw1-rakedw80 are replicate
weights designed for valid variance estimation in the absence of the sample
design variables. There are 80 replicate weights for each CHIS cycle, and all 80
should be used simultaneously (sample code). The replicate weights are available in the Public Use Files (PUFs) as well as in the confidential data.
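The general idea behind replicate-weight variance estimation can be sketched in Python as follows: the estimate is recomputed once with each replicate weight, and the variance is built from the squared differences between the replicate estimates and the full-sample estimate. This is only an illustrative sketch, not the CHIS sample code. The function name, the synthetic data, and in particular the multiplier of 1.0 are assumptions; the correct multiplier depends on how the replicate weights were constructed and should be confirmed against the CHIS sample code and Methodology Reports.

```python
import numpy as np

def replicate_variance(y, full_weight, replicate_weights, multiplier=1.0):
    """Weighted mean of y and its replicate-weight variance.

    replicate_weights has shape (n_cases, n_replicates).
    multiplier=1.0 is an ASSUMPTION for illustration; confirm the correct
    value for the survey's replication method before relying on it.
    """
    y = np.asarray(y, dtype=float)
    full_weight = np.asarray(full_weight, dtype=float)
    replicate_weights = np.asarray(replicate_weights, dtype=float)

    # Full-sample estimate using the final weight.
    estimate = np.sum(full_weight * y) / np.sum(full_weight)
    # One estimate per replicate weight, all replicates used together.
    rep_estimates = (replicate_weights * y[:, None]).sum(axis=0) / replicate_weights.sum(axis=0)
    variance = multiplier * np.sum((rep_estimates - estimate) ** 2)
    return float(estimate), float(variance)

# Synthetic stand-ins for rakedw0 and rakedw1-rakedw80 (NOT real CHIS data,
# and NOT how CHIS constructs its replicate weights).
rng = np.random.default_rng(0)
n_cases = 200
y = rng.integers(0, 2, size=n_cases)
rakedw0 = rng.uniform(100.0, 1000.0, size=n_cases)
rakedw1_80 = rakedw0[:, None] * rng.uniform(0.9, 1.1, size=(n_cases, 80))

estimate, variance = replicate_variance(y, rakedw0, rakedw1_80)
standard_error = variance ** 0.5
print(f"estimate={estimate:.4f}, SE={standard_error:.4f}")
```

In practice, the survey procedures of the statistical packages listed further down this page perform this calculation once the replicate weights are declared.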
Special notes on using the Taylor series linearization method
The Taylor
series linearization method requires specifying clustering in the sample design
and special handling of subpopulation analyses.
Analyzing Adult, Child, and Teen data. When the adult, teen, or child data are analyzed
separately, cases can be considered independent because only
one adult, one teen, and one child are sampled per household. However, when the adult, teen, and
child data files are concatenated, each household becomes a cluster. Therefore,
the clustering unit variable tsvrunit should be included when analyzing more than one
age group together in one data file. The cluster indicator tsvrunit is not needed if
the data files (adult, teen and child) are analyzed separately.
Subpopulation analysis. Data should not be
subset when using the Taylor series linearization method. This means that when you
want to analyze a specific subpopulation (for example, the population of a
particular region or a certain ethnic group), the analysis should not be
performed on a subset of the data that includes only the target subpopulation. Usually,
this means that you should not simply use “where” or “by” statements in
your data preparation or analysis. The subpopulation analysis should be performed
on the entire CHIS data file using the appropriate command (“subpopulation” or “domain”
statement) or by setting the excluded population to missing (sample code).
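To illustrate why subsetting matters, the Python sketch below computes a linearized variance for a subpopulation mean in two ways: on the full file with a domain indicator, and on a file subset to the subpopulation first. It is a simplified sketch under stated assumptions, not the CHIS sample code: it assumes a stratified design in which each case is its own sampling unit (as when a single age-group file is analyzed), ignores finite population corrections, and uses synthetic data. The function name domain_mean_taylor and the in_region indicator are hypothetical.

```python
import numpy as np

def domain_mean_taylor(y, w, strata, in_domain):
    """Weighted domain mean and its Taylor-linearized variance.

    ASSUMES stratified sampling with every case as its own sampling unit
    and no finite population correction.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    strata = np.asarray(strata)
    d = np.asarray(in_domain, dtype=float)

    total_w = np.sum(w * d)
    mean = np.sum(w * d * y) / total_w
    # Linearized value per case: zero outside the domain, but the case
    # still counts toward its stratum's sample size.
    z = w * d * (y - mean) / total_w

    variance = 0.0
    for h in np.unique(strata):
        z_h = z[strata == h]
        n_h = z_h.size
        if n_h > 1:  # single-case strata contribute nothing in this sketch
            variance += n_h / (n_h - 1) * np.sum((z_h - z_h.mean()) ** 2)
    return float(mean), float(variance)

# Synthetic data (NOT CHIS values); tsvarstr stands in for the stratum variable.
rng = np.random.default_rng(1)
n_cases = 300
tsvarstr = rng.integers(1, 6, size=n_cases)
rakedw0 = rng.uniform(100.0, 1000.0, size=n_cases)
outcome = rng.integers(0, 2, size=n_cases)
in_region = rng.random(n_cases) < 0.3   # hypothetical subpopulation flag

# Domain analysis on the full file.
full_mean, full_var = domain_mean_taylor(outcome, rakedw0, tsvarstr, in_region)
# Subsetting first, shown only for comparison.
sub_mean, sub_var = domain_mean_taylor(
    outcome[in_region], rakedw0[in_region], tsvarstr[in_region],
    np.ones(int(in_region.sum())),
)
print(f"mean={full_mean:.4f}, domain var={full_var:.6f}, subset var={sub_var:.6f}")
```

Both approaches return the same point estimate, but the variances generally differ because the subset version discards the out-of-domain cases that still belong to each stratum's sample.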
What statistical software can I use to analyze CHIS data?
CHIS data can be analyzed
appropriately using complex survey procedures in most major statistical
software packages (follow the links to complex sample survey analysis documentation for
each package), including SAS/STAT V.9.2 and higher, SUDAAN, Stata V.9 and higher, SPSS V.13 and higher, WesVar, and the “survey” R package. From a user’s
perspective, the main difference between these software packages is how the sample
design variables are specified. For example, in SAS/STAT and SUDAAN, design
variables are specified within each procedure, whereas design variables are
specified during a single step preceding the analyses in Stata and SPSS. In
addition, Mplus handles
survey data appropriately for latent variable models.
How do I write programs to obtain correct point estimates and variances?
We have provided basic sample code for
analyzing CHIS data. We also encourage
you to explore examples on UCLA’s Statistical Consulting website.
Which weights do I use in the continuous CHIS cycle?
In 2011-2012, CHIS began continuous
data collection. The weights provided in the CHIS 2011-2012 two-year dataset
remain consistent with the weights provided in previous CHIS cycles (Methodology
Reports). One-year
datasets (2011 or 2012) are available only through the Data Access Center, and
their weights are produced to represent California’s population in that specific
year. (Learn more about what’s new in 2011-2012 CHIS data.)