Analyze CHIS Data - Weighting and Variance Estimation

Weighting and Variance Estimation

The complex sample design of the California Health Interview Survey (CHIS) requires proper weighting and variance calculation, using specialized code, when analyzing data. CHIS employs a two-stage, geographically stratified random-digit-dial (RDD) sample design. In the first stage, telephone numbers are randomly sampled within counties. In the second stage, one adult is selected from all adult members of a sampled household; if eligible, teens and children (by adult proxy) may also be selected for interview. This page describes how to account for this complex sample design when generating weighted estimates and calculating variance. All of the estimates provided by AskCHIS are generated using these methods.

How is analyzing CHIS data different from analyzing other data?

Most statistical software packages calculate variance with the assumption that the data were produced from a simple random sample (SRS). Since CHIS does not use an SRS, this approach would underestimate the variance of estimates produced from CHIS data. To estimate variance accurately in analyses of CHIS data, either replicate weights or the Taylor series linearization method should be used. For an overview of the CHIS sample design and important analysis considerations, view this CHIS webinar. For detailed descriptions of CHIS sampling and weighting methodology, please refer to CHIS Methodology Reports.

Which CHIS variables do I use to account for the complex sample design?

To obtain correct point estimates, the final weight variable must be used. To obtain correct variance estimates, the final weight must be used along with either the strata and cluster indicators or the replicate weights. The final weight (rakedw0) accounts for the sample selection probabilities and adjusts for other known potential sources of bias. Applying this weight ensures that point estimates from the CHIS sample represent the California population. However, using only rakedw0 may produce incorrect variance estimates despite providing unbiased point estimates, because the weight alone does not account for the complex sample design employed in CHIS. Incorrect variance calculation may lead to errors in confidence intervals and hypothesis testing.
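As a rough illustration of what applying the final weight does to a point estimate, the sketch below compares an unweighted and a weighted proportion. The respondents and weight values are invented for the example; only the name rakedw0 comes from CHIS. It covers the point estimate only and says nothing about variance.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for y, w in zip(values, weights)) / total_weight


# Hypothetical respondents: 1 = has the condition, 0 = does not.
has_condition = [1, 0, 0, 1, 0]
# Hypothetical final weights (stand-ins for rakedw0, not real CHIS values).
rakedw0 = [1200.0, 300.0, 500.0, 2500.0, 500.0]

unweighted = sum(has_condition) / len(has_condition)
weighted = weighted_mean(has_condition, rakedw0)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

In this toy data the two respondents with the condition carry large weights, so the weighted proportion is much higher than the unweighted one.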

The variable tsvarstr accounts for sample stratification, and the variable tsvrunit describes the clustering in the sample design. Both variables are required when estimating variances with the Taylor series linearization method (sample code). These two variables are available only in the confidential data, which are accessible via our Data Access Center (DAC).

The variables rakedw1-rakedw80 are replicate weights designed for valid variance estimation in the absence of the sample design variables. There are 80 replicate weights for each CHIS cycle, and all 80 should be used simultaneously (sample code). The replicate weights are available in the Public Use Files (PUFs) as well as in the confidential data.
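The general idea behind replicate-weight variance estimation can be sketched as follows: compute the estimate once with the final weight, recompute it once per replicate weight, and combine the squared deviations. This is a minimal sketch, not the CHIS sample code. The multiplier of 1.0 is an assumption made for illustration and should be checked against the CHIS sample code and Methodology Reports; the toy data use 4 replicates rather than the 80 columns rakedw1-rakedw80.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def replicate_variance(values, full_weights, replicate_weights, multiplier=1.0):
    """Estimate and variance of a weighted mean from replicate weights.

    replicate_weights is a list of weight columns, one list per replicate.
    multiplier=1.0 is an ASSUMPTION for this sketch, not a documented value.
    """
    theta = weighted_mean(values, full_weights)
    squared_devs = [
        (weighted_mean(values, rep) - theta) ** 2 for rep in replicate_weights
    ]
    return theta, multiplier * sum(squared_devs)


# Hypothetical toy data: 4 respondents, 4 replicate weight columns.
y = [1, 0, 1, 0]
full = [1.0, 1.0, 1.0, 1.0]            # stand-in for rakedw0
reps = [                                # stand-ins for rakedw1, rakedw2, ...
    [2.0, 0.0, 1.0, 1.0],
    [0.0, 2.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 2.0],
]

estimate, variance = replicate_variance(y, full, reps)
std_error = math.sqrt(variance)
print(estimate, variance, std_error)
```

Every replicate column enters the sum, which is why all of the replicate weights need to be supplied together rather than one at a time.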

Special notes on using the Taylor series linearization method

The Taylor series linearization method requires specifying clustering in the sample design and special handling of subpopulation analyses.

Analyzing Adult, Child, and Teen data. When the adult, teen, or child data are analyzed separately, cases can be considered to be independent because we only sample one adult, teen, and child per household. However, when the adult, teen, or child data files are concatenated, each household becomes a cluster. Therefore, the clustering unit variable tsvrunit should be included when analyzing more than one age group together in one data file. The cluster indicator tsvrunit is not needed if the data files (adult, teen and child) are analyzed separately.

Subpopulation analysis. Data should not be subset when using the Taylor series linearization method. When you want to analyze a specific subpopulation (for example, the population in a particular region or a certain ethnic group), the analysis should not be performed on a subset of the data that includes only the target subpopulation. In practice, this usually means that you should not simply use “where” or “by” statements in your data preparation or analysis. The subpopulation analysis should be performed on the entire CHIS data file, either by using the appropriate command (a “subpopulation” or “domain” statement) or by setting the excluded population to missing (sample code).
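To see why subsetting matters, the sketch below applies a textbook Taylor series linearization for a weighted mean (stratified design, clusters treated as sampled with replacement) in two ways: once on the full file with a domain indicator, and once on a file subset to the domain. It is an illustration under those assumptions, not the algorithm of any particular package. The keys "stratum" and "psu" play the roles of tsvarstr and tsvrunit; the "region" field and all data values are hypothetical.

```python
from collections import defaultdict


def domain_mean_taylor(rows, in_domain):
    """Linearized variance of a weighted domain mean.

    rows: dicts with keys 'stratum', 'psu', 'weight', 'y'.
    in_domain: function mapping a row to True/False.
    Returns (estimate, variance).
    """
    w_total = sum(r["weight"] for r in rows if in_domain(r))
    if w_total == 0:
        raise ValueError("no weighted cases in the domain")
    estimate = sum(r["weight"] * r["y"] for r in rows if in_domain(r)) / w_total

    # Linearized score summed per cluster. Rows outside the domain add 0,
    # but their clusters still count toward the design.
    scores = defaultdict(lambda: defaultdict(float))
    for r in rows:
        z = r["weight"] * (r["y"] - estimate) / w_total if in_domain(r) else 0.0
        scores[r["stratum"]][r["psu"]] += z

    variance = 0.0
    for psus in scores.values():
        n = len(psus)
        if n < 2:
            raise ValueError("each stratum needs at least two clusters")
        mean_z = sum(psus.values()) / n
        variance += n / (n - 1) * sum((z - mean_z) ** 2 for z in psus.values())
    return estimate, variance


# Hypothetical file: two strata, three single-respondent clusters each.
data = [
    {"stratum": "A", "psu": 1, "weight": 1.0, "y": 1, "region": "north"},
    {"stratum": "A", "psu": 2, "weight": 1.0, "y": 0, "region": "north"},
    {"stratum": "A", "psu": 3, "weight": 1.0, "y": 1, "region": "south"},
    {"stratum": "B", "psu": 4, "weight": 1.0, "y": 1, "region": "north"},
    {"stratum": "B", "psu": 5, "weight": 1.0, "y": 0, "region": "north"},
    {"stratum": "B", "psu": 6, "weight": 1.0, "y": 0, "region": "south"},
]


def is_north(row):
    return row["region"] == "north"


# Domain analysis on the full file.
est_full, var_full = domain_mean_taylor(data, is_north)

# Subsetting first: the "south" clusters disappear from the design.
north_only = [r for r in data if is_north(r)]
est_subset, var_subset = domain_mean_taylor(north_only, lambda r: True)

print(est_full, var_full)
print(est_subset, var_subset)
```

Both approaches return the same point estimate, but the variances differ because the subset file no longer reflects how many clusters were sampled in each stratum.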

What statistical software can I use to analyze CHIS data?

CHIS data can be analyzed appropriately using complex survey procedures in most major statistical software packages (follow the links to complex sample survey analysis documentation for each package), including SAS/STAT V.9.2 and higher, SUDAAN, Stata V.9 and higher, SPSS V.13 and higher, WesVar, and the “survey” R package. From a user’s perspective, the main difference between these software packages is how the sample design variables are specified. For example, in SAS/STAT and SUDAAN, design variables are specified within each procedure, whereas design variables are specified during a single step preceding the analyses in Stata and SPSS. In addition, Mplus also handles survey data appropriately for latent variable models.

How do I write programs to obtain correct point estimates and variances?

We have provided basic sample code for analyzing CHIS data. We also encourage you to explore the examples on UCLA’s Statistical Consulting website.

Which weights do I use in the continuous CHIS cycle?

Starting in 2011-2012, CHIS began continuous data collection. The weights provided in the CHIS 2011-2012 two-year dataset remain consistent with the weights provided in previous CHIS cycles (Methodology Reports). One-year datasets (2011 or 2012) are available only through the Data Access Center, and their weights are produced to represent California’s population in that specific year. (Learn more about what’s new in 2011-2012 CHIS data.)
