Weighting and Variance Estimation

How is analyzing CHIS data different from analyzing other data?

Many statistical software packages typically assume that data originate from a simple random sample (SRS). However, the California Health Interview Survey (CHIS) utilizes a geographically stratified ABS (address-based sampling) sample design with multi-mode data collection through the web and by phone. To accurately handle the complex sample design inherent in CHIS, it is essential to apply appropriate weighting and variance estimation techniques when conducting data analysis. This page provides guidance on how to effectively account for the complex sample design when generating weighted estimates and calculating variance. For detailed descriptions of CHIS sampling and weighting methodology, please refer to CHIS Methodology Reports.

Replicate Weights versus Taylor Series

In order to accurately estimate variance in analyses of CHIS data, either the replicate weight method or the Taylor series linearization method should be used.

Both the replicate weight method and the Taylor series linearization method are commonly employed in survey data analysis to estimate variance and standard errors. In the case of CHIS data, the use of the replicate weight method is particularly beneficial due to the complex sampling design employed. This is because the replicate weights are constructed to account for intricacies in the sampling design, which are then reflected in the calculation of the variability of the estimator due to said sampling design.

While the Center for Health Policy Research recommends the use of the replicate weight method, we understand CHIS data users may still want to use the Taylor series method in their analyses. The Taylor series strata and cluster indicators are only available in the confidential data that are accessible via our Data Access Center (DAC). In most cases, the difference between the variance estimates from the two methods is negligible, and the Taylor series method may run noticeably faster because the replicate weight method requires more computational resources.

Which CHIS variables do I use to account for the complex sample design?

The base weight variable is used to obtain correct point estimates and must be used along with the 1) replicate weights or 2) strata and cluster indicators to obtain correct variance estimates.

Base Weights:
The base weight (rakedw0) accounts for the sample selection probabilities and for statistical adjustments for potential under-coverage and nonresponse biases. Applying this weight ensures that estimates from the CHIS sample are an unbiased representation of the California population. Using only rakedw0, without the replicate weights rakedw1-rakedw80 or the sample design information, will produce unbiased point estimates but will underestimate their variability, because the software then incorrectly assumes simple random sampling and ignores the complex sample design employed in CHIS. Incorrect variance calculation may lead to errors in confidence intervals and hypothesis testing.

Replicate Weights:
The replicate weight variables (rakedw1-rakedw80) are designed for valid variance estimation in the absence of the sample design variables. There are 80 replicate weights for each CHIS data set, and all 80 should be used simultaneously (sample code). The replicate weights are available in the Public Use Files (PUFs) as well as in the confidential data.
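As a rough illustration of what the software does with these variables, the following Python sketch computes a weighted mean with the base weight and a variance from the replicate estimates. Python is not one of the packages covered on this page and is used here only to show the arithmetic. The data are synthetic stand-ins, the function names are hypothetical, and the variance rule (sum of squared deviations from the full-sample estimate, multiplied by a coefficient of 1) is an assumption chosen to mirror the multiplier of 1 used in the sample code on this page.

```python
import random

def weighted_mean(values, weights):
    """Weighted mean of `values` using one weight column."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def replicate_variance(values, base_weights, replicate_weights, coefficient=1.0):
    """Return (point estimate, variance).

    The point estimate uses the base weight; the variance sums the squared
    deviations of each replicate estimate from it, times `coefficient`.
    """
    estimate = weighted_mean(values, base_weights)
    squared_deviations = [
        (weighted_mean(values, column) - estimate) ** 2
        for column in replicate_weights
    ]
    return estimate, coefficient * sum(squared_deviations)

# Synthetic stand-in data (NOT CHIS data): 200 respondents, 80 replicate columns.
rng = random.Random(0)
bmi = [rng.gauss(27, 5) for _ in range(200)]
rakedw0 = [rng.uniform(500, 3000) for _ in range(200)]
replicates = [[w * rng.uniform(0.9, 1.1) for w in rakedw0] for _ in range(80)]

estimate, variance = replicate_variance(bmi, rakedw0, replicates)
print(round(estimate, 2), round(variance ** 0.5, 3))  # estimate and standard error
```

For real analyses, use the survey procedures shown in the sample code rather than hand-rolled arithmetic like this.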

Sample Design Indicators:
The variable tsvarstr accounts for sample stratification, and the variable tsvrunit describes the clustering in sample design. Both variables are required when estimating variances with the Taylor series linearization method (sample code). These two variables are only available in the confidential data that are accessible via our Data Access Center (DAC).

What statistical software can I use to analyze CHIS data?

CHIS data can be analyzed appropriately using complex survey procedures in most major statistical software packages, including SAS/STAT V.9.2 and higher, SUDAAN, Stata V.9 and higher, SPSS V.13 and higher, WesVar, and the “survey” R package. From a user’s perspective, the main difference between these software packages is how the sample design variables are specified. For example, in SAS/STAT, design variables are specified within each procedure, whereas design variables are specified during a single step preceding the analyses in Stata and SPSS.

How do I write programs to obtain correct point estimates and variances?

We have provided basic sample code for analyzing CHIS data. We also encourage you to explore the examples provided by UCLA's Office of Advanced Research Computing.

Special notes on using the Taylor series linearization method

In comparison with replicate weights, the Taylor series linearization method requires much less computing time, but it requires more attention to specifying the sampling design and to handling subpopulation analyses appropriately.

Analyzing Adult, Child, and Teen data. When the adult, teen, or child data are analyzed separately, cases can be considered to be independent because we only sample one adult, teen, and child per household. However, when the adult, teen, or child data files are concatenated, each household becomes a cluster. Therefore, the clustering unit variable tsvrunit should be included when analyzing more than one age group together in one data file. The cluster indicator tsvrunit is not needed if the data files (adult, teen and child) are analyzed separately.

Subpopulation analysis. When analyzing a specific subpopulation (e.g., a certain ethnic group or the population in a particular region), the analysis should not be performed on a subset of the data that includes only the target subpopulation. The subpopulation analysis should be performed on the entire CHIS data using the appropriate command (“subpopulation” or “domain” statement) or by setting the excluded population to missing (sample code). It is usually not appropriate to use “where” or “by” statements in your data preparation or analysis when working with subpopulations.

Pooling CHIS Data

The California Health Interview Survey (CHIS) was conducted as a biennial survey from 2001 through 2009. Beginning in 2011, CHIS data have been collected continuously across a two-year data collection cycle. Continuous data collection allows for the release of one-year data files and estimates for each calendar year. The following sections provide general guidelines for producing estimates and testing hypotheses with pooled/combined data.

Furthermore, users who wish to analyze multi-year CHIS data and create adjusted replicate weight variables can download a SAS macro for this purpose.

General Ideas of Weights Adjustment for Pooled Multiple CHIS Data Files

This section covers how to develop a combined data file when pooling multiple CHIS data files using the replicate weights method.

Within each one-year data set, the base weight, rakedw0, reflects the number of Californians each respondent represents in the data – for example, a case with a weight of 2355 means that the respondent (and his/her answers) represents 2355 Californians. Thus, the sum of base weight across all age groups is an estimate of the total California population based on the control totals used for this survey. You can check this number against California Department of Finance or Census Bureau estimates for the same time period, but you should not expect it to match exactly.
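The interpretation of the base weight can be sketched with a few lines of Python (used here only for the arithmetic). The four weights and the asthma flags below are made-up values, not CHIS records, and the function names are hypothetical.

```python
def population_total(weights):
    """Sum of base weights: the number of people the respondents represent."""
    return sum(weights)

def weighted_percentage(flags, weights):
    """Weighted percentage of the represented population with flag == True."""
    flagged = sum(w for f, w in zip(flags, weights) if f)
    return 100.0 * flagged / sum(weights)

# Made-up example: four respondents and their base weights (rakedw0).
rakedw0 = [2355.0, 1800.0, 3120.0, 990.0]
has_asthma = [True, False, False, True]

print(population_total(rakedw0))                 # 8265.0
print(weighted_percentage(has_asthma, rakedw0))  # weighted, not 2 out of 4
```

The first respondent alone stands in for 2355 people, which is why the weighted percentage differs from the unweighted share of respondents.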

The next section provides a detailed example of how to combine either a CHIS one-year data set with another one-year data set or a one-year data set with a two-year data set. CHIS does not recommend pooling continuous data (CHIS 2011 and beyond) with CHIS data collected prior to 2010 due to methodological changes that affect the comparability of data collected before and after the 2010 U.S. Census.

Note that variables used in pooled-year analyses should have the same name and categories in all of the CHIS data files being pooled. For example, when pooling CHIS 2018 and 2019 data, make sure that education has four categories that mean the same thing in both files. This is something a data user will need to confirm independently.

A general rule of thumb is:

The final number of replicate weights created after pooling equals the number of data files used (regardless of whether it is a one-year data file or two-year file) times 80.
- Pooling CHIS 2018 and CHIS 2019 will result in 2 × 80 = 160 replicate weights.
- Pooling CHIS 2017-2018 two-year data with CHIS 2019 one-year will also create 2 × 80 = 160 replicate weights.
The proportion each data file takes in the final base and replicate weights depends on the number of years it represents, i.e., a one-year data file takes one portion, and a two-year data file takes two portions.
- Pooling CHIS 2018 and CHIS 2019 (both one-year files) will create a data set representing two years of data. For each year, the final base and replicate weights will be 1/2 of the original base and replicate weight.
- Pooling CHIS 2017-2018 and CHIS 2019: the final data should represent three years of data, and CHIS 2017-2018 should take two portions (2/3) and CHIS 2019 takes one portion (1/3).
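The rule of thumb above can be written as a small Python helper (Python is used only to show the arithmetic). The function name `pooling_plan` and its input format, a list giving the number of years each file represents, are hypothetical.

```python
from fractions import Fraction

def pooling_plan(years_per_file):
    """Return (number of pooled replicate weights, weight share per file).

    years_per_file: e.g. [1, 1] for two one-year files,
                    [2, 1] for a two-year file plus a one-year file.
    """
    n_replicates = 80 * len(years_per_file)   # 80 per data file
    total_years = sum(years_per_file)
    shares = [Fraction(years, total_years) for years in years_per_file]
    return n_replicates, shares

print(pooling_plan([1, 1]))  # CHIS 2018 + CHIS 2019
print(pooling_plan([2, 1]))  # CHIS 2017-2018 + CHIS 2019
```

Both calls give 160 replicate weights; the shares are 1/2 and 1/2 in the first case and 2/3 and 1/3 in the second.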

Weights Adjustment When Pooling Two One-Year Data Files

As an example, we concatenated the CHIS 2018 and 2019 Public Use Files (i.e., appended the 2019 file to the 2018 file to create a single data file). The number of respondents in the combined data file is the sum of the respondents in the two individual data files. Table 1 shows the corresponding weight adjustments; Table 2 applies the same rule to combine three one-year PUF files.

To combine the data from CHIS 2018 and CHIS 2019, follow these steps:

Start with the Base Weight: For both years, the new base weight will be half of the original base weight (i.e., 1/2 * RAKEDW0).
Create Replicate Weights: generate 160 replicate weights based on the original replicate weights.
For CHIS 2018:
- The first 80 replicate weights will be set to half of the corresponding original replicate weights (e.g., RAKEDW1 becomes half of RAKEDW1).
- The remaining 80 replicate weights will be set to half of the original base weight (i.e., 1/2 * RAKEDW0).
For CHIS 2019:
- The first 80 replicate weights will be set to half of the original base weight (i.e., 1/2 * RAKEDW0).
- The second set of 80 replicate weights will be set to half of the corresponding original replicate weights (e.g., NEW REPLICATE WEIGHT 81 will be half of RAKEDW1).

Table 1. Weights Adjustment for the Combined Data File for CHIS 2018 and 2019

Year	Final Weight	Replicate Weight 1-80	Replicate Weight 81-160
CHIS 2018	RAKEDW0 x (1/2)	RAKEDW1 x (1/2) , … , RAKEDW80 x (1/2)	RAKEDW0 x (1/2)
CHIS 2019	RAKEDW0 x (1/2)	RAKEDW0 x (1/2)	RAKEDW1 x (1/2) , … , RAKEDW80 x (1/2)

Table 2. (3-Year) Weights Adjustment for the Combined Data File for CHIS 2018, 2019, and 2020

Year	Final Weight	Replicate Weight 1-80	Replicate Weight 81-160	Replicate Weight 161-240
CHIS 2018	RAKEDW0 x (1/3)	RAKEDW1 x (1/3) , … , RAKEDW80 x (1/3)	RAKEDW0 x (1/3)	RAKEDW0 x (1/3)
CHIS 2019	RAKEDW0 x (1/3)	RAKEDW0 x (1/3)	RAKEDW1 x (1/3) , … , RAKEDW80 x (1/3)	RAKEDW0 x (1/3)
CHIS 2020	RAKEDW0 x (1/3)	RAKEDW0 x (1/3)	RAKEDW0 x (1/3)	RAKEDW1 x (1/3) , … , RAKEDW80 x (1/3)
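The layout in Tables 1 and 2 can be sketched in Python as follows (Python is used only for illustration; the SAS macro mentioned on this page is the supported tool). The pooled variable names `poolw0` to `poolw160`, the function names, and the toy records are all hypothetical. Each record receives its own scaled replicate weights in the block of 80 columns that belongs to its file and the scaled base weight in every other block.

```python
def pool_files(files, shares):
    """Build pooled base and replicate weights.

    files:  list of files; each file is a list of records (dicts) holding
            'rakedw0' ... 'rakedw80'.
    shares: one multiplier per file (e.g. [0.5, 0.5] for two one-year files).
    """
    pooled = []
    for file_index, (records, share) in enumerate(zip(files, shares)):
        for record in records:
            row = {"file": file_index, "poolw0": share * record["rakedw0"]}
            for block in range(len(files)):
                for r in range(1, 81):
                    # Own block: original replicate weight; other blocks: base weight.
                    source = f"rakedw{r}" if block == file_index else "rakedw0"
                    row[f"poolw{80 * block + r}"] = share * record[source]
            pooled.append(row)
    return pooled

def toy_record(base):
    """Made-up record: replicate weight r is simply base + r."""
    record = {"rakedw0": base}
    for r in range(1, 81):
        record[f"rakedw{r}"] = base + r
    return record

chis2018 = [toy_record(1000.0)]
chis2019 = [toy_record(2000.0)]
pooled = pool_files([chis2018, chis2019], [0.5, 0.5])
print(pooled[0]["poolw1"], pooled[0]["poolw81"])  # 500.5 500.0
print(pooled[1]["poolw1"], pooled[1]["poolw81"])  # 1000.0 1000.5
```

Passing three files with shares of 1/3 each would produce the 240-column layout of Table 2.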

Weight Adjustment When Pooling One-Year Data with Two-Year CHIS Data

When pooling a one-year data set with a two-year data set, the final weight must be adjusted to account for the fact that the two-year data are weighted to reflect the total California population for the two-year period. Dividing the final weight by two in this instance would give equal importance to the one-year data and the two-year data, whereas the two-year data should count twice as much as the one-year data.

To combine the data from CHIS 2017-2018 and CHIS 2019, follow these steps:

Start with the Base Weight:
- For CHIS 2017-2018, the new base weight = RAKEDW0 x 2/3
- For CHIS 2019, the new base weight = RAKEDW0 x 1/3
Creating Replicate Weights: We'll generate 160 replicate weights based on the original replicate weights.
- For CHIS 2017-2018:
  - The first 80 replicate weights will be set to two thirds of the corresponding original replicate weights (2/3 x RAKEDW1, …, 2/3 x RAKEDW80).
  - The remaining 80 replicate weights will be set to two thirds of the original base weight (i.e., 2/3 x RAKEDW0).
- For CHIS 2019:
  - The first 80 replicate weights will be set to one third of the original base weight (i.e., 1/3 x RAKEDW0).
  - The second set of 80 replicate weights will be set to one third of the corresponding original replicate weights (1/3 x RAKEDW1, …, 1/3 x RAKEDW80).
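To see why the shares are 2/3 and 1/3 rather than 1/2 each, the following Python sketch compares the population total implied by the pooled base weights under both choices. The two file totals are made-up numbers, not CHIS estimates, and the function name is hypothetical.

```python
def pooled_total(file_totals, years_per_file):
    """Sum of pooled base weights when each file's weights are scaled by
    (years the file represents) / (total years pooled)."""
    total_years = sum(years_per_file)
    return sum(
        total * years / total_years
        for total, years in zip(file_totals, years_per_file)
    )

# Hypothetical sums of rakedw0 in each file (illustrative values only).
total_2017_2018 = 39_000_000.0   # two-year file
total_2019 = 39_600_000.0        # one-year file

proportional = pooled_total([total_2017_2018, total_2019], [2, 1])
equal_halves = pooled_total([total_2017_2018, total_2019], [1, 1])
print(proportional)   # 2/3 and 1/3 shares
print(equal_halves)   # 1/2 and 1/2 shares
```

With equal halves, the single year 2019 would carry as much weight as 2017 and 2018 combined; the proportional shares give each of the three years the same influence.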

Sample Code

This page provides sample code for appropriately analyzing California Health Interview Survey (CHIS) data in several statistical software packages using the replicate weights method or the Taylor series linearization method. For a summary of how weighting and variance estimation work in analyzing CHIS Data, refer to Weighting and Variance Estimation.

Because CHIS uses a complex sample design, accurate variance estimation requires either the replicate weights method or the Taylor series linearization method.

Sample code is provided below; a PDF version of the sample code is also available for download.

Replicate Weighting examples are available for:
- Stata, R, SAS, SUDAAN.
Taylor Series examples are available for:
- Stata, R, SAS, SUDAAN, SPSS.
Pooling examples are available for:
- Stata, R, SAS.

Replicate Weights

Examples here illustrate how CHIS PUFs can be analyzed with replicate weights to produce valid variance estimates using statistical software packages. For continuous variables, sample calculations of means and linear regression analysis are presented; for categorical variables, sample calculations of frequencies and logistic regression analysis are presented. Estimates and standard errors are identical across the software packages examined, but confidence intervals may differ because of different default methods of computation (for SAS, the default is Wald confidence intervals; for SUDAAN and Stata, it is logit confidence intervals).
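The difference between the two confidence interval methods mentioned above can be sketched in Python for a single proportion. This is a simplified illustration: it uses the normal critical value 1.96, whereas the survey procedures typically use a t distribution with design-based degrees of freedom, and the estimate and standard error are made-up numbers. The function names are hypothetical.

```python
import math

def wald_ci(p, se, z=1.96):
    """Symmetric interval on the proportion scale."""
    return p - z * se, p + z * se

def logit_ci(p, se, z=1.96):
    """Interval built on the logit scale, then transformed back."""
    logit = math.log(p / (1 - p))
    se_logit = se / (p * (1 - p))          # delta-method standard error
    lower, upper = logit - z * se_logit, logit + z * se_logit
    inverse = lambda x: 1 / (1 + math.exp(-x))
    return inverse(lower), inverse(upper)

p, se = 0.08, 0.01   # made-up estimate and standard error
print(wald_ci(p, se))
print(logit_ci(p, se))
```

Both intervals are built from the same estimate and standard error; the logit interval is asymmetric and always stays between 0 and 1, which is why the limits can differ slightly across packages.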

Stata

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

*Sample design specification step* 
use "DATASET LOCATION"
svyset [pw=rakedw0], jkrw(rakedw1-rakedw80, multiplier(1)) vce(jack) mse 
  
*Analysis* 
svy: mean bmi, over(racehpr2) 
svy: mean bmi, over(srsex racehpr2)

In Stata, the sample design specification step should be included before conducting any analysis.

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex). Weighted counts are also given.

*Sample design specification step* 
use "DATASET LOCATION"
svyset [pw=rakedw0], jkrw(rakedw1-rakedw80, multiplier(1)) vce(jack) mse 
 
*Analysis* 
svy: tabulate astcur racehpr2, col se ci  
svy, subpop (if srsex==1): tab astcur racehpr2, col se ci 
svy, subpop (if srsex==2): tab astcur racehpr2, col se ci 
svy: tabulate astcur racehpr2, count format(%11.0fc)

In Stata, the sample design specification step should be included before conducting any analysis.

Stata V.10 and higher cannot accommodate 3 or more variables in the tab command.

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables; White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

*Sample design specification step* 
use "DATASET LOCATION"
svyset [pw=rakedw0], jkrw(rakedw1-rakedw80, multiplier(1)) vce(jack) mse 
 
*Analysis*  
recode racehpr2 (6=1) (1=2) (2=3) (3=4) (4=5) (5=6) (7=7), gen(race) 
  
xi: svy: regress bmi i.srsex i.race srage

In Stata, the sample design specification step should be included before conducting any analysis.

Recoding is done in order to choose “White” (racehpr2=6) as the reference group.

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). SUDAAN and Stata require the dependent variables to be coded as 0 and 1 for logistic regression, so a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

*Sample design specification step* 
use "DATASET LOCATION"
svyset [pw=rakedw0], jkrw(rakedw1-rakedw80, multiplier(1)) vce(jack) mse 
 
*Analysis*  
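recode astcur (2=0) (1=1) (-9=.) (-1=.), gen (ast) 
* race is recoded here so that "White" (racehpr2=6) becomes the reference group
recode racehpr2 (6=1) (1=2) (2=3) (3=4) (4=5) (5=6) (7=7), gen(race) 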
  
xi: svy: logit ast srage i.race i.srsex 

xi: svy: logistic ast srage i.race i.srsex

In Stata, the sample design specification step should be included before conducting any analysis.

logit produces parameter estimates.

logistic produces odds ratios. Stata automatically chooses the lowest value of the categorical variable as the reference group for the independent and dependent variables.
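The recoding of astcur and the relationship between the two kinds of output can be sketched in Python (used here only to show the logic). The function names are hypothetical, and treating every value other than 1 or 2 as missing is an assumption that mirrors the recode step in the Stata code above.

```python
import math

def recode_astcur(astcur):
    """1 ("Current asthma") -> 1, 2 ("No current asthma") -> 0, else missing."""
    if astcur == 1:
        return 1
    if astcur == 2:
        return 0
    return None   # e.g. -9 or -1 are treated as missing

def odds_ratio(coefficient):
    """An odds ratio is the exponentiated logit coefficient."""
    return math.exp(coefficient)

print([recode_astcur(v) for v in [1, 2, -9, -1]])  # [1, 0, None, None]
print(odds_ratio(0.0))                             # 1.0 (no association)
```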

R

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svrepdesign(data = data , weights = ~ rakedw0 , 
                           repweights = "rakedw[1-9]" , 
                           type = "other" , scale = 1 ,
                           rscales = 1 , mse = TRUE)
# Analysis

chis_design |> 
  svyby( formula = ~ bmi , by = ~ racehpr2,
         FUN=svymean, vartype=c("se","ci"), deff = TRUE)

chis_design |> 
  svyby( formula = ~ bmi , by = ~ racehpr2+srsex,
         FUN=svymean, vartype=c("se","ci"), deff = TRUE)

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svrepdesign(data = data , weights = ~ rakedw0 , 
                           repweights = "rakedw[1-9]" , 
                           type = "other" , scale = 1 ,
                           rscales = 1 , mse = TRUE)
# Analysis

# sub pop 1
chis_design |> subset(srsex=="1") |> 
  svytable(formula=~astcur+racehpr2) |>
  prop.table()  # proportions

chis_design |> subset(srsex=="1") |> 
  svytable(formula=~astcur+ racehpr2) |>
  # prop.table() |> 
  summary()  # totals with chi-square stats

# sub pop 2
chis_design |> subset(srsex=="2") |> 
  svytotal(x=~interaction(astcur, racehpr2))  # total as interaction

chis_design |> subset(srsex=="2") |> 
  svymean(x=~interaction(astcur, racehpr2)) |>
  confint()  # confidence intervals

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables; White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svrepdesign(data = data , weights = ~ rakedw0 , 
                           repweights = "rakedw[1-9]" , 
                           type = "other" , scale = 1 ,
                           rscales = 1 , mse = TRUE)
# Analysis

# use relevel(ref=,...) inside the formula to pick reference category
# can alternatively pre process data before the formula call

chis_design |> 
  svyglm(formula = bmi ~ relevel(as.factor(srsex),ref="1") + 
           relevel(as.factor(racehpr2),ref="6") + 
           srage,
         family=stats::gaussian(),
         rescale=TRUE) |> 
         summary()

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). SUDAAN and Stata require the dependent variables to be coded as 0 and 1 for logistic regression, so a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svrepdesign(data = data , weights = ~ rakedw0 , 
                           repweights = "rakedw[1-9]" , 
                           type = "other" , scale = 1 ,
                           rscales = 1 , mse = TRUE)

# Analysis

# use stats::update() to transform ASTCUR after the survey design spec
# can alternatively pre process data before the survey design spec

# use relevel(ref=,...) inside the formula to pick reference category
# can alternatively pre process data before the formula call

chis_design |> 
  stats::update(ast = ifelse(astcur==1, 1, ifelse(astcur==2, 0, NA))) |>
  svyglm(formula = ast ~ srage + 
           relevel(as.factor(racehpr2),ref="6") + 
           relevel(as.factor(srsex),ref="1"),
         family=quasibinomial(),
         rescale=TRUE) |> 
         summary()

SAS

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

PROC SORT DATA = data; 
BY racehpr2; 
RUN; 
 
PROC SURVEYMEANS DATA = data VARMETHOD=JACKKNIFE; 
WEIGHT rakedw0; 
REPWEIGHT rakedw1-rakedw80/JKCOEFS=1; 
VAR bmi; 
BY racehpr2; RUN; 
 
PROC SORT DATA = data; 
BY racehpr2 srsex; 
RUN; 
  
PROC SURVEYMEANS DATA = data VARMETHOD=JACKKNIFE; 
WEIGHT rakedw0; 
REPWEIGHT rakedw1-rakedw80/JKCOEFS=1; 
VAR bmi; 
BY racehpr2 srsex; 
RUN;

Jackknife coefficients are necessary for accurate variance calculations, and jackknife coefficients of 1 in SAS will produce equal variance calculations as those produced in SUDAAN. However, for SAS V.9.2(TS1M0) and earlier, a value of 1 will not be accepted; as a substitute, 0.9999 can be entered. Without this specification, the default value of the jackknife coefficients will be [(# replicate weights ‐ 1)/# replicate weights]; for CHIS, this would be [(80 ‐ 1)/80] = 0.9875.
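The effect of the coefficient can be sketched in Python, under the assumption that the variance is the coefficient times the sum of squared deviations of the replicate estimates from the full-sample estimate. The replicate estimates below are made-up values and the function name is hypothetical.

```python
import math

def jackknife_variance(estimate, replicate_estimates, coefficient):
    """Coefficient times the sum of squared deviations from the estimate."""
    return coefficient * sum((r - estimate) ** 2 for r in replicate_estimates)

estimate = 27.0
# 80 made-up replicate estimates scattered around the full-sample estimate.
replicate_estimates = [27.0 + 0.01 * ((i % 5) - 2) for i in range(80)]

with_one = jackknife_variance(estimate, replicate_estimates, 1.0)
with_default = jackknife_variance(estimate, replicate_estimates, (80 - 1) / 80)

# Ratio of the two standard errors: sqrt(0.9875), roughly 0.9937.
print(math.sqrt(with_default / with_one))
```

Under this assumption, leaving the coefficient at its default would shrink every standard error by a little over half a percent.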

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

PROC SURVEYFREQ DATA = data VARMETHOD=JACKKNIFE; 
WEIGHT rakedw0; 
REPWEIGHT rakedw1-rakedw80/JKCOEFS=1; 
TABLES racehpr2*astcur/row; 
RUN; 

PROC SURVEYFREQ DATA = data VARMETHOD=JACKKNIFE; 
WEIGHT rakedw0; 
REPWEIGHT rakedw1-rakedw80/JKCOEFS=1; 
TABLES srsex*racehpr2*astcur/row; 
RUN;

One caveat in creating multiple tables in one PROC SURVEYFREQ procedure is that the procedure takes the smallest applicable sample sizes among all variables. Therefore, creating one table per PROC SURVEYFREQ procedure is recommended.
Jackknife coefficients are necessary for accurate variance calculations, and jackknife coefficients of 1 in SAS will produce equal variance calculations as those produced in SUDAAN. However, for SAS V.9.2(TS1M0) and earlier, a value of 1 will not be accepted; as a substitute, 0.9999 can be entered. Without this specification, the default value of the jackknife coefficients will be [(# replicate weights ‐ 1)/# replicate weights]; for CHIS, this would be [(80 ‐ 1)/80] = 0.9875.

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables; White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

PROC SURVEYREG DATA = data VARMETHOD=JACKKNIFE; 
WEIGHT rakedw0; 
REPWEIGHT rakedw1-rakedw80/JKCOEFS=1; 
FORMAT racehpr2 racehprf. srsex srsex.; 
CLASS racehpr2 srsex; 
MODEL bmi = srsex racehpr2 srage/SOLUTION;
RUN;

Jackknife coefficients are necessary for accurate variance calculations, and jackknife coefficients of 1 in SAS will produce equal variance calculations as those produced in SUDAAN. However, for SAS V.9.2(TS1M0) and earlier, a value of 1 will not be accepted; as a substitute, 0.9999 can be entered. Without this specification, the default value of the jackknife coefficients will be [(# replicate weights ‐ 1)/# replicate weights]; for CHIS, this would be [(80 ‐ 1)/80] = 0.9875
When the values are formatted either in the data step or in the procedure, SAS automatically picks the category of the categorical variables whose label is alphabetically last as a reference group.
SOLUTION option provides the parameter estimates when using a CLASS statement.

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). SUDAAN and Stata require the dependent variables to be coded as 0 and 1 for logistic regression, so a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

PROC SURVEYLOGISTIC DATA = data VARMETHOD=JACKKNIFE;
FORMAT astcur astcurf. racehpr2 racehprf. srsex srsex.;
WEIGHT rakedw0; 
REPWEIGHT rakedw1-rakedw80/JKCOEFS=1; 
CLASS astcur (REF="NO CURRENT ASTHMA") racehpr2 (REF="WHITE") srsex (REF="MALE")/PARAM=REF; 
MODEL astcur = racehpr2 srsex srage; 
RUN;

In PROC SURVEYLOGISTIC, the reference category of the independent and dependent variables may be specified in a CLASS statement. PARAM=REF is specified to ensure dummy coding of the categorical independent variables.

Jackknife coefficients are necessary for accurate variance calculations, and jackknife coefficients of 1 in SAS will produce equal variance calculations as those produced in SUDAAN. However, for SAS V.9.2 (TS1M0) and earlier, a value of 1 will not be accepted; as a substitute, 0.9999 can be entered. Without this specification, the default value of the jackknife coefficients will be [(# replicate weights ‐ 1)/# replicate weights]; for CHIS, this would be [(80 ‐ 1)/80] = 0.9875.

SUDAAN

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

PROC DESCRIPT DATA = data FILETYPE=SAS DESIGN=JACKKNIFE; 
WEIGHT rakedw0; 
JACKWGTS rakedw1-rakedw80/ADJJACK=1;
VAR bmi; 
TABLES racehpr2 racehpr2*srsex; 
SUBGROUP racehpr2 srsex; 
LEVELS 7 2; 
RUN;

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

PROC CROSSTAB DATA = data FILETYPE=SAS DESIGN=JACKKNIFE; 
WEIGHT rakedw0; 
JACKWGTS rakedw1-rakedw80/ADJJACK=1; 
TABLES racehpr2*astcur srsex*racehpr2*astcur; 
SUBGROUP astcur racehpr2 srsex; 
LEVELS 2 7 2; 
RUN;

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables; White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

PROC REGRESS DATA = data FILETYPE=SAS DESIGN=JACKKNIFE; 
WEIGHT rakedw0; 
JACKWGTS rakedw1-rakedw80/ADJJACK=1; 
SUBGROUP racehpr2 srsex; 
LEVELS 7 2; 
REFLEVEL racehpr2=6 srsex=1; 
MODEL bmi = racehpr2 srsex srage; 
RUN;

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). SUDAAN and Stata require the dependent variables to be coded as 0 and 1 for logistic regression, so a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

DATA newdata; 
SET data; 
IF astcur=1 THEN ast=1; 
ELSE IF astcur=2 THEN ast=0; 
RUN;   

PROC RLOGIST data = newdata FILETYPE=SAS DESIGN=JACKKNIFE; 
WEIGHT rakedw0; 
JACKWGTS rakedw1-rakedw80/ADJJACK=1; 
SUBGROUP racehpr2 srsex; 
LEVELS 7 2; 
REFLEVEL racehpr2 = 6 srsex = 1; 
MODEL ast = racehpr2 srsex srage; RUN;

Taylor Series Linearization Method

Examples here illustrate how CHIS data can be analyzed with Taylor series linearization to produce valid variance estimates using statistical software packages. The required variables for the Taylor series linearization method (tsvarstr and tsvrunit) have not been included in the CHIS Public Use Files.

For continuous variables, sample calculations of means and linear regression analysis are presented; for categorical variables, sample calculations of frequencies and logistic regression analysis are presented. Estimates and standard errors are identical across the software packages examined, but confidence intervals may differ because of different default methods of computation (for SAS, the default is Wald confidence intervals; for SUDAAN and Stata, it is logit confidence intervals).

Stata

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by the interaction between race and sex (racehpr2*srsex).

*Sample design specification step*
use "DATASET LOCATION"
svyset tsvrunit [pw=rakedw0], strata (tsvarstr)
  
*Analysis* 
svy: mean bmi, over(racehpr2) 
svy: mean bmi, over(srsex racehpr2)

In Stata, the sample design specification step should be included before conducting any analysis.
When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex). Weighted counts are also given.

*Sample design specification step*
use "DATASET LOCATION"
svyset tsvrunit [pw=rakedw0], strata (tsvarstr)
 
*Analysis*
svy: tabulate astcur racehpr2, col se ci  
svy, subpop (if srsex==1): tab astcur racehpr2, col se ci 
svy, subpop (if srsex==2): tab astcur racehpr2, col se ci 
svy: tabulate astcur racehpr2, count format(%11.0fc)

In Stata, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Stata V.10 and higher cannot accommodate 3 or more variables in the tab command.

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables, and White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

*Sample design specification step*
use "DATASET LOCATION"
svyset tsvrunit [pw=rakedw0], strata (tsvarstr)
 
*Analysis*   
recode racehpr2 (6=1) (1=2) (2=3) (3=4) (4=5) (5=6) (7=7), gen(race)
 
xi: svy: regress bmi i.srsex i.race srage

In Stata, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Reordering is done in order to choose “White” (racehpr2=6) as the reference group.

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). As SUDAAN and Stata require the dependent variables coded as 0 and 1 for logistic regression, a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

*Sample design specification step*
use "DATASET LOCATION"
svyset tsvrunit [pw=rakedw0], strata (tsvarstr)
 
*Analysis*  
recode astcur (2=0) (1=1) (-9=.) (-1=.), gen(ast)
recode racehpr2 (6=1) (1=2) (2=3) (3=4) (4=5) (5=6) (7=7), gen(race)
  
xi: svy: logit ast srage i.race i.srsex 

xi: svy: logistic ast srage i.race i.srsex

In Stata, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

logit produces parameter estimates.

logistic produces odds ratios. Stata automatically chooses the lowest value of the categorical variable as the reference group for the independent and dependent variables.
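The link between the two commands' output, and the 0/1 recode of the dependent variable, can be sketched outside Stata. The Python snippet below is illustrative only: recode_astcur and odds_ratio are hypothetical helper names, and the coefficient value is made up rather than taken from any CHIS estimate.

```python
import math

# Mirrors the recode step above: 2 -> 0, 1 -> 1, -9 and -1 -> missing.
# Any other value is left unchanged, as Stata's recode would leave it.
RECODE_MAP = {2: 0, 1: 1, -9: None, -1: None}

def recode_astcur(astcur):
    """Hypothetical helper: return the 0/1 coding of astcur (None = missing)."""
    return RECODE_MAP.get(astcur, astcur)

def odds_ratio(coefficient):
    """An odds ratio is the exponentiated logit coefficient."""
    return math.exp(coefficient)

# Made-up coefficient for illustration only; not a CHIS estimate.
beta = 0.25
print([recode_astcur(v) for v in (1, 2, -9, -1)])  # [1, 0, None, None]
print(round(odds_ratio(beta), 3))                  # 1.284
```

A coefficient of 0 corresponds to an odds ratio of 1, which is why the two commands lead to the same conclusions about direction and significance.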

R

The Taylor series examples in R use the survey package and are identical to the replicate weight examples except for one step: the survey design is set up with svydesign() rather than svrepdesign() at the very beginning, as demonstrated below.

library(survey)

# instead of ?svrepdesign() for replicate weights
# chis_design <- svrepdesign(data=your_chis_data, ... )

# use ?svydesign() for taylor series
chis_design <- svydesign(id=~tsvrunit, 
                         strata=~tsvarstr, 
                         weights=~rakedw0, 
                         data=your_chis_data,
                         nest=TRUE)

# Now, all downstream estimation functions are the same
svymean(design=chis_design, … )

Below are the Taylor series examples, which use svydesign() instead of svrepdesign().

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by the interaction between race and sex (racehpr2*srsex).

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svydesign(id=~tsvrunit, 
                         strata=~tsvarstr, 
                         weights=~rakedw0, 
                         data=your_chis_data,
                         nest=TRUE)
# Analysis

chis_design |> 
  svyby( formula = ~ bmi , by = ~ racehpr2,
         FUN=svymean, vartype=c("se","ci"), deff = TRUE)

chis_design |> 
  svyby( formula = ~ bmi , by = ~ racehpr2+srsex,
         FUN=svymean, vartype=c("se","ci"), deff = TRUE)

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svydesign(id=~tsvrunit, 
                         strata=~tsvarstr, 
                         weights=~rakedw0, 
                         data=your_chis_data,
                         nest=TRUE)
# Analysis

# sub pop 1
chis_design |> subset(srsex=="1") |> 
  svytable(formula=~astcur+racehpr2) |>
  prop.table()  # proportions

chis_design |> subset(srsex=="1") |> 
  svytable(formula=~astcur+ racehpr2) |>
  # prop.table() |> 
  summary()  # totals with chi-square stats

# sub pop 2
chis_design |> subset(srsex=="2") |> 
  svytotal(x=~interaction(astcur, racehpr2))  # total as interaction

chis_design |> subset(srsex=="2") |> 
  svymean(x=~interaction(astcur, racehpr2)) |>
  confint()  # confidence intervals

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables, and White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svydesign(id=~tsvrunit, 
                         strata=~tsvarstr, 
                         weights=~rakedw0, 
                         data=your_chis_data,
                         nest=TRUE)
# Analysis

# use relevel(ref=,...) inside the formula to pick reference category
# can alternatively pre-process data before the formula call

chis_design |> 
  svyglm(formula = bmi ~ relevel(as.factor(srsex),ref="1") + 
           relevel(as.factor(racehpr2),ref="6") + 
           srage,
         family=stats::gaussian(),
         rescale=TRUE) |> 
         summary()

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). As SUDAAN and Stata require the dependent variables coded as 0 and 1 for logistic regression, a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

# Sample design specification step

# Developed with 'R version 4.2.3' and 'survey version 4.1-1'
library(survey)

chis_design <- svydesign(id=~tsvrunit, 
                         strata=~tsvarstr, 
                         weights=~rakedw0, 
                         data=your_chis_data,
                         nest=TRUE)

# Analysis

# use stats::update() to transform ASTCUR after the survey design spec
# can alternatively pre-process data before the survey design spec

# use relevel(ref=,...) inside the formula to pick reference category
# can alternatively pre-process data before the formula call

chis_design |> 
  stats::update(ast = ifelse(astcur==2,0,1)) |>
  svyglm(formula = ast ~ srage + 
           relevel(as.factor(racehpr2),ref="6") + 
           relevel(as.factor(srsex),ref="1"),
         family=quasibinomial(),
         rescale=TRUE) |> 
         summary()

SAS

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by the interaction between race and sex (racehpr2*srsex).

PROC SURVEYMEANS DATA = data mean stderr NOMCAR VARMETHOD=TAYLOR;
STRATA tsvarstr; CLUSTER tsvrunit;
WEIGHT rakedw0; 
VAR bmi; 
DOMAIN racehpr2 racehpr2*srsex;
RUN;

If conducting a domain analysis, the DOMAIN statement is necessary for accurate variance estimation. Using BY or WHERE statements will not produce valid variance estimates for the subpopulation/domain. In SAS, the NOMCAR option presents the assumption that missing values are not completely at random. This, along with the DOMAIN statement, is the appropriate approach for domain analyses, which uses the entire sample for variance estimation.
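Why the DOMAIN statement matters can be seen in a small Python sketch. It assumes a textbook with-replacement Taylor linearization for a weighted domain mean (a ratio estimator); SAS may apply further refinements, and the function name and toy data are hypothetical. The same function is run on the full sample (the DOMAIN-style approach) and on a subset with non-domain rows dropped (the WHERE-style approach).

```python
from collections import defaultdict

def domain_mean_and_variance(rows):
    """rows: (stratum, psu, weight, y, in_domain) tuples.

    Returns the weighted domain mean and its linearized variance,
    treating PSUs as sampled with replacement within strata.
    """
    x_total = sum(w for _, _, w, _, d in rows if d)
    y_total = sum(w * y for _, _, w, y, d in rows if d)
    mean = y_total / x_total

    # Linearized score summed per PSU; a PSU with no domain members
    # still counts as a PSU, with a score total of 0.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y, d in rows:
        score = w * (y - mean) / x_total if d else 0.0
        psu_totals[stratum][psu] += score

    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            continue  # single-PSU strata contribute nothing in this sketch
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, variance

# Toy data (not CHIS): PSU 3 in stratum 1 has no domain members.
rows = [
    (1, 1, 10.0, 20.0, True),
    (1, 1, 10.0, 30.0, False),
    (1, 2, 10.0, 26.0, True),
    (1, 3, 10.0, 24.0, False),
    (2, 1, 5.0, 22.0, True),
    (2, 2, 5.0, 28.0, True),
]

full = domain_mean_and_variance(rows)                        # DOMAIN-style
subset = domain_mean_and_variance([r for r in rows if r[4]])  # WHERE-style
print(full[0] == subset[0])  # True: the point estimate is unchanged
print(full[1] != subset[1])  # True: the variance estimate differs
```

Dropping rows removes PSU 3 from stratum 1, so the subset run counts two PSUs instead of three in that stratum and produces a different variance, even though the mean is the same.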

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

PROC SURVEYFREQ DATA = data NOMCAR VARMETHOD=TAYLOR;
STRATA tsvarstr; CLUSTER tsvrunit; 
WEIGHT rakedw0; 
TABLE racehpr2*astcur/row; 
RUN;   

PROC SURVEYFREQ DATA = data NOMCAR VARMETHOD=TAYLOR;
STRATA tsvarstr; CLUSTER tsvrunit; 
WEIGHT rakedw0; 
TABLE srsex*racehpr2*astcur/row; 
RUN;

One caveat in creating multiple tables in one PROC SURVEYFREQ procedure is that the procedure takes the smallest applicable sample sizes among all variables. Therefore, creating one table per one PROC SURVEYFREQ procedure is recommended.

PROC SURVEYFREQ has no DOMAIN statement; to conduct a domain analysis, include the domain variable in the TABLE request, as srsex is in the second example above. Using BY or WHERE statements will not produce valid variance estimates for the subpopulation/domain. The NOMCAR option presents the assumption that missing values are not completely at random. Together, these use the entire sample for variance estimation, which is the appropriate approach for domain analyses.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables, and White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

PROC SURVEYREG DATA = data NOMCAR VARMETHOD=TAYLOR;
STRATA tsvarstr; CLUSTER tsvrunit; 
WEIGHT rakedw0; 
FORMAT racehpr2 racehprf. srsex srsex.; 
CLASS racehpr2 srsex;
MODEL bmi = srsex racehpr2 srage/SOLUTION; 
RUN;

If conducting a domain analysis, the DOMAIN statement is necessary for accurate variance estimation. Using BY or WHERE statements will not produce valid variance estimates for the subpopulation/domain. The NOMCAR option presents the assumption that missing values are not completely at random. This, along with the DOMAIN statement, is the appropriate approach for domain analyses, which uses the entire sample for variance estimation.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

SOLUTION option provides the parameter estimates when using a CLASS statement.

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). As SUDAAN and Stata require the dependent variables coded as 0 and 1 for logistic regression, a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

PROC SURVEYLOGISTIC DATA = data NOMCAR VARMETHOD=TAYLOR;
FORMAT astcur astcurf. racehpr2 racehprf. srsex srsex.;
STRATA tsvarstr; 
CLUSTER tsvrunit;
WEIGHT rakedw0; 
CLASS astcur (REF="NO CURRENT ASTHMA") racehpr2 (REF="WHITE") srsex (REF="MALE")/PARAM=REF;
MODEL astcur = racehpr2 srsex srage;
RUN;

If conducting a domain analysis, the DOMAIN statement is necessary for accurate variance estimation. Using BY or WHERE statements will not produce valid variance estimates for the subpopulation/domain. The NOMCAR option presents the assumption that missing values are not completely at random. This, along with the DOMAIN statement, is the appropriate approach for domain analyses, which uses the entire sample for variance estimation.

In PROC SURVEYLOGISTIC, the reference category of the independent and dependent variables may be specified in a CLASS statement. PARAM=REF is specified to ensure dummy coding of the categorical independent variables.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

SUDAAN

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by the interaction between race and sex (racehpr2*srsex).

PROC SORT DATA = data; 
BY tsvarstr tsvrunit;
RUN; 
 
PROC DESCRIPTIVE DATA = data FILETYPE=SAS DESIGN=WR; 
NEST tsvarstr tsvrunit;  
WEIGHT rakedw0; 
CLASS racehpr2 racehpr2*srsex; 
VAR bmi; 
TABLES racehpr2 racehpr2*srsex; 
SUBGROUP racehpr2 srsex; 
LEVELS 7 2; 
RUN;

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

PROC SORT DATA = data; 
BY tsvarstr tsvrunit;
RUN; 
  
PROC CROSSTAB DATA = data FILETYPE=SAS DESIGN=WR; 
NEST tsvarstr tsvrunit;  
WEIGHT rakedw0; 
TABLES racehpr2*astcur srsex*racehpr2*astcur; 
SUBGROUP astcur racehpr2 srsex; 
LEVELS 2 7 2; 
RUN;

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables, and White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

PROC SORT DATA = data; 
BY tsvarstr tsvrunit; 
RUN; 
 
PROC REGRESS DATA = data FILETYPE=SAS DESIGN=WR; 
NEST tsvarstr tsvrunit; 
WEIGHT rakedw0; 
SUBGROUP racehpr2 srsex; 
LEVELS 7 2; 
REFLEVEL racehpr2=6 srsex=1; 
MODEL bmi = racehpr2 srsex srage; 
RUN;

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). As SUDAAN and Stata require the dependent variables coded as 0 and 1 for logistic regression, a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

DATA newdata; 
SET data; 
IF astcur=1 THEN ast=1; 
ELSE IF astcur=2 THEN ast=0; 
RUN;   

PROC RLOGIST data = newdata FILETYPE=SAS DESIGN=WR; 
WEIGHT rakedw0; 
NEST tsvarstr tsvrunit; 
SUBGROUP racehpr2 srsex; 
LEVELS 7 2; 
REFLEVEL racehpr2 = 6 srsex = 1; 
MODEL ast = racehpr2 srsex srage; 
RUN;

SPSS

Mean Calculation
In the following sample code, the distribution of BMI (bmi) is examined by race (racehpr2) and by the interaction between race and sex (racehpr2*srsex).

*Sample design specification step*
*   Analysis Preparation Wizard. 

CSPLAN ANALYSIS 
 /PLAN FILE='\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /PLANVARS ANALYSISWEIGHT=RAKEDW0 
 /PRINT PLAN 
 /DESIGN STRATA= TSVARSTR CLUSTER=TSVRUNIT
 /ESTIMATOR TYPE=WR. 
  
*Analysis* 
*   Complex Samples Descriptives. 
CSDESCRIPTIVES 
 /PLAN FILE = '\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /SUMMARY VARIABLES = bmi 
 /SUBPOP TABLE = racehpr2 DISPLAY=LAYERED 
 /MEAN 
 /STATISTICS SE CV POPSIZE CIN (95) 
 /MISSING SCOPE = ANALYSIS CLASSMISSING = EXCLUDE. 
*   Complex Samples Descriptives. 
CSDESCRIPTIVES 
 /PLAN FILE = '\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /SUMMARY VARIABLES = bmi 
 /SUBPOP TABLE = racehpr2 BY srsex DISPLAY=LAYERED 
 /MEAN 
 /STATISTICS SE CV POPSIZE CIN (95) 
 /MISSING SCOPE = ANALYSIS CLASSMISSING = EXCLUDE.

In SPSS, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Frequency Calculation
In the following sample code, the percentage of people who currently have asthma (astcur) is examined by race (racehpr2) and by race and sex (racehpr2*srsex).

*Sample design specification step*
*   Analysis Preparation Wizard. 

CSPLAN ANALYSIS 
 /PLAN FILE='\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /PLANVARS ANALYSISWEIGHT=RAKEDW0 
 /PRINT PLAN 
 /DESIGN STRATA= TSVARSTR CLUSTER=TSVRUNIT
 /ESTIMATOR TYPE=WR. 
 
*Analysis*  
*   Complex Samples Crosstabs. 
CSTABULATE 
 /PLAN FILE = '\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /TABLES VARIABLES = astcur BY racehpr2 
 /SUBPOP TABLE = srsex DISPLAY=LAYERED 
 /CELLS POPSIZE COLPCT 
 /STATISTICS SE CV CIN (95) 
 /MISSING SCOPE = TABLE CLASSMISSING = EXCLUDE.

In SPSS, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

Linear Regression
In the following sample code, Body Mass Index (bmi) is examined in relation to race (racehpr2), sex (srsex), and age (srage) while controlling for each other. Note that racehpr2 and srsex are categorical variables, and White (racehpr2=6) and Male (srsex=1) are used as their reference categories.

*Sample design specification step*
* Analysis Preparation Wizard. 

CSPLAN ANALYSIS 
 /PLAN FILE='\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /PLANVARS ANALYSISWEIGHT=RAKEDW0 
 /PRINT PLAN 
 /DESIGN STRATA= TSVARSTR CLUSTER=TSVRUNIT
 /ESTIMATOR TYPE=WR. 
 
*Analysis*   
RECODE 
srsex 
 (1=2) (2=1) INTO newsex. 
VARIABLE LABELS newsex 'NEWSEX'. 
EXECUTE. 
 
RECODE  
racehpr2 
 (1=1) (2=2) (3=3) (4=4) (5=5) (6=7) (7=6) INTO newrace. 
 
VARIABLE LABELS newrace 'NEWRACEHPR2'. 
EXECUTE. 

* Complex Samples General Linear Model. 
CSGLM bmi BY newsex newrace WITH srage 
 /PLAN FILE = '\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /MODEL newsex newrace srage 
 /INTERCEPT INCLUDE=YES SHOW=YES 
 /STATISTICS PARAMETER SE CINTERVAL 
 /PRINT SUMMARY VARIABLEINFO SAMPLEINFO 
 /TEST TYPE=F PADJUST=LSD 
 /MISSING CLASSMISSING=EXCLUDE 
 /CRITERIA CILEVEL=95.

In SPSS, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

SPSS CSGLM automatically chooses the highest value of the categorical independent variables as the reference groups. Therefore, recoding categorical variables is necessary to select the desired reference categories if they are different than the categories with the highest values.
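The recodes above follow a simple rule: swap the desired reference category with the highest code so that SPSS picks it up as the reference. A minimal Python sketch of that rule follows; make_reference_highest is a hypothetical helper name, and the code ranges (1 to 7 for racehpr2, 1 to 2 for srsex) are taken from the recodes shown in the sample code.

```python
def make_reference_highest(codes, reference):
    """Return a recode map that swaps `reference` with the highest code."""
    codes = list(codes)
    highest = max(codes)
    mapping = {c: c for c in codes}
    # Swap the two codes; every other code maps to itself.
    mapping[reference], mapping[highest] = highest, reference
    return mapping

race_map = make_reference_highest(range(1, 8), reference=6)  # White = 6
sex_map = make_reference_highest([1, 2], reference=1)        # Male = 1
print(race_map)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 7, 7: 6}
print(sex_map)   # {1: 2, 2: 1}
```

The printed maps match the RECODE statements in the sample code: (6=7) (7=6) for race and (1=2) (2=1) for sex.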

Logistic Regression
In the following sample code, current asthma status (astcur) is examined, controlling for race (racehpr2), sex (srsex), and age (srage). As SUDAAN and Stata require the dependent variables coded as 0 and 1 for logistic regression, a new dependent variable ast is created and assigned 1 where astcur=1 (“Current asthma”) and 0 where astcur=2 (“No current asthma”). The category “No current asthma” is used as the reference in the analysis.

*Sample design specification step*
* Analysis Preparation Wizard. 

CSPLAN ANALYSIS 
 /PLAN FILE='\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan' 
 /PLANVARS ANALYSISWEIGHT=RAKEDW0 
 /PRINT PLAN 
 /DESIGN STRATA= TSVARSTR CLUSTER=TSVRUNIT
 /ESTIMATOR TYPE=WR. 
 
*Analysis*  

RECODE  srsex   
 (1=2) (2=1) INTO newsex.  
VARIABLE LABELS newsex 'NEWSEX'.  
EXECUTE. 
  
RECODE   racehpr2   
 (1=1) (2=2) (3=3) (4=4) (5=5) (6=7) (7=6) INTO newrace.  
VARIABLE LABELS newrace 'NEWRACEHPR2'.  
EXECUTE. 
  
* Complex Samples Logistic Regression.  

CSLOGISTIC astcur BY newsex newrace WITH srage  
 /PLAN FILE = '\\PATH FOR COMPLEX SURVEY PLAN FILE\FILENAME.csaplan'  
 /MODEL newsex newrace srage  
 /INTERCEPT INCLUDE=YES SHOW=YES  
 /STATISTICS PARAMETER EXP SE CINTERVAL  
 /TEST TYPE=F PADJUST=LSD  
 /ODDSRATIOS FACTOR=[newsex]  
 /ODDSRATIOS FACTOR=[newrace]  
 /ODDSRATIOS COVARIATE=[srage]  
 /MISSING CLASSMISSING=EXCLUDE  
 /CRITERIA MXITER=100 MXSTEP=5 PCONVERGE=[1e-006 RELATIVE]  
  LCONVERGE=[0] CHKSEP=20 CILEVEL=95   
/PRINT SUMMARY VARIABLEINFO SAMPLEINFO.

In SPSS, the sample design specification step should be included before conducting any analysis.

When using concatenated data across adults, adolescents, and/or children, use tsvrunit; when using separate data files, delete the commands associated with tsvrunit.

SPSS CSLOGISTIC automatically chooses the highest value of the categorical variable as the reference group for the independent variables as well as the dependent variable. Therefore, recoding categorical variables is necessary to select the desired reference categories if they are different than the categories with highest values.

Sample Code to Pool Multiple Data Cycles

For background on pooling two or more data files, see the page on Pooling CHIS Data.

The following Stata, SAS, and R code shows how to combine the CHIS 2017 and 2018 data files and create weights that account for the multiple files. In addition to the sample code, we provide a SAS macro for users interested in analyzing CHIS data in SAS. The macro performs the weight adjustments automatically and generates other information needed in later analysis. Download the SAS macro or refer to the video tutorial.
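Before the package-specific code, the weight layout that all three versions build can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the CHIS tooling: pooled_weights is a hypothetical helper, and the toy record uses 3 replicate weights instead of the 80 in a CHIS file. Each record's weights are divided by the number of pooled cycles; a record keeps its own replicate weights in its cycle's block and repeats its final weight (rakedw0) in every other block.

```python
def pooled_weights(rakedw, cycle_index, n_cycles=2):
    """Build [fnwgt0, fnwgt1, ...] for one record.

    rakedw: [rakedw0, rakedw1, ..., rakedwR] for the record.
    cycle_index: 0 for the first pooled cycle, 1 for the second, and so on.
    """
    n_rep = len(rakedw) - 1
    base = rakedw[0] / n_cycles
    # Start with the halved final weight everywhere ...
    fnwgt = [base] * (1 + n_cycles * n_rep)
    # ... then overwrite this cycle's block with its own replicate weights.
    start = 1 + cycle_index * n_rep
    for j in range(1, n_rep + 1):
        fnwgt[start + j - 1] = rakedw[j] / n_cycles
    return fnwgt

# Toy record with 3 replicate weights (a CHIS file has 80).
record = [100.0, 90.0, 110.0, 105.0]
print(pooled_weights(record, cycle_index=0))
# [50.0, 45.0, 55.0, 52.5, 50.0, 50.0, 50.0]
print(pooled_weights(record, cycle_index=1))
# [50.0, 50.0, 50.0, 50.0, 45.0, 55.0, 52.5]
```

With 80 replicate weights and two cycles, the same function returns 161 values, corresponding to fnwgt0 through fnwgt160 in the code below.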

Stata

log using "folder location\data_step.log", replace

***CHIS 2017 Adult data****
use "your folder location\CHIS 2017 data"

gen year=2017

gen fnwgt0=rakedw0/2

for new fnwgt1-fnwgt160: gen X=0

foreach i of numlist 1/80{
            local j=`i'-0
            replace fnwgt`i'=rakedw`j'/2

}

foreach i of numlist 81/160{
            replace fnwgt`i'=rakedw0/2

}

save adult17 , replace

***CHIS 2018 Adult data****
use "folder location\CHIS 2018 data"

gen year=2018

gen fnwgt0=rakedw0/2

for new fnwgt1-fnwgt160: gen X=0

foreach i of numlist 1/80{           
            replace fnwgt`i'=rakedw0/2

}

foreach i of numlist 81/160{
            local j=`i'-80
            replace fnwgt`i'=rakedw`j'/2
}

/*this step concatenates the data files*/
append using adult17 

save "folder location\combined.dta", replace

svyset [pw=fnwgt0], jkrw(fnwgt1-fnwgt160, multiplier(1)) vce(jack) mse

R

library(openxlsx)
library(haven)
library(dplyr)
library(survey)
library(tidyverse)

# FOLLOWING CODE ONLY WORKS FOR SINGLE YEAR DATA SET FILES NOT TWO-YEARS

# Please download all SAS/Stata files you want to use and put them all into one folder location:
folder_location <- "YOUR DIRECTORY/FOLDER LOCATION"

# Copy and paste below for as many years as you want to pool together and change respective values:
# Change "read_sas" to "read_dta" and ".sas7bdat" to ".dta" if want to pool Stata files together
adult17 <- read_sas(paste0(folder_location,"/adult_2017.sas7bdat")) %>% 
  rename_all(tolower) %>%
  mutate(year = 2017)

adult18 <- read_sas(paste0(folder_location,"/adult_2018.sas7bdat")) %>% 
  rename_all(tolower) %>%
  mutate(year = 2018)

# Put all imported data sets into this list

my_chis_list <- list(adult17, adult18)

# DO NOT CHANGE ANYTHING INSIDE THE FUNCTION UNLESS ABSOLUTELY NECESSARY

pooling <- function(chis_list) {
  
  for(i in 1:length(chis_list)) {
    
    chis_list[[i]][ , paste0("fnwgt", 0:80)] <- chis_list[[i]][ , paste0("rakedw", 0:80)]
    
    chis_list[[i]] <- 
      chis_list[[i]] %>% 
      rename_at(vars(paste0("fnwgt", c(1:80))), ~ paste0("fnwgt", c(1:80) + 80*(i-1))) 
    
  }
  
  chis_list <- chis_list %>%
    map(. %>% mutate(across(everything(), .fns = as.character)))
  
  merged <- 
    bind_rows(chis_list) %>% 
    data.frame(., row.names = NULL)
  
  merged <-
    merged  %>% 
    mutate_all(type.convert, as.is = TRUE)
  
  merged <-
    merged  %>% 
    mutate(across(starts_with("fnwgt"), ~ ifelse(is.na(.), fnwgt0, .)))
  
  merged <- 
    merged %>% 
    mutate(across(starts_with("fnwgt"), ~ ./length(chis_list)))
  
  merged
  
}

# Store pooled data, resulting data will either be in numeric or character format (no factors at all).

combined <- pooling(my_chis_list)

# Set up survey design for analysis

chis_design <- svrepdesign(data = combined,
                           weights = ~ fnwgt0,
                           repweights = "fnwgt[1-9]",
                           type = "other",
                           scale = 1,
                           rscales = 1,
                           mse = TRUE)

SAS/SUDAAN

data combined; /*this step concatenates the data files*/
    set libname.chis_2017 (in=in17) libname.chis_2018 (in=in18);
    
    if in17 then year=2017;
    else if in18 then year=2018;
    
    ***Create new weight variables;
    fnwgt0 = rakedw0/2;
    array a_origwgts[80] rakedw1-rakedw80;
    array a_newwgts[160] fnwgt1-fnwgt160;
    do i = 1 to 80;
        if year=2017 then do;
            a_newwgts[i] = a_origwgts[i]/2;
            a_newwgts[i+80] = rakedw0/2;
        end;
        else if year=2018 then do;
            a_newwgts[i]    = rakedw0/2;
            a_newwgts[i+80] = a_origwgts[i]/2;
        end;

    end;
run;

proc surveyfreq data = combined varmethod=jackknife;
    weight fnwgt0;
    repweight fnwgt1-fnwgt160/jkcoefs=1;
    table ins;
run;

Jackknife coefficients are necessary for accurate variance calculations, and jackknife coefficients of 1 in SAS will produce the same variance estimates as SUDAAN. However, SAS V.9.2 (TS1M0) and earlier will not accept a value of 1; as a substitute, 0.9999 can be entered. Without this specification, the jackknife coefficients default to [(# replicate weights - 1)/# replicate weights]; for CHIS, this would be [(80 - 1)/80] = 0.9875.
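The effect of the coefficient can be sketched in Python. The snippet assumes the usual replicate-variance form, in which the variance is the coefficient times the sum of squared deviations of the replicate estimates from the full-sample estimate; the function names and the replicate estimates are hypothetical, for illustration only.

```python
def default_jk_coefficient(n_replicates):
    """Default described above: (# replicate weights - 1) / # replicate weights."""
    return (n_replicates - 1) / n_replicates

def replicate_variance(full_estimate, replicate_estimates, coefficient=1.0):
    """Coefficient times the sum of squared deviations from the full estimate."""
    return coefficient * sum((r - full_estimate) ** 2 for r in replicate_estimates)

# Made-up estimates for illustration only; not CHIS results.
full = 10.0
replicates = [9.0, 11.0, 10.5, 9.5]

print(default_jk_coefficient(80))  # 0.9875
v_one = replicate_variance(full, replicates, coefficient=1.0)
v_default = replicate_variance(full, replicates, coefficient=default_jk_coefficient(80))
print(v_one, v_default)  # 2.5 2.46875
```

Leaving the coefficient at its default therefore shrinks every variance estimate by the same factor relative to a coefficient of 1.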

Source and Constructed Variables Data Dictionaries

The dictionaries below contain descriptions of all variables in the entire source data sets and include those variables not available in the Public Use Files.

Select a year below. You must be logged in to download files.

2022 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2022-data-dictionary-source-adult-oct-2023-1.pdf	Protected File cv2021-22_adult_source_final.pdf
Adolescent	Protected File chis-2022-data-dictionary-source-teen-oct-2023.pdf	Protected File cv2021-22_teen_source_final.pdf
Child	Protected File chis-2022-data-dictionary-source-child-oct-2023.pdf	Protected File cv2021-22_child_source_final.pdf

2021 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2021-data-dictionary-adult-oct-2023_1.pdf	Protected File cv2021-22_adult_source_final.pdf
Adolescent	Protected File chis-2021-data-dictionary-teen-oct-2023.pdf	Protected File cv2021-22_teen_source_final.pdf
Child	Protected File chis-2021-data-dictionary-child-oct-2023.pdf	Protected File cv2021-22_child_source_final.pdf

2020 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2020-data-dictionary-adult-oct-2023.pdf	Protected File cv2019-20_adult_source_final.pdf
Adolescent	Protected File chis-2020-data-dictionary-teen-oct-2023.pdf	Protected File cv2019-20_teen_source_final.pdf
Child	Protected File chis-2020-source-data-dictionary-child-oct-2023.pdf	Protected File cv2019-20_child_source_final.pdf

2019 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2019-data-dictionary-adult-oct-2023.pdf	Protected File cv2019-20_adult_source_final.pdf
Adolescent	Protected File chis-2019-data-dictionary-teen-oct-2023.pdf	Protected File cv2019-20_teen_source_final.pdf
Child	Protected File chis-2019-data-dictionary-child-oct-2023.pdf	Protected File cv2019-20_child_source_final.pdf

2018 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2018-data-dictionary-adult-oct-2023.pdf	Protected File cv2017-18_adult_source.pdf
Adolescent	Protected File chis-2018-data-dictionary-teen-oct-2023.pdf	Protected File cv2017-18_teen_source.pdf
Child	Protected File chis-2018-data-dictionary-child-oct-2023.pdf	Protected File cv2017-18_child_source.pdf

2017 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2017-data-dictionary_source_adult-oct-2023.pdf	Protected File cv2017-18_adult_source.pdf
Adolescent	Protected File chis-2017-data-dictionary_source_teen-oct-2023.pdf	Protected File cv2017-18_teen_source.pdf
Child	Protected File chis-2017-data-dictionary_source_child-oct-2023.pdf	Protected File cv2017-18_child_source.pdf

2016 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2016-source-data-dictionary-adult-oct-2018_v2.pdf	Protected File cv2015-16_adult_source.pdf
Adolescent	Protected File chis-2016-source-data-dictionary-teen-oct-2018.pdf	Protected File cv2015-16_teen_source.pdf
Child	Protected File chis-2016-source-data-dictionary-child-oct-2018.pdf	Protected File cv2015-16_child_source.pdf

2015 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2015-source-data-dictionary-adult-oct-2018_v2.pdf	Protected File cv2015-16_adult_source.pdf
Adolescent	Protected File chis-2015-source-data-dictionary-teen-oct-2018.pdf	Protected File cv2015-16_teen_source.pdf
Child	Protected File chis-2015-source-data-dictionary-child-oct-2018.pdf	Protected File cv2015-16_child_source.pdf

2014 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2014-source-data-dictionary-adult_oct-2018_v2.pdf	Protected File cv2013-14_adult_source.pdf
Adolescent	Protected File chis-2014-source-data-dictionary-teen_oct-2018.pdf	Protected File cv2013-14_teen_source.pdf
Child	Protected File chis-2014-source-data-dictionary-child_oct-2018.pdf	Protected File cv2013-14_child_source.pdf

2013 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2013-source-data-dictionary-adult_may-2018_v2.pdf	Protected File cv2013-14_adult_source.pdf
Adolescent	Protected File chis-2013-source-data-dictionary-teen_may-2018.pdf	Protected File cv2013-14_teen_source.pdf
Child	Protected File chis-2013-source-data-dictionary-child_may-2018.pdf	Protected File cv2013-14_child_source.pdf

2012 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2012-source-adult-may-2024.pdf	Protected File cv2011-12_adult_source.pdf
Adolescent	Protected File chis-2012-source-teen-may-2024.pdf	Protected File cv2011-12_teen_source.pdf
Child	Protected File chis-2012-source-child-may-2024.pdf	Protected File cv2011-12_child_source.pdf

2011 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File chis-2011-source-adult-may-2024.pdf	Protected File cv2011-12_adult_source.pdf
Adolescent	Protected File chis-2011-source-teen-may-2024.pdf	Protected File cv2011-12_teen_source.pdf
Child	Protected File chis-2011-source-child-may-2024.pdf	Protected File cv2011-12_child_source.pdf

2009 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File 2009_adult_dd_v2_3.pdf	—
Adolescent	Protected File 2009_adol_dd.pdf	—
Child	Protected File 2009_child_dd.pdf	—

2007 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File 2007_adult_dd_v2_2.pdf	—
Adolescent	Protected File 2007_adol_dd.pdf	—
Child	Protected File 2007_child_dd.pdf	—

2005 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File 2005_adult_dd_v2_2.pdf	—
Adolescent	Protected File 2005_adol_dd.pdf	—
Child	Protected File 2005_child_dd.pdf	—

2003 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File 2003_adult_dd_v2_3.pdf	—
Adolescent	Protected File 2003_adol_dd.pdf	—
Child	Protected File 2003_child_dd.pdf	—

2001 Source and Constructed Variables Data Dictionaries

Data Dictionary for:	Source	Constructed Variables
Adult	Protected File 2001_adult_dd_v2_1.pdf	—
Adolescent	Protected File 2001_adol_dd.pdf	—
Child	Protected File 2001_child_dd.pdf	—
