California Health Interview Survey (CHIS)
We have compiled a list of the frequently asked questions (FAQs) to help answer any inquiries you may have. These FAQs may be of use to researchers, academics and other data experts in search of quick answers to technical questions on how to use CHIS. For a more general explanation of CHIS, please visit the About CHIS section.
If you cannot find the answer you are looking for, please do not hesitate to contact us.
General CHIS Data Information:
The survey is conducted on a continuous basis and data is released in a variety of formats. Not all questions were asked in all years. In addition, there are separate questionnaires for adults, adolescents and children. To see questionnaires for each survey cycle, visit our Questionnaires page in the Survey Design and Methods section.
The first place to start is the Questionnaires. In addition, the data dictionaries contain informative notes and are available for both Public Use Files (PUFs) and confidential data. For all years, the data dictionaries include the raw frequencies for nearly every variable. We also have a constructed variables document available on the PUF website which outlines how the constructed variables were created. Should you continue to have questions after that, let us know and we'll do our best to help you.
Source variables are variables based on a single question asked during the CHIS interview. These include confidential variables, as well as those found in the Public Use Files. Source variables are generally labeled with two letters followed by numbers.
Constructed variables are variables that were put together by the CHIS Data Production team. Constructed variables are usually based on multiple questions asked during the interview. In general, constructed variables can be identified by their variable names, which are acronyms and/or abbreviations.
Variables that are imputed have had the missing values replaced by a value that was generated based on a complicated algorithm. Detailed information about imputation methods can be found on the CHIS Design and Methods website.
Not every question is asked of every respondent. Many questions are limited to a particular group through skip patterns. These skip patterns create a universe of respondents eligible to answer a question. For example, the universe for a question on prostate screening in the past year is limited to 'all adult men age 40 years and older.'
As part of the survey, a screener script is read that includes: 1) an introduction of the interviewer, UCLA, and the survey sponsors; 2) an explanation of purpose and importance of the survey; 3) statements describing the confidential and voluntary nature of the survey; and 4) statements explaining the respondent's right to skip any questions and to end the interview at any time.
The (potential) respondent is then asked if they wish to participate in the survey. The respondent implies consent if they choose to continue with the survey.
When citing the survey, please refer to it as the “California Health Interview Survey” and the abbreviated version as “CHIS,” with appropriate survey years (“CHIS 2021,” “CHIS 2020”).
When citing AskCHIS data, refer to it as the “California Health Interview Survey” (“Source: 2021 California Health Interview Survey”).
See citation examples below for Public Use Files, Source Data Files, AskCHIS, AskCHIS NE, and Health Profiles:
By correctly citing CHIS in your research and/or publications, you enable the UCLA Center for Health Policy Research to promote your work. Specifically, we search for recent CHIS citations and keywords that may lead us to new research or articles containing CHIS data. We can also list that research on the UCLA CHPR website, read by funders, policymakers, and fellow researchers.
Public Use Files
Some variables are deemed to be either sensitive or identifiable or both. The data for these variables cannot be disclosed in the Public Use Files.
There is a distinction between identifiable and sensitive. Sensitive variables contain information that would likely harm or embarrass the individual if disclosed. Both identifiable and sensitive variables should be excluded from public use files according to the National Center for Health Statistics.
As an additional layer of protection for the identity of the respondents, the variable that links the Adult, Adolescent, and Child files has been removed from the PUF. This variable is available only through the Data Access Center (DAC). However, some information from the Adult file has been added to the Child and Adolescent records (e.g., race, poverty status).
We suggest you first try AskCHIS, an easy-to-use online query system that enables you to analyze data quickly, easily and free of charge. There is also a multitude of supporting documentation on our website, including data dictionaries. We offer limited technical assistance on how variables were constructed and the general structure of the dataset. We will not, however, assist in the interpretation of the data. There are online resources available to help in this regard.
Local Health Departments
Counties can receive a Local Health Department file. This file is essentially the PUF customized for your county or group of counties. Contact the CHIS Data Access Center for information about how to receive this file.
Working with Confidential Data
You'll need to apply to use the confidential data through the Data Access Center (DAC). The application process consists of submitting a few forms and some supplementary documentation to the CHIS Data Disclosure Review Committee (DDRC). More information on the DAC, including application materials and variable lists, can be found after logging in to the DAC website.
The application is reviewed by the Data Disclosure Review Committee (DDRC). This committee consists of representatives from the California Department of Public Health, and the UCLA Center for Health Policy Research. The committee reviews the application to make sure the project: a) is consistent with the purposes of CHIS; b) is feasible; and c) will not threaten the confidentiality of the respondent. Projects that satisfy these requirements are recommended for approval.
The fees associated with accessing the data help cover the costs of data production, committee review, providing a customized data file, and initial technical assistance. See the Data Access Center Application website for more information about fees.
No, confidential data files do not leave the Data Access Center at UCLA. Data must be analyzed remotely by sending in programming code (SAS, SPSS, Stata, or R) or by using programming/consulting services to assist in analysis. Output is returned to you after disclosure review.
We review all output for small (raw frequency) cell sizes and make release decisions accordingly. We do not generate frequency/counts or cross-tabulations at the sub-county level due to confidentiality and weighting concerns. Small cell values are defined as less than 3 (unweighted) and less than 500 (weighted). In cross-tabulations, complementary cell values are also suppressed to avoid back-calculation.
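The small-cell thresholds above can be sketched as a simple check. This is a minimal illustration, not the DAC's actual review procedure: the function name and the example cells are hypothetical, and it assumes that falling under either threshold is enough to mark a cell as small. It does not attempt complementary suppression.

```python
UNWEIGHTED_MIN = 3    # raw respondent counts below this are treated as small
WEIGHTED_MIN = 500    # weighted totals below this are treated as small


def is_small_cell(unweighted_n, weighted_n):
    """Return True when a cell falls under either threshold (an assumption)."""
    return unweighted_n < UNWEIGHTED_MIN or weighted_n < WEIGHTED_MIN


# Hypothetical cross-tabulation cells: (unweighted count, weighted total).
cells = {
    "cell_a": (2, 1200.0),
    "cell_b": (15, 450.0),
    "cell_c": (40, 9800.0),
}

flagged = {name: is_small_cell(n, w) for name, (n, w) in cells.items()}
```

In this toy table, `cell_a` and `cell_b` would be held back and `cell_c` would pass the primary check.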
A dataset that contains any confidential variable is considered confidential and the more information contained in that dataset, the greater the risk to the respondent's confidentiality. In an effort to reduce that risk as much as possible, we require that customized datasets only contain those variables relevant to a particular project.
Yes. We understand that you occasionally may neglect to select particular variables that are important to your project. Variables can be added by requesting the specific variables and providing a brief justification for adding the selected variables. For example, if your project is about the effect of household smoking on asthma and you did not select "age at asthma diagnosis" we will add that variable at your request. The process to specifically add a variable to a dataset is called an Additional Variable Request, which can be submitted via the DAC online portal. Additional Variable Requests are reviewed and approved by the DDRC.
Yes, but there is a limited scholarship available to help cover part of the DAC fees typically paid by researchers. Contact the Data Access Center for more information.
We primarily support SAS, SPSS, Stata, R, and ArcGIS. Other software is supported to a more limited extent. Please contact us about your software needs before submitting a DAC application. Even though you may have paid for other software, it cannot be put on our network, which is limited to the software we are licensed to use.
You can first try AskCHIS, an online query system that enables you to analyze data quickly, easily, and free of charge. We offer limited technical assistance in regard to understanding how variables are constructed and the general structure of the dataset. There is also a multitude of supporting documentation on our website.
No. If you are not affiliated with UCLA, you should consult with your local IRB to determine the appropriate review of your proposed research, but you are not required to submit proof of IRB approval to the DAC. Most likely, you’ll be able to get an “Exemption,” but check with your local IRB, as you may need IRB approval from your institution. If you are affiliated with UCLA, you should NOT submit an application to the UCLA Human Subjects Protection Committee (IRB), as your project is covered by IRB#11-002227, which the UCLA South General IRB has approved to allow the DAC to conduct analyses on confidential CHIS data. The exception is if you, or anyone on your project team, are CHIS staff with access to confidential CHIS data; in that case you are required to obtain UCLA IRB approval and submit a copy of the approval with your DAC application.
No. You can use AskCHIS, an online query system that enables you to analyze data quickly, easily and free of charge. It contains weighted totals and percentages for nearly all variables in the CHIS datasets. You can get county or multi-county level cross-tabulated estimates.
You can use our Data Estimates Service. This service can, for a fee, provide you with the weighted percentages and population totals at sub-state geographic levels for almost any variable or set of variables. Unreliable estimates (as a result of variables with very small sample sizes) will be suppressed. Please email us your interest. AskCHIS also provides subsetting for a limited set of variables.
This depends largely on your purpose and how and where you will report or use the estimates. CHIS uses the coefficient of variation (CV) and the confidence interval (CI), both calculated from the standard error of the estimate, to express the sampling variance (or “sampling error”) around an estimate. The calculation of the standard error of the estimate takes into account the complex sample survey design in CHIS.
- The CV indicates whether or not a point estimate (e.g., a mean, proportion, total) is statistically stable relative to its standard error. The CV is the ratio of the standard error to the estimate, showing the proportion of the estimate that reflects sampling variability: CV(p) = se(p) / p. In AskCHIS, estimates with a CV greater than 30% are "flagged" as statistically unstable with a red asterisk (*). While there are no absolute standards for judging how much variability is too much, CHIS generally recommends against reporting or relying on estimates that are this statistically unstable.
- Users will sometimes use a CI to see if an estimate is "statistically greater" than zero (or some other meaningful number) or if two estimates of subsamples from the same data are different from each other. Such uses are similar to "statistical significance testing" but not identical (Schenker & Gentleman, 2012). AskCHIS and AskCHIS NE report 95% confidence intervals.
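The CV rule and the 95% confidence interval can be sketched as follows. The estimates and standard errors are hypothetical, the function names are illustrative, and the interval uses a plain normal approximation (estimate plus or minus 1.96 standard errors), which is an assumption and may not match how AskCHIS computes its intervals.

```python
CV_FLAG_THRESHOLD = 0.30  # estimates with a CV above 30% are flagged


def coefficient_of_variation(estimate, standard_error):
    """CV(p) = se(p) / p."""
    return standard_error / estimate


def is_flagged_unstable(estimate, standard_error):
    """Apply the CV > 30% guideline to a single point estimate."""
    return coefficient_of_variation(estimate, standard_error) > CV_FLAG_THRESHOLD


def normal_ci_95(estimate, standard_error):
    """Normal-approximation 95% interval (an assumption, see lead-in)."""
    half_width = 1.96 * standard_error
    return estimate - half_width, estimate + half_width


# Hypothetical (estimate, standard error) pairs, for illustration only.
stable = (0.12, 0.02)     # CV of about 17%
unstable = (0.02, 0.008)  # CV of about 40%

stable_flag = is_flagged_unstable(*stable)
unstable_flag = is_flagged_unstable(*unstable)
stable_ci = normal_ci_95(*stable)
```

The standard error passed in must already account for the complex sample design; this sketch only applies the ratio and the threshold.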
Reasonable exceptions to the CV > 30% guideline may apply to binary/dichotomous variables when the magnitude of the estimate is very small or very large (i.e., below 10% or above 90%). CHIS calculates the CV for binary variables by using the category with the smallest value as the denominator, which means that the CV will be the same regardless of which estimate you choose to report for the binary indicator of interest. This is a conservative approach that minimizes the potential for reporting unstable estimates.
The calculation can also produce a high CV even when the sample size is relatively large. For example, AskCHIS shows that 2.6% of 0–5 year olds in California are uninsured and 95.1% have a usual source of care. Both estimates are flagged as statistically unstable even though the sample of this population is large, suggesting that they are not reportable. This is a limitation of the CV, which increases as the denominator approaches 0. Page 10 of Lee et al. (2007) contains a detailed explanation of this issue.
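A short sketch shows why a binary estimate near 95% can still be flagged. It assumes the binary CV divides the standard error by the smaller of the two category proportions, as described above. The standard error used here is hypothetical and is not the AskCHIS value for the usual-source-of-care estimate.

```python
import math


def binary_cv(proportion, standard_error):
    """CV for a binary variable, using the smaller category as denominator."""
    smaller_category = min(proportion, 1.0 - proportion)
    return standard_error / smaller_category


# Hypothetical standard error paired with the 95.1% figure from the text.
p_has_usual_source = 0.951
se = 0.016

cv_reported = binary_cv(p_has_usual_source, se)        # uses 4.9% as denominator
cv_complement = binary_cv(1.0 - p_has_usual_source, se)
naive_cv = se / p_has_usual_source                      # about 1.7%, would not be flagged

same_either_way = math.isclose(cv_reported, cv_complement)
flagged = cv_reported > 0.30
```

With these illustrative numbers the conservative CV is roughly 33%, so the estimate is flagged even though the naive ratio is under 2%.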
Under such circumstances (dichotomous variables with low or high values) it may be reasonable to report the estimate if the CI range is acceptable to data users, e.g., reporting estimates with confidence limits within +/- 5%. It may help to consult the statistical standards of your agency, company, or professional association or the standards used in the journals in which you hope to publish.
If, however, you are in doubt or do not have another standard to use we recommend deferring to the CV flags that caution against reporting estimates with high CVs, unless CI ranges are appropriate for your purpose.
Beginning in 2011, CHIS data have been collected continuously, enabling the release of single-year data files (Public Use Files and AskCHIS) to provide users with timely data. However, CHIS is designed to provide stable estimates at the county level over a full two-year cycle of data collection. Thus single-year county-level estimates (or estimates for small subgroups) may not be stable. Trying to further analyze a sub-population within county with single-year data may lead to unstable estimates. When this becomes a problem, we strongly recommend that you combine two or more years of CHIS data (e.g., pooling 2019 and 2020) to help stabilize these data. To combine years in AskCHIS, under the “Years” tab choose the years you want to combine and then select the “Pool” option.
With the shift to continuous data collection in 2011, CHIS has developed one-year weights to provide estimates for each year of data collection. In general, estimates obtained by combining multiple years (such as 2011 and 2012) will be close to, but not necessarily identical to, estimates produced from a two-year data file (such as 2011–2012) obtained prior to August 2015.
Part of this is due to differences in the population control totals used to weight one-year data and two-year data. For both data sets, control totals were created from California Department of Finance’s (DOF) Population Estimates and Population projections (for more information, refer to CHIS Methodology Report 5: Weight and Variance Estimation). The CHIS 2011–2012 two-year weights used the population control totals that represent 2012; since data from respondents were collected over this two-year period, the estimates represent the experience of respondents in this timeframe. The 2011 and 2012 CHIS one-year weights, on the other hand, are representative of the population of California in 2011 and 2012, respectively, and use the population control totals unique to 2011 and 2012. Pooling 2011 and 2012 together will provide a population-weighted average for the 2011–2012 data collection period.
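The arithmetic behind a population-weighted average of two one-year estimates can be sketched as below. The proportions and population totals are made up for illustration, and the function name is hypothetical; the actual pooled estimate comes from applying the one-year weights to the combined records.

```python
import math


def pooled_proportion(yearly_estimates):
    """Population-weighted average of one-year proportions.

    yearly_estimates: list of (proportion, weighted population total) pairs.
    """
    total_population = sum(pop for _, pop in yearly_estimates)
    weighted_sum = sum(p * pop for p, pop in yearly_estimates)
    return weighted_sum / total_population


# Hypothetical one-year estimates for a subgroup whose size differs by year.
year_one = (0.10, 1_000_000)
year_two = (0.14, 3_000_000)

pooled = pooled_proportion([year_one, year_two])
simple_average = (year_one[0] + year_two[0]) / 2
pooled_is_13_percent = math.isclose(pooled, 0.13)
```

Because the second year represents three times as many people, the pooled value (13%) sits closer to that year's estimate than the simple average (12%) does.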
CHIS staff periodically update estimates. When large discrepancies are discovered, we send out Data Advisories to affected users. You may contact the CHIS Data Access Center if you have additional questions.
Because it is not feasible to survey every household in California, CHIS uses a probability sampling method in an attempt to accurately represent the population of adults, adolescents, and children living within households in California. This means that the CHIS sample excludes institutionalized group quarter residents (e.g., residents of correctional institutions such as jails, or of nursing homes) and non-institutionalized group quarter residents (e.g., residents of college dormitories or military quarters on base).
And because CHIS is not a simple random sample, in order to make CHIS representative of California, population weights must be applied to produce accurate population estimates and totals.
Weights are variables in the data file that can be applied to the sample data to produce "weighted" population estimates. The weight variables are constructed through a complex and iterative process. There are separate weight variables for the adult, child and adolescent Public Use Files, all with the same name, RAKEDW0. These final survey weights (RAKEDW0) are used to calculate statewide estimates that represent the population, not the sample. The population totals were obtained from the California Department of Finance and the American Community Survey.
Since the weights are available as variables in the data file, they should be applied to the sample data to produce accurate weighted population estimates. There are separate weight variables for the adult, child, and adolescent files, all with the same name, RAKEDW0. Weights do two things: 1) adjust point estimates (i.e., proportions, means, regression coefficients), which is accomplished by RAKEDW0, and 2) adjust variance estimates and standard errors, accomplished either by using replicate weights or by using stratum information and Taylor Series estimation (see FAQ item 8 on SPSS for more information about this issue with PUF data and SPSS).
You should always use the weights provided with CHIS unless you have another method of accounting for sampling, nonresponse, and coverage. Not doing so is considered poor statistical practice, and will often result in inaccurate conclusions.
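A minimal sketch of how RAKEDW0 changes a point estimate, using a tiny made-up dataset. `HAS_CONDITION` is a hypothetical indicator, not a CHIS variable name, and the weights are invented; only the weight name RAKEDW0 comes from the text.

```python
def weighted_total(records, indicator, weight="RAKEDW0"):
    """Estimated number of people in the population with indicator == 1."""
    return sum(r[weight] for r in records if r[indicator] == 1)


def weighted_proportion(records, indicator, weight="RAKEDW0"):
    """Estimated share of the population with indicator == 1."""
    population = sum(r[weight] for r in records)
    return weighted_total(records, indicator, weight) / population


# Hypothetical respondents; each weight is the number of people represented.
records = [
    {"HAS_CONDITION": 1, "RAKEDW0": 3000.0},
    {"HAS_CONDITION": 0, "RAKEDW0": 500.0},
    {"HAS_CONDITION": 0, "RAKEDW0": 500.0},
    {"HAS_CONDITION": 1, "RAKEDW0": 1000.0},
]

unweighted = sum(r["HAS_CONDITION"] for r in records) / len(records)
weighted = weighted_proportion(records, "HAS_CONDITION")
total = weighted_total(records, "HAS_CONDITION")
```

Here the unweighted sample share is 50%, while the weighted estimate is 80% of a represented population of 5,000. This covers point estimates only; it says nothing about standard errors.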
No. To do so could result in inaccurate standard errors. Because of the geographically stratified sample design, the different strata (counties or groups of counties) should be accounted for. This can be done in two different ways: using the Taylor Series Linearization (TSL) method or the Replicate Weight method. For the TSL method, the stratum variable must be indicated (this variable is called TSVARSTR in CHIS). If you only need the population estimate (percentage) or the population total, then you don't need this stratum identifier.
But if you need a standard error, p-value, confidence interval, or any type of test of association, you'll need the stratum identifier in order to use the TSL method. This stratum identifier is a sub-state geographic identifier, and is only available in the confidential data files. If you only have access to the Public Use Files, you'll need to use the Replicate Weights method instead.
Replicate weights can be used to account for the geographically stratified CHIS sample design. In addition to the full sample weight (RAKEDW0), there are 80 additional "replicate weight" variables (RAKEDW1-RAKEDW80). Replicate weights allow researchers to generate accurate standard errors, confidence intervals, and tests of significance for population estimates. They are used in the absence of sample design information (e.g., TSVARSTR) in the data set, such as the CHIS Public Use Files.
RAKEDW0 is the final survey weight. If you want to produce simple state-wide estimates that include population totals and/or percentages then you will need to use this weight.
However, if you need to generate accurate standard errors, confidence intervals, or tests of significance, you'll need to use the replicate weights that include RAKEDW1-RAKEDW80. The replication method for calculating standard errors must be employed because the CHIS Public Use Files do not include sub-state geographic identifiers or the sampling stratum variable(s) (e.g. counties/groups of counties). This PDF provides information on how to apply the weights using sample code in SUDAAN and Stata.
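The replicate-weight idea can be sketched in a few lines: re-estimate the statistic once per replicate weight and measure how far the replicate estimates spread around the full-sample estimate. The variance formula below (sum of squared deviations from the full-sample estimate, times a multiplier of 1) is an assumption for illustration and should be confirmed against the CHIS sample code before use. The data are made up, `Y` is a hypothetical indicator, and only two replicate weights are shown instead of 80.

```python
import math


def weighted_proportion(records, indicator, weight):
    """Weighted share of records with indicator == 1 under the given weight."""
    population = sum(r[weight] for r in records)
    return sum(r[weight] for r in records if r[indicator] == 1) / population


def replicate_standard_error(records, indicator, n_replicates=80, multiplier=1.0):
    """Return (full-sample estimate, standard error).

    Assumes variance = multiplier * sum((replicate - full) ** 2); the
    multiplier of 1 is an assumption, not taken from the text above.
    """
    full = weighted_proportion(records, indicator, "RAKEDW0")
    squared_deviations = 0.0
    for i in range(1, n_replicates + 1):
        replicate = weighted_proportion(records, indicator, f"RAKEDW{i}")
        squared_deviations += (replicate - full) ** 2
    return full, math.sqrt(multiplier * squared_deviations)


# Toy data with two replicate weights instead of the 80 in the real files.
records = [
    {"Y": 1, "RAKEDW0": 100.0, "RAKEDW1": 120.0, "RAKEDW2": 80.0},
    {"Y": 0, "RAKEDW0": 100.0, "RAKEDW1": 80.0, "RAKEDW2": 120.0},
]

estimate, standard_error = replicate_standard_error(records, "Y", n_replicates=2)
```

In practice a survey procedure in SAS, Stata, SUDAAN, or R would handle this; the sketch is only meant to show what the 80 extra weight variables are for.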
You can use either the replication method or the Taylor linearization method for standard error calculation. The decision is left up to you. If you've already published something with estimates and their standard errors or confidence intervals, then you'll probably want to continue to use replicate weights.
At this time SPSS does not have native (i.e., default) syntax to allow for analysis using the replicate weights. In other words, that function is not built into SPSS. You'll have to use SAS, Stata, or R, or another software that has this function like SUDAAN. Please note that the point estimates themselves will be the same when using either replicate weights or Taylor Series estimation in SAS, SPSS, Stata, SUDAAN, R, or other programs. It is the standard errors that will be biased (often underestimated) if replicate weights are used with SPSS.
Further, you should not use SPSS with PUF data because those data sets do not have stratum-level (i.e., county) information (to reduce disclosure risk), and that information is necessary for Taylor Series estimation (the only type of survey weight analysis SPSS can currently do). Thus, appropriate variance estimation cannot be conducted in SPSS with PUF data. User-written programs to account for replicate weights may have been developed for SPSS, but data users should seriously consider using a non-SPSS software, or thoroughly vetting/testing any SPSS replicate weight program they find or develop themselves.
Alternatively, support for SPSS analyses using Taylor Series methods is available through the DAC.
Keep in mind that statistical tests and confidence intervals make use of the standard error, so you'll need to use the replicate weights if either is desired. Stratum information is contained within each replicate weight.
This is a complicated and somewhat controversial issue, so we prefer not to comment specifically. Please contact us and we'll be happy to give you some general information about this topic.
The variable you're using in the Public Use Files may be a different variable than the one being used by AskCHIS. Please note that AskCHIS now uses the replication method for standard error calculation.
The replication method and the Taylor series method are both attempts at estimating the standard error. Standard errors from these two may differ, but the differences are small. The estimates will remain the same.
The replicate weights are computationally intensive and regression models can take an especially long time to run. The Taylor series method will run much faster. In addition, you will be able to use SAS when employing the Taylor series method.
No. The CHIS sample is drawn and weighted at the county level and is, therefore, only representative of the county level. If you apply the weights to respondents at geographic levels lower than strata, your estimates may be biased and the direction of the bias will be unknown.
You can potentially do multilevel modeling at any level the sample size permits, including Census tracts, ZIP codes, or counties.
Maybe. It depends on your objectives. This is another complicated topic so send us a note or give us a call and we'll talk about it. In the meantime, take a look at our methodology report on pooling. It will give you some background information.
Please contact the CHIS Data Access Center. Further information can be provided concerning the feasibility of your project.