Frequently Asked Questions (FAQs)
The following FAQs may be of use to researchers, academics and other data experts in search of quick answers to technical questions on how to use CHIS. For a more general explanation of CHIS, please visit the About CHIS section.
See FAQs and answers by clicking on the following topics:
For more nuanced questions, please consult our CHIS Forum.
General CHIS Data Information:
- If I am interested in a specific topic or survey question, where can I find that information?
- I know a question was asked in the survey, but I can't find it.
- I need more information about specific variables and their use. Where can I get more information?
- What is a source variable? How is this different than any other variable?
- What is a constructed variable? How is this different than any other variable?
- Why are there no missing values for most variables?
- What is imputation?
- I'm looking at the frequency for a variable and the total doesn't match the survey sample size. Why not?
- My university/institution says I need IRB approval. Does your IRB approval cover my project?
- How does a CHIS respondent provide informed consent?
- How do I cite CHIS in my publication?
1. If I am interested in a specific topic or survey question, where can I find that information? Answer:
The best way to find out about topics that are covered in the questionnaires is to check the Survey Topics List for each survey year. You can also check the questionnaires; every variable that is available in the CHIS data files is based on the information collected via the Questionnaires.
2. I know a question was asked in the survey, but I can't find it. Answer:
The survey is conducted on a continuous basis and data is released in a variety of formats. Not all questions were asked in all years. In addition, there are separate questionnaires for adults, adolescents and children. To see questionnaires for each survey cycle, visit our Questionnaires page in the Survey Design & Methods section.
3. I need more information about specific variables and their use. Where can I get more information? Answer:
The first place to start is the Questionnaire. In addition, the data dictionaries contain informative notes and are available for both Public Use Files (PUFs) and confidential source or Sensitive Data. For all years, the data dictionaries include the raw frequencies for nearly every variable. We also have a constructed variables document available on the PUF website which outlines how the constructed variables were created. Should you continue to have questions after that, let us know and we'll do our best to help you.
4. What is a source variable? How is this different than any other variable? Answer:
Source variables are variables based on a single question asked during the CHIS interview. These include confidential variables, as well as those found in the Public Use Files. Source variables are generally labeled with two letters followed by numbers.
5. What is a constructed variable? How is this different than any other variable? Answer:
Constructed variables are variables that were put together by the CHIS Data Production team or by Westat, Inc., the data collection contractor. Constructed variables are usually based on multiple questions asked during the interview. In general, constructed variables can be identified by their variable names, which are acronyms and/or abbreviations.
6. Why are there no missing values for most variables? Answer:
For nearly every question on the surveys, some respondents do not provide a valid response (for example, don't know, refused, or other responses). For these missing items, values have been imputed. See the imputation documentation available online.
7. What is imputation? Answer:
Imputation replaces a missing value with a value generated by a statistical algorithm. Detailed information about the imputation methods can be found in the imputation documentation.
8. I'm looking at the frequency for a variable and the total doesn't match the survey sample size. Why not? Answer:
Not every question is asked of every respondent. Many questions are limited to a particular group through skip patterns. These skip patterns create a universe of respondents eligible to answer a question. For example, the universe for a question on prostate screening in the past year is limited to 'all adult men age 40 years and older.'
9. My university/institution says I need IRB approval. Does your IRB approval cover my project? Answer:
No. You'll need to get IRB approval from your own institution. Most likely, you'll be able to get an "Exemption", but check with your local IRB.
10. How does a CHIS respondent provide informed consent? Answer:
As part of the survey, a screener script is read that includes: 1) an introduction of the interviewer, UCLA, and the survey sponsors; 2) an explanation of purpose and importance of the survey; 3) statements describing the confidential and voluntary nature of the survey; and 4) statements explaining the respondent's right to skip any questions and to end the interview at any time.
The (potential) respondent is then asked whether they wish to participate in the survey. Choosing to continue with the survey implies consent.
11. How do I cite CHIS in my publication? Answer:
Follow the CHIS Citation link for the recommended citation format.
Public Use Files
- Why are some items in the survey not in the Public Use Files (PUF)?
- How can a specific variable be considered identifiable if it's at the state level?
- How do I link the Adult, Adolescent, and/or Child files?
- I have never analyzed data before. Can someone help me figure out what this all means?
1. Why are some items in the survey not in the Public Use Files (PUF)? Answer: Some variables are deemed to be either sensitive or identifiable or both. The data for these variables cannot be disclosed in the Public Use Files.
2. How can a specific variable be considered identifiable if it's at the state level?
Answer: There is a distinction between identifiable and sensitive. Sensitive variables contain information that would likely harm or embarrass the individual if disclosed. Both identifiable and sensitive variables should be excluded from public use files according to the National Center for Health Statistics.
3. How do I link the Adult, Adolescent, and/or Child files?
Answer: As an additional layer of protection for the identity of the respondents, the variable that links the Adult, Adolescent, and Child files has been removed from the PUF. This variable is available only through the Data Access Center. However, some information from the Adult file has been added to the Child and Adolescent records (e.g., race, poverty status).
4. I have never analyzed data before. Can someone help me figure out what this all means?
Answer: We suggest you first try AskCHIS, an easy-to-use online query system that enables you to analyze data quickly and free of charge. There is also a multitude of supporting documentation on our website, including data dictionaries. We offer limited technical assistance on how variables were constructed and the general structure of the dataset. We will not, however, assist in the interpretation of the data. There are online resources available to help in this regard.
Local Health Departments
- I work for a county health department and want to use CHIS data specific to my county. How can I obtain that data?
- I want some of the confidential variables. How can I get access to this data?
1. I work for a county health department and want to use CHIS data specific to my county. How can I obtain that data? Answer:
Counties can receive a Local Health Department file. This file is essentially the PUF customized for your county or group of counties. Contact Dacchpr@ucla.edu for information about how to receive this file.
2. I want some of the confidential variables. How can I get access to this data? Answer:
Counties can apply for a confidential data file through the Data Access Center. This data file contains identifiable variables but does not contain sensitive variables. Contact Dacchpr@ucla.edu for information about how to receive files with confidential data for your county.
Working with Confidential Data
- I am a researcher with an academic institution and need access to confidential data. How do I access these files?
- Who reviews the Data Access Center application, and what are they looking for?
- Why are there fees for accessing confidential data?
- After I pay, will the confidential data file be sent to me?
- What is disclosure review?
- Why can't I just request the confidential variables I need and ALL of the Public Use Files variables?
- Once my project has been approved and the customized dataset has been cut, will I be able to add variables to my dataset?
- Do I have to pay for DAC use if I am a student?
- What type of software do you support in the DAC?
- I've never analyzed data before. Can someone help me figure out what this all means?
1. I am a researcher with an academic institution and need access to confidential data. How do I access these files? Answer:
You'll need to apply to use the confidential data through the Data Access Center (DAC). The application process consists of submitting a few forms and some supplementary documentation to the CHIS Data Disclosure Review Committee (DDRC). More information on the DAC, including application materials and variable lists, can be found after logging in to the DAC website.
2. Who reviews the Data Access Center application, and what are they looking for? Answer:
The application is reviewed by the Data Disclosure Review Committee (DDRC). This committee consists of representatives from the California Department of Public Health and the UCLA Center for Health Policy Research. The committee reviews the application to make sure the project: a) is consistent with the purposes of CHIS; b) is feasible; and c) will not threaten the confidentiality of the respondent. Projects that satisfy these requirements are recommended for approval. The CHIS Principal Investigator, Dr. Ninez Ponce, makes the final determination of whether to approve, reject, or suggest revision for a project proposal.
3. Why are there fees for accessing confidential data? Answer:
The fees associated with accessing the data help cover the costs of data production, committee review, providing a customized data file, and initial technical assistance. See the Data Access Center Fact Sheet for more information about fees.
4. After I pay, will the confidential data file be sent to me? Answer:
No, confidential data files do not leave the Data Access Center at UCLA. Data must be analyzed remotely by sending in programming code (SAS, SPSS, SAS-callable SUDAAN, or Stata) or by using programming/consulting services to assist in analysis. Output is returned to you after disclosure review.
5. What is disclosure review? Answer:
We review all output for small (raw frequency) cell sizes and make release decisions accordingly. For example, if you have generated a cross-tab of Pacific Islander women over the age of 50 by Census tract, we will not return that to you because it violates confidentiality and/or sensitivity requirements.
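As a rough illustration of the kind of check described above, the following Python sketch counts unweighted respondents in each cross-tab cell and lists the cells that fall below a minimum count. The threshold value, the function name, and the toy records are all assumptions made for illustration; the actual disclosure rules applied by the DAC are not stated here.

```python
from collections import Counter


def small_cells(rows, min_count):
    """Return cross-tab cells whose unweighted count is below min_count."""
    counts = Counter(rows)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Hypothetical records: one (group, area) pair per respondent.
records = (
    [("A", "tract1")] * 12
    + [("B", "tract1")] * 3
    + [("A", "tract2")] * 7
)

# min_count=5 is an assumed threshold, not a documented CHIS rule.
flagged = small_cells(records, min_count=5)
print(flagged)
```

Running a check like this on your own output before submitting it may help you anticipate which tables are likely to be withheld.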
6. Why can't I just request the confidential variables I need and ALL of the Public Use Files variables? Answer:
A dataset that contains any confidential variable is considered confidential and the more information contained in that dataset, the greater the risk to the respondent's confidentiality. In an effort to reduce that risk as much as possible, we require that customized datasets only contain those variables relevant to a particular project.
7. Once my project has been approved and the customized dataset has been cut, will I be able to add variables to my dataset? Answer:
Yes. We understand that you occasionally may neglect to select particular variables that are important to your project. Variables can be added by requesting the specific variables and providing a brief justification for adding the selected variables. For example, if your project is about the effect of household smoking on asthma and you did not select "age at asthma diagnosis" we will add that variable at your request.
8. Do I have to pay for DAC use if I am a student? Answer:
Yes, but there is a limited scholarship available to help cover part of the DAC fees typically paid by researchers. Contact the Data Access Center Manager at (310) 794-8362 or email@example.com for more information.
9. What type of software do you support in the DAC? Answer:
We primarily support SAS, SPSS, SAS-callable SUDAAN, and Stata. Other software is supported to a more limited extent. Please contact us about your software needs before submitting a DAC application. Even if you have paid for other software, it cannot be installed on our network, which is limited to software that we are licensed to use.
10. I've never analyzed data before. Can someone help me figure out what this all means? Answer:
You can first try AskCHIS, an online query system that enables you to analyze data quickly, easily and free of charge. We offer limited technical assistance in regard to understanding how variables are constructed and the general structure of the dataset. For a fee, we offer consultation services through our Data Estimate Services. There is also a multitude of supporting documentation on our website.
- All I really want is a weighted estimate at the county level. Do I still have to go through the Data Access Center?
- Are all variables in the CHIS questionnaire in AskCHIS?
- I need some information that I can't get from AskCHIS and still only want the weighted percentages. Where can I find that information?
- My estimates from AskCHIS are unstable (i.e., they have a high coefficient of variation or wide confidence intervals). Can I use them?
- How should I report single-year PUF or AskCHIS estimates that are unstable?
- I ran two-year estimates for 2011-2012 data in the past (before August 2015) and they are different from the pooled 2011 and 2012 estimates I subsequently produced. Why?
1. All I really want is a weighted estimate at the county level. Do I still have to go through the Data Access Center? Answer:
No. You can use AskCHIS, an online query system that enables you to analyze data quickly, easily and free of charge. It contains weighted totals and percentages for nearly all variables in the CHIS datasets. You can get county or multi-county level cross-tabulated estimates.
2. Are all variables in the CHIS questionnaire in AskCHIS? Answer:
No. It is a complicated and time-consuming process to develop variables for AskCHIS. If you have a suggestion for an AskCHIS variable, please email it to firstname.lastname@example.org.
3. I need some information that I can't get from AskCHIS and still only want the weighted percentages. Where can I find that information? Answer:
You can use our Data Estimates Service. This service can, for a fee, provide you with the weighted percentages and population totals at sub-state geographic levels for almost any variable or set of variables. Unreliable estimates (as a result of variables with very small sample sizes) will be suppressed. Please email your interest to email@example.com. CHIS also provides subsetting for a limited set of variables.
4. My estimates from AskCHIS are unstable (i.e., they have a high coefficient of variation or wide confidence intervals). Can I use them?
Answer: This depends largely on your purpose and how and where you will report or use the estimates. CHIS uses the coefficient of variation (CV) and the confidence interval (CI), both calculated from the standard error of the estimate, to express the sampling variance (or “sampling error”) around an estimate. The calculation of the standard error of the estimate takes into account the complex sample survey design in CHIS.
- The CV indicates whether or not a point estimate (e.g., a mean, proportion, total) is statistically stable relative to its standard error. The CV is the ratio of the standard error to the estimate, showing the proportion of the estimate that reflects sampling variability: $\mathrm{CV}(\hat{\theta}) = \mathrm{se}(\hat{\theta}) / \hat{\theta}$. In AskCHIS, estimates with a CV greater than 30% are "flagged" as statistically unstable with a red asterisk (*). While there are no absolute standards for judging how much variability is too much, CHIS generally recommends against reporting or relying on estimates that are this statistically unstable.
- Users will sometimes use a CI to see if an estimate is "statistically greater" than zero (or some other meaningful number) or if two estimates of subsamples from the same data are different from each other. Such uses are similar to "statistical significance testing" but not identical (Schenker & Gentleman, 2012). AskCHIS and AskCHIS NE report 95% confidence intervals.
Reasonable exceptions to the CV > 30% guideline may apply to binary/dichotomous variables when the magnitude of the estimate is very small or very large (i.e., below 10% or above 90%). CHIS calculates the CV for binary variables by using the category with the smallest value as the denominator, which means that the CV will be the same regardless of which estimate you choose to report for the binary indicator of interest. This is a conservative approach that minimizes the potential for reporting unstable estimates.
The calculation can also produce a high CV even when the sample size is relatively large. For example, AskCHIS shows that 2.6% of 0-5 year olds in California are uninsured and 95.1% have a usual source of care. Both estimates are flagged as statistically unstable even though the sample of this population is large, suggesting that they are not reportable. This is a limitation of the CV, which increases as the denominator approaches 0. Page 10 of Lee et al. (2007) contains a detailed explanation of this issue.
Under such circumstances (dichotomous variables with low or high values) it may be reasonable to report the estimate if the CI range is acceptable to data users, e.g., reporting estimates with confidence limits within +/- 5%. It may help to consult the statistical standards of your agency, company, or professional association or the standards used in the journals in which you hope to publish. If, however, you are in doubt or do not have another standard to use, we recommend deferring to the CV flags that caution against reporting estimates with high CVs, unless CI ranges are appropriate for your purpose.
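The CV rule described in this answer can be sketched in a few lines of Python. This is a minimal illustration, not AskCHIS's own code: the function names are made up for this example, and the standard error of 0.016 is a hypothetical value chosen only to show how a large proportion can still be flagged.

```python
def coefficient_of_variation(estimate, std_error):
    """CV = standard error divided by the estimate."""
    return std_error / estimate


def binary_cv(proportion, std_error):
    """CV for a binary variable, using the smaller category as the denominator."""
    return std_error / min(proportion, 1.0 - proportion)


def is_unstable(cv, threshold=0.30):
    """Flag an estimate whose CV exceeds the 30% guideline."""
    return cv > threshold


# 95.1% is the usual-source-of-care estimate quoted above;
# the standard error (0.016) is a hypothetical value for illustration.
cv_high = binary_cv(0.951, 0.016)
cv_low = binary_cv(0.049, 0.016)  # same indicator, reported from the other side

print(round(cv_high, 3), is_unstable(cv_high))
print(round(cv_low, 3), is_unstable(cv_low))
```

Because the smaller category (4.9%) is the denominator in both calls, the two CVs match, which is the behavior the answer describes for binary indicators.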
5. How should I report single-year PUF or AskCHIS estimates that are unstable?
Answer: Beginning in 2011, CHIS data has been collected continuously, enabling the release of single-year data files (Public Use Files [PUFs] and AskCHIS) to provide users with timely data. However, CHIS is designed to provide stable estimates at the county level over a full two-year cycle of data collection. Thus single-year county-level estimates (or estimates for small subgroups) may not be stable. Trying to further analyze a sub-population within a county with single-year data may lead to unstable estimates. When this becomes a problem, we strongly recommend that you combine two or more years of CHIS data (e.g., pooling 2013 and 2014) to help stabilize these data. To combine years in AskCHIS, under the “Years” tab choose the years you want to combine and then select the “Pool” option.
6. I ran two-year estimates for 2011-2012 data in the past (before August 2015) and they are different from the pooled 2011 and 2012 estimates I subsequently produced. Why?
Answer: With the shift to continuous data collection in 2011, CHIS has developed one-year weights to provide estimates for each year of data collection. In general, estimates obtained by combining multiple years (such as 2011 and 2012) will be close, but not necessarily identical, to estimates produced from a two-year data file (such as 2011-2012) obtained prior to August 2015.
Part of this is due to differences in the population control totals used to weight one-year data and two-year data. For both data sets, control totals were created from the California Department of Finance’s (DOF) Population Estimates and Population Projections (for more information, refer to CHIS Methodology Report 5: Weight and Variance Estimation). The CHIS 2011-2012 two-year weights used the population control totals that represent 2012; since data from respondents were collected over this two-year period, the estimates represent the experience of respondents in this timeframe. The 2011 and 2012 CHIS one-year weights, on the other hand, are representative of the population of California in 2011 and 2012, respectively, and use the population control totals unique to 2011 and 2012. Pooling 2011 and 2012 together will provide a population-weighted average for the 2011-2012 data.
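To make the idea of a population-weighted average concrete, here is a small Python sketch. The population totals and proportions are hypothetical numbers, and the function name is invented for this example; it only shows the arithmetic of weighting each year's estimate by that year's population total.

```python
def pooled_proportion(yearly):
    """Population-weighted average of yearly proportions.

    yearly: list of (population_total, proportion) pairs, one per survey year.
    """
    total_population = sum(n for n, _ in yearly)
    return sum(n * p for n, p in yearly) / total_population


# Hypothetical values, chosen so the arithmetic is easy to follow.
year_a = (1_000_000, 0.10)
year_b = (3_000_000, 0.20)

pooled = pooled_proportion([year_a, year_b])
print(pooled)
```

The pooled value sits closer to the estimate from the year with the larger population total, which is one reason pooled one-year estimates need not equal an estimate weighted to a single year's control totals.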
- How does the CHIS sample represent California?
- What are weights? Why are they used?
- How do I use the weights? Do I really need to use them?
- If I want to do a chi-square or regression analysis, can I use RAKEDW0?
- What are replicate weights? Why are replicate weights used?
- I am using the Public Use Files and do not understand the variables, RAKEDW0, RAKEDW1, etc. What do these mean and which ones do I use?
- I've been using the Public Use Files with the replicate weights but now I'm planning on using the Data Access Center. Should I continue to use the replicate weights?
- Why can't I use SPSS with the replicate weights?
- Do I only need these "replicate weights" if I want to calculate the standard errors? Do I need any cluster or stratum information?
- I am using CHIS data for a multiple regression or multilevel modeling at the sub-stratum level. Should I weight the data?
- Should I apply weights to multilevel analyses at the state level?
- Why are the standard errors/confidence intervals in my AskCHIS estimates different compared to those I find using the Public Use Files?
- What is the difference between the replication method and the Taylor series linearization method?
- Under what circumstances would I use the Taylor series method?
1. How does the CHIS sample represent California? Answer:
Because it is not feasible to survey every household in California, CHIS uses a probability sampling method in an attempt to accurately represent the population of adults, adolescents, and children living within households in California. This means that the CHIS sample excludes institutionalized group quarter residents (e.g., residents of correctional institutions such as jails, or of nursing homes) and non-institutionalized group quarter residents (e.g., residents of college dormitories or military quarters on base).
And because CHIS is not a simple random sample, in order to make CHIS representative of California, population weights must be applied to produce accurate population estimates and totals.
2. What are weights? Why are they used? Answer:
Weights are variables in the data file that can be applied to the sample data to produce "weighted" population estimates. The weight variables are constructed through a complex and iterative process. There are separate weight variables for the adult, child and adolescent Public Use Files, all with the same name, RAKEDW0. These final survey weights (RAKEDW0) are used to calculate statewide estimates that represent the population, not the sample.
The population totals were obtained from the California Department of Finance and the American Community Survey.
3. How do I use the weights? Do I really need to use them? Answer:
Since the weights are available as variables in the data file, they can be applied to the sample data to produce accurate weighted population estimates. There are separate weight variables for the adult, child, and adolescent files, all with the same name, RAKEDW0.
For the most part, you'll want to use the weights. Not doing so could result in inaccurate conclusions. As an example, if we wanted to know the total population of adults in California who reported a cancer diagnosis, we would need to use the weight variables. As the table below indicates, 4,272 adults who were interviewed in 2003 reported being diagnosed with any kind of cancer, representing 11.2% of the sample. However, when weights are applied to the sample data, we are able to estimate a statewide population total of 2,127,075 Californians who reported a cancer diagnosis. This represents a weighted estimate of about 8.3%. In other words, although about 11% of the unweighted 2003 sample reported any kind of cancer, the weighted estimates suggest that about 8% of the entire adult population in California have been told that they have cancer.
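The difference between unweighted and weighted estimates can be sketched as follows. This is an illustrative Python example with made-up respondents and weights, not CHIS data; only the weight variable name RAKEDW0 comes from the text, and the function names are hypothetical.

```python
def weighted_total(indicator, weights):
    """Estimated population total: sum of weights for respondents with indicator == 1."""
    return sum(w for y, w in zip(indicator, weights) if y == 1)


def weighted_proportion(indicator, weights):
    """Estimated population proportion: weighted total divided by the sum of all weights."""
    return weighted_total(indicator, weights) / sum(weights)


# Toy data: 1 = reported the condition, 0 = did not.
reported = [1, 1, 0, 0]
rakedw0 = [100, 100, 400, 400]  # hypothetical final weights (RAKEDW0)

unweighted = sum(reported) / len(reported)
pop_total = weighted_total(reported, rakedw0)
weighted = weighted_proportion(reported, rakedw0)

print(unweighted, pop_total, weighted)
```

In this toy sample half of the respondents report the condition, but because those respondents carry smaller weights, the weighted proportion is lower. The same mechanism explains the gap between the sample percentage and the weighted percentage in the cancer example above.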
4. If I want to do a chi-square or regression analysis, can I use RAKEDW0? Answer:
No. To do so could result in inaccurate standard errors. Because of the geographically stratified sample design, the different strata (counties or groups of counties) should be accounted for.
This can be done in two different ways: using the Taylor Series Linearization (TSL) method or the Replicate Weight method. For the TSL method, the stratum variable must be indicated (this variable is called TSVARSTR in CHIS). If you only need the population estimate (percentage) or the population total, then you don't need this stratum identifier.
But if you need a standard error, p-value, confidence interval, or any type of test of association, you'll need the stratum identifier in order to use the TSL method. This stratum identifier is a sub-state geographic identifier, and is only available in the confidential data files. If you only have access to the Public Use Files, you'll need to use the Replicate Weights method instead.
5. What are replicate weights? Why are replicate weights used? Answer:
Replicate weights can be used to account for the geographically stratified CHIS sample design. In addition to the full sample weight (RAKEDW0), there are 80 additional "replicate weight" variables (RAKEDW1-RAKEDW80). Replicate weights allow researchers to generate accurate standard errors, confidence intervals, and tests of significance for population estimates. They are used in the absence of sample design information (e.g., TSVARSTR) in the data set, such as the CHIS Public Use Files.
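The general shape of a replicate-weight standard error can be sketched in Python as below. The estimate is computed once with the full-sample weight and once with each replicate weight, and the spread of the replicate estimates around the full-sample estimate gives the variance. The `multiplier` argument is an assumption: its correct value depends on the replication scheme, so check the CHIS weighting documentation before relying on it. The toy data use two replicate weights instead of the 80 in the real files, and all function names are hypothetical.

```python
import math


def weighted_proportion(indicator, weights):
    """Weighted proportion of respondents with indicator == 1."""
    return sum(y * w for y, w in zip(indicator, weights)) / sum(weights)


def replicate_standard_error(indicator, full_weight, replicate_weights, multiplier=1.0):
    """Return (estimate, standard error) from full-sample and replicate weights.

    multiplier is an assumed scaling factor that depends on the replication method.
    """
    theta = weighted_proportion(indicator, full_weight)
    replicate_estimates = [weighted_proportion(indicator, rw) for rw in replicate_weights]
    variance = multiplier * sum((r - theta) ** 2 for r in replicate_estimates)
    return theta, math.sqrt(variance)


# Toy data: four respondents, a full-sample weight (like RAKEDW0),
# and two replicate weights (like RAKEDW1, RAKEDW2).
y = [1, 0, 1, 0]
rakedw0 = [10, 10, 10, 10]
replicates = [
    [12, 8, 10, 10],
    [10, 10, 8, 12],
]

estimate, se = replicate_standard_error(y, rakedw0, replicates)
print(estimate, se)
```

In practice, survey procedures in SAS, Stata, or SUDAAN perform this calculation for you once the full-sample and replicate weights are declared; the sketch only shows what those procedures are doing conceptually.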
6. I am using the Public Use Files and do not understand the variables, RAKEDW0, RAKEDW1, etc. What do these mean and which ones do I use? Answer:
RAKEDW0 is the final survey weight. If you want to produce simple state-wide estimates that include population totals and/or percentages then you will need to use this weight.
However, if you need to generate accurate standard errors, confidence intervals, or tests of significance, you'll need to use the replicate weights, RAKEDW1-RAKEDW80. The replication method for calculating standard errors must be employed because the CHIS Public Use Files do not include sub-state geographic identifiers or the sampling stratum variable(s) (e.g., counties/groups of counties). This PDF provides information on how to apply the weights using sample code in SUDAAN and Stata.
7. I've been using the Public Use Files with the replicate weights but now I'm planning on using the Data Access Center. Should I continue to use the replicate weights? Answer:
You can use either the replication method or the Taylor series linearization method for standard error calculation. The decision is left up to you. If you've already published something with estimates and their standard errors or confidence intervals, then you'll probably want to continue to use replicate weights.
8. Why can't I use SPSS with the replicate weights? Answer:
At this time SPSS does not have native syntax to allow for analysis using the replicate weights. You'll have to use SAS or Stata. Please note that the estimates themselves will be the same when using SAS, SPSS, Stata or SUDAAN. It is the standard errors that will be biased (often underestimated) if SPSS is used. Note that user-written programs to account for replicate weights have been developed for SPSS.
9. Do I only need these "replicate weights" if I want to calculate the standard errors? Do I need any cluster or stratum information? Answer:
Keep in mind that statistical tests and confidence intervals make use of the standard error, so you'll need to use the replicate weights if either of these is desired. Stratum information is contained within each replicate weight.
10. I am using CHIS data for a multiple regression or multilevel modeling at the sub-stratum level. Should I weight the data? Answer:
This is a complicated and somewhat controversial issue, so we prefer not to comment specifically. Please contact us and we'll be happy to give you some general information about this topic (Carlo Carino, 310-794-8319, firstname.lastname@example.org).
11. Should I apply weights to multilevel analyses at the state level? Answer:
The same issue applies. Please get in touch with us.
12. Why are the standard errors/confidence intervals in my AskCHIS estimates different compared to those I find using the Public Use Files? Answer:
The variable you're using in the Public Use Files may be a different variable than the one being used by AskCHIS. Please note that AskCHIS now uses the replication method for standard error calculation.
13. What is the difference between the replication method and the Taylor series linearization method? Answer:
The replication method and the Taylor series method are both attempts at estimating the standard error. Standard errors from these two may differ, but the differences are small. The estimates will remain the same.
14. Under what circumstances would I use the Taylor series method? Answer:
The replicate weights are computationally intensive and regression models can take an especially long time to run. The Taylor series method will run much faster. In addition, you will be able to use SAS when employing the Taylor series method.
- Can I obtain prevalence rates at the sub-county level (e.g. ZIP codes, Census tracts)? Is this possible?
- I want to do some multilevel modeling. Does this work at the sub-stratum level?
- I want to combine or pool the data across years. Is this feasible?
- I have a great idea for a project using Census tracts, but have no idea if the sample size is big enough at that level. Your data dictionaries do not have the frequencies for these variables. What should I do?
- Where can I get some information on obtaining and analyzing geocoded data?
1. Can I obtain prevalence rates at the sub-county level (e.g. ZIP codes, Census tracts)? Is this possible?
Answer: No. The CHIS sample is drawn and weighted at the county level and is, therefore, only representative of the county level. If you apply the weights to respondents at geographic levels lower than strata, your estimates may be biased and the direction of the bias will be unknown.
2. I want to do some multilevel modeling. Does this work at the sub-stratum level?
Answer: You can potentially do multilevel modeling at any level the sample size permits, including Census tracts, ZIP codes, or counties.
3. I want to combine or pool the data across years. Is this feasible?
Answer: Maybe. It depends on your objectives. This is another complicated topic so send us a note or give us a call and we'll talk about it. In the meantime, take a look at our methodology report on pooling. It will give you some background information.
4. I have a great idea for a project using Census tracts, but have no idea if the sample size is big enough at that level. Your data dictionaries do not have the frequencies for these variables. What should I do?
Answer: Please contact the Data Access Center Coordinator (310-794-8319, email@example.com). Further information can be provided concerning the feasibility of your project.
5. Where can I get some information on obtaining and analyzing geocoded data?
Answer: Geocoded data, along with the GIS software and services needed to analyze it, are available through the Data Access Center. Please contact the Data Access Center Coordinator (310-794-8319, firstname.lastname@example.org) for more information.