Using Machine Learning to Uncover Hidden Heterogeneities in Survey Data

Summary

Published Date: November 05, 2019

Survey responses in public health surveys are heterogeneous. The quality of a respondent's answers depends on many factors, including cognitive abilities, interview context, and whether the interview is conducted in person or self-administered. A largely unexplored issue is how the language in which a public health survey interview is conducted is associated with the survey responses. The authors introduce a machine learning approach, Fuzzy Forests, which they use for model selection. They use the 2013 California Health Interview Survey (CHIS) as the training sample and the 2014 CHIS as the test sample.

The authors find that non-English language survey responses differ substantially from English responses in reported health outcomes.

Heterogeneity among the Asian languages suggests that caution should be used when interpreting results that compare across these languages. The Fuzzy Forests model trained on the 2013 data also correctly predicted 86% of good health outcomes when the 2014 data were used as the test set.
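
The summary does not define how the 86% figure was calculated. One plausible reading, offered here only as an assumption, is the share of test-set respondents reporting good health whom the model also classified as good health. The minimal sketch below computes that share; the function name and the labels are hypothetical and are not CHIS data or the authors' code.

```python
def share_of_good_outcomes_predicted(actual, predicted, good=1):
    """Fraction of actual `good` cases that were also predicted `good`.

    Assumed interpretation of "correctly predicted X% of good health
    outcomes"; the source summary does not specify the metric.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Keep only the predictions for respondents who actually reported good health.
    preds_for_good = [p for a, p in zip(actual, predicted) if a == good]
    if not preds_for_good:
        raise ValueError("no good-health cases in the test labels")
    return sum(1 for p in preds_for_good if p == good) / len(preds_for_good)


# Made-up labels for illustration only (1 = good health, 0 = otherwise).
actual_2014 = [1, 1, 1, 1, 0, 0, 1, 0]
predicted_2014 = [1, 1, 0, 1, 0, 1, 1, 0]

rate = share_of_good_outcomes_predicted(actual_2014, predicted_2014)
print(f"Share of good-health outcomes predicted correctly: {rate:.0%}")
```

Under this reading the figure ignores respondents who did not report good health, so it says nothing on its own about how often the model wrongly predicts good health.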