A Comparison of Methods for Coding Race in Linear and Logistic Regression Models

Summary

Published Date: December 01, 2025

In public health and clinical research, regression analyses often treat race as a confounder and "control" for it with simple indicator variables, using non-Hispanic white as the reference group, with little scrutiny from the data analyst. From a health equity perspective, this approach raises multiple issues. The authors examine and compare several methods for coding race in linear and logistic regression models, using a sample of 8097 participants (≥18 years old) from the 2020 New York City Community Health Survey. To illustrate the importance of the coding method, they conducted regression analyses comparing the results from six coding approaches: dummy, simple effect, difference (forward and backward), deviation, and analyst-defined coding. The outcome variables were body mass index, measured continuously, in the linear regression models and diabetes status, measured dichotomously, in the logistic regression models.
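To make the difference between coding schemes concrete, the following minimal sketch contrasts two of the approaches named above, dummy and deviation coding, in a linear model. It is an illustration under stated assumptions, not the authors' analysis: the data are synthetic, the four groups are unlabeled placeholders, the helper names (`dummy_contrasts`, `deviation_contrasts`, `fit_ols`) are hypothetical, and deviation coding is taken in its common textbook sense (each group compared with the unweighted mean of the group means).

```python
import numpy as np

def dummy_contrasts(k, ref=0):
    """Dummy (treatment) coding: k-1 indicator columns; the reference row is all zeros."""
    return np.delete(np.eye(k), ref, axis=1)

def deviation_contrasts(k, omitted=None):
    """Deviation (sum-to-zero) coding: the omitted level's row is all -1."""
    omitted = k - 1 if omitted is None else omitted
    c = np.delete(np.eye(k), omitted, axis=1)
    c[omitted, :] = -1.0
    return c

def fit_ols(groups, y, contrasts):
    """Least-squares fit of y on an intercept plus the contrast-coded group variable."""
    X = np.column_stack([np.ones(len(y)), contrasts[groups]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic, purely illustrative data: 4 placeholder groups and a continuous outcome.
rng = np.random.default_rng(0)
k = 4
groups = np.repeat(np.arange(k), 50)
assumed_means = np.array([27.0, 29.0, 26.0, 28.5])  # made-up values, not survey estimates
y = assumed_means[groups] + rng.normal(0.0, 1.0, size=groups.size)

group_means = np.array([y[groups == g].mean() for g in range(k)])

b_dummy = fit_ols(groups, y, dummy_contrasts(k, ref=0))
b_dev = fit_ols(groups, y, deviation_contrasts(k))

# Dummy coding: intercept is the reference group's mean;
# each slope is (group mean - reference group mean).
print("dummy:    ", np.round(b_dummy, 3))

# Deviation coding: intercept is the unweighted mean of the group means;
# each slope is (group mean - that grand mean), so no single group is the yardstick.
print("deviation:", np.round(b_dev, 3))
```

Both fits reproduce the same group means; only the comparison each coefficient encodes changes. Under dummy coding every estimate is read relative to the chosen reference group, whereas under deviation coding each group is read relative to the average across groups.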

Findings: The choice of coding method has implications for identifying racial health inequities, and the selection of the reference group is critical to measuring racial inequities in health outcomes. This study emphasizes the need to consider the impact of coding techniques on research study design, particularly when racial health inequities are the research focus.

This publication cites "Income Disparities in Obesity Trends among U.S. Adults: An Analysis of the 2011–2014 California Health Interview Survey," a study that uses California Health Interview Survey (CHIS) data.