Summary

Published Date: April 22, 2025

Appropriate training data are a prerequisite for health AI tools. Policymakers, clinicians and patients can assess the datasets used to train AI models as a practical step in determining whom health AI tools are likely to benefit. Analyses of training datasets can help prioritize which health AI tools to validate and help identify where changes are needed to improve the equity of health AI.

This study argues that even more emphasis on training datasets is warranted because (1) appropriate training data are a prerequisite for AI; (2) health AI tools cannot be expected to work well for subgroups and individuals who are grossly under-represented in training data; (3) transparency about the data used to train health AI models can help policymakers, clinicians and individual patients understand whom AI tools are likely to work for; and (4) analyzing the appropriateness of training data is a practical step that can help prioritize future validation work and improve the equity of health AI tools in the long term.

Just as authors assess the appropriateness of data when interpreting studies and analytics performed with traditional statistical methods, they can learn much about whom health AI tools are likely to benefit by analyzing the datasets used to train health AI models. Analyses of training data are not sufficient to ensure responsible AI. However, presenting information about training data in simple and understandable ways can help policymakers, clinicians and patients understand what health AI tools can do for whom, and can inform planning for investments in validation and improvements in health AI equity.

This study references 2021 California Health Interview Survey (CHIS) data.