Published Date: November 29, 2023

Summary: The authors aimed to assess the accuracy of GPT-4, a conversational artificial intelligence model, in diagnosing and triaging health conditions, and whether its performance varies by patient race and ethnicity. In February and March 2023, they compared the performance of GPT-4 with that of physicians using 45 typical clinical vignettes, each with a correct diagnosis and triage level. For each vignette, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and a triage level (emergency, nonemergency, or self-care).

Independent reviewers evaluated the diagnoses as “correct” or “incorrect.” Physician diagnosis was defined as the consensus of the 3 physicians. To evaluate whether the performance of GPT-4 varied by patient race and ethnicity, the authors added information on patient race and ethnicity to the clinical vignettes.

Findings: The percentage of correct diagnoses was 97.8% for GPT-4 and 91.1% for physicians. GPT-4 provided appropriate reasoning for 97.8% of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians.
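
As a rough check on the arithmetic, the reported percentages can be converted back to whole-vignette counts out of 45. The sketch below is illustrative only: the counts it produces are inferred from the rounded percentages above, not stated in this summary, and the function and variable names are hypothetical.

```python
# Sketch: infer whole-vignette counts from the reported percentages.
# Assumption: each percentage is a count out of 45, rounded to one decimal.

TOTAL_VIGNETTES = 45


def correct_count(percent: float, total: int) -> int:
    """Return the whole-number count closest to `percent` percent of `total`."""
    return round(percent / 100 * total)


def as_percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(count / total * 100, 1)


gpt4_correct = correct_count(97.8, TOTAL_VIGNETTES)
physician_correct = correct_count(91.1, TOTAL_VIGNETTES)

print(f"GPT-4: {gpt4_correct}/{TOTAL_VIGNETTES} "
      f"({as_percent(gpt4_correct, TOTAL_VIGNETTES)}%)")
print(f"Physicians: {physician_correct}/{TOTAL_VIGNETTES} "
      f"({as_percent(physician_correct, TOTAL_VIGNETTES)}%)")
```

Under this assumption, the gap between GPT-4 and physicians amounts to a small number of vignettes, which helps put the "comparable" conclusion in context given the sample size.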

GPT-4’s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
