GPT-4 Outperformed Simulated Human Readers in Diagnosing Complex Clinical Cases
OpenAI’s GPT-4 correctly diagnosed 52.7% of complex challenge cases, compared with medical journal readers’ average of 36%, and outperformed 99.98% of simulated human readers, according to a study published by the New England Journal of Medicine.
The evaluation, conducted by researchers in Denmark, used GPT-4 to diagnose 38 complex clinical case challenges, each with text information, published online between January 2017 and January 2023. GPT-4’s responses were compared with 248,614 answers from online medical journal readers.
Each complex clinical case included a medical history alongside a poll with six options for the most likely diagnosis. The prompt asked GPT-4 to identify the diagnosis by answering the multiple-choice question based on the full, unedited text of the clinical case report. Each case was presented to GPT-4 five times to evaluate reproducibility.
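In rough code terms, the setup resembles the sketch below, a minimal, hypothetical example assuming the OpenAI chat completions API; the prompt wording, the diagnose helper, and the option formatting are illustrative, not the study’s exact protocol.

```python
# Hypothetical sketch of presenting one case to GPT-4 five times.
# Assumes the OpenAI Python client; prompt text is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diagnose(case_text: str, options: list[str], runs: int = 5) -> list[str]:
    """Ask GPT-4 for the most likely diagnosis, repeated to gauge reproducibility."""
    prompt = (
        "Read the following clinical case and choose the most likely "
        "diagnosis from the options given. Answer with one option only.\n\n"
        f"Case:\n{case_text}\n\nOptions:\n"
        + "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    )
    answers = []
    for _ in range(runs):  # each case was presented five times in the study
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content.strip())
    return answers
```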
For comparison, the researchers used the reader votes collected for each case to simulate 10,000 complete sets of answers, yielding a pseudopopulation of 10,000 human participants.
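One way to read that procedure: each simulated participant answers every case by drawing randomly from the observed vote distribution for that case. Below is a minimal sketch under that assumption; the simulate_readers helper and the vote counts shown are hypothetical, not the study’s code or data.

```python
# Hypothetical sketch of building a pseudopopulation from per-case vote counts.
import random

def simulate_readers(vote_counts: list[list[int]], n_readers: int = 10_000,
                     seed: int = 0) -> list[list[int]]:
    """For each simulated reader, draw one answer per case, weighted by the
    observed vote distribution for that case."""
    rng = random.Random(seed)
    readers = []
    for _ in range(n_readers):
        answers = [
            rng.choices(range(len(counts)), weights=counts, k=1)[0]
            for counts in vote_counts  # one weighted draw per case
        ]
        readers.append(answers)
    return readers

# Example: two cases with six answer options each (counts are made up).
votes = [[120, 45, 300, 80, 25, 30], [60, 210, 90, 40, 15, 85]]
pseudopopulation = simulate_readers(votes)
print(len(pseudopopulation))  # 10000 simulated readers
```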
The cases spanned several specialties, most commonly infectious disease (15 cases, 39.5%), endocrinology (five cases, 13.1%), and rheumatology (four cases, 10.5%). Patients in the clinical cases ranged from newborn to 89 years of age, and 37% were female.