National Institutes of Health (NIH) researchers have found that an artificial intelligence (AI) model can accurately answer medical quiz questions, based on clinical images and brief text summaries, that are designed to test health professionals’ diagnostic skills. However, physician evaluators found errors in the AI’s image descriptions and in its explanations of how it reached its decisions. The findings were reported in a study published in npj Digital Medicine, led by the NIH’s National Library of Medicine (NLM) and Weill Cornell Medicine in New York City.
NLM’s Acting Director, Stephen Sherry, PhD, commented on the findings, stating, “The integration of AI into healthcare shows immense potential to aid medical professionals in diagnosing patients more swiftly, thus hastening the onset of treatment.” However, he also cautioned that “AI has not yet evolved sufficiently to supplant the nuanced expertise of human clinicians, which remains vital for precise diagnostics.”
The study compared the AI model and physicians on diagnostic questions from the New England Journal of Medicine (NEJM) Image Challenge, an online quiz that presents real clinical images with brief text descriptions of patient symptoms and asks users to choose the correct diagnosis from multiple-choice options. The AI model was tasked with answering 207 such questions and providing a detailed justification for each answer, including a description of the image, a summary of relevant medical knowledge, and the reasoning behind its choice.
Nine physicians from various specialties participated in the study to assess the AI’s performance. They first answered the questions without any external resources (closed-book) and then with access to external information (open-book). After submitting their answers, the physicians were shown the correct diagnoses along with the AI’s answers and rationales, which they were asked to evaluate.
The findings indicated that both the AI model and the physicians performed well at identifying the correct diagnoses. Intriguingly, the AI often outperformed the physicians in the closed-book setting, while the physicians did better with access to external resources, particularly on the more challenging questions. However, the physicians consistently found faults in the AI’s image descriptions and explanatory reasoning, even in cases where it selected the correct diagnosis. In one example, the AI failed to recognize that two photographs of lesions on a patient’s arm, taken from different angles, showed the same condition; because the differing perspectives made the lesions look dissimilar, it classified them as separate problems.
These results underscore the need for further evaluation of multimodal AI technologies before clinical deployment. Zhiyong Lu, PhD, NLM Senior Investigator and the study’s corresponding author, emphasized, “This technology could significantly enhance clinicians’ abilities by providing data-driven insights for better clinical decisions. Nevertheless, understanding its risks and limitations is crucial for effectively leveraging its potential in medical practice.”
The study utilized an AI model known as GPT-4V (Generative Pre-trained Transformer 4 with Vision), a multimodal AI capable of processing diverse data types, including text and images. Although the research scope was limited, it highlighted the potential benefits and challenges of using multimodal AI in enhancing medical decision-making. The researchers advocate for more extensive studies to better understand how these AI models compare with human physicians in diagnostic accuracy.
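To make the setup concrete: a multimodal model of this kind accepts an image and a text prompt in a single request and returns a free-text answer. The sketch below shows one way such a query could be sent using the OpenAI Python client; the model name, image URL, and prompt wording are illustrative assumptions, not the study’s actual protocol.

    # Minimal sketch: send an image plus a short clinical vignette to a
    # vision-capable model via the OpenAI Chat Completions API.
    # The model name, image URL, and prompt text are placeholders,
    # not the setup used in the study.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; placeholder choice
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("A patient presents with the skin finding shown. "
                              "Choose the most likely diagnosis from options A-E "
                              "and explain your reasoning, including a description "
                              "of the image.")},
                    {"type": "image_url",
                     "image_url": {"url": "https://example.com/clinical-image.jpg"}},
                ],
            }
        ],
    )

    # The model's diagnosis and rationale come back as free text.
    print(response.choices[0].message.content)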
More information: Qiao Jin et al., "Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine," npj Digital Medicine (2024). DOI: 10.1038/s41746-024-01185-7
Journal information: npj Digital Medicine
Provided by NIH/National Library of Medicine
