Artificial intelligence models are increasingly integrated into medical diagnostics, particularly for analysing imaging data such as X-rays. Research has shown, however, that these systems do not perform consistently across demographic groups, often achieving lower diagnostic accuracy for women and patients from minority ethnic groups.
In an intriguing development, a 2022 study by MIT researchers demonstrated that AI models can reliably predict a patient’s race from chest X-rays, something even experienced radiologists cannot do. This capability, however, comes with significant implications. The same team has now found that the accuracy with which these models predict demographic details correlates with substantial fairness gaps in their diagnostic performance. In other words, models that are better at identifying demographic characteristics tend to show larger disparities in diagnosing disease across racial and gender groups. This suggests the AI may be taking demographic shortcuts in its evaluations, leading to potentially incorrect diagnoses for certain groups, such as women and Black individuals.
Marzyeh Ghassemi, an associate professor at MIT, highlighted the link between a model’s ability to predict demographics and its uneven performance across groups, a connection that had not previously been established. The study underscores the urgent need to address these biases, which could otherwise lead to harmful consequences for patient care.
The researchers have explored methods to improve the fairness of these models. Retraining the AI with an explicit emphasis on reducing bias showed promising results, but only when the models were applied to patients similar to those they were trained on. When used on patients from different hospitals, the fairness gaps re-emerged, indicating that the debiasing did not reliably carry over to new settings.
Haoran Zhang, an MIT graduate student and lead author of the study, advises that hospitals should rigorously test external AI models against their own demographic data to confirm that any fairness claims hold in their specific context. This is crucial because models typically perform best on data resembling what they were trained on and may not generalise well across different settings.
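In practice, the audit Zhang describes amounts to stratifying a model’s performance by demographic group on a hospital’s own labelled data. The sketch below is a minimal version of that idea; the column names, the 0.5 decision threshold, and the choice of false-negative rate as the metric are assumptions for illustration, not details from the study.

```python
import pandas as pd

def subgroup_report(df: pd.DataFrame, group_col: str,
                    label_col: str = "label",
                    score_col: str = "score",
                    threshold: float = 0.5) -> pd.DataFrame:
    """Per-subgroup accuracy and false-negative (underdiagnosis) rate.

    Column names and threshold are hypothetical; adapt to local data.
    """
    rows = []
    for group, sub in df.groupby(group_col):
        pred = (sub[score_col] >= threshold).astype(int)
        positives = sub[label_col] == 1
        fnr = (pred[positives] == 0).mean() if positives.any() else float("nan")
        rows.append({"group": group,
                     "n": len(sub),
                     "accuracy": (pred == sub[label_col]).mean(),
                     "false_negative_rate": fnr})
    return pd.DataFrame(rows)

# Usage on a hospital's own evaluation set:
# report = subgroup_report(local_eval_df, group_col="race")
# fnr_gap = report["false_negative_rate"].max() - report["false_negative_rate"].min()
```

A large spread in the per-group metric is precisely the kind of fairness gap the study reports.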
The FDA has approved many AI-enabled medical devices for use in radiology, highlighting the growing reliance on AI in medical diagnostics. However, the discovery that these models can inadvertently learn and exploit demographic information, even when never explicitly trained to do so, raises concerns about how they are applied and about the ethical implications of their use.
The study applied AI models to publicly available chest X-ray datasets to predict several medical conditions and examined their performance. The findings revealed not only variability in accuracy by gender and race, but also a correlation between a model’s demographic-prediction accuracy and its fairness gap. This indicates that the AI may be using demographic features as proxies in its diagnostic process, which could undermine both the fairness and the efficacy of medical diagnostics.
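That headline correlation can be pictured with a toy calculation: for each trained model, pair its demographic-prediction AUC with its fairness gap on the diagnostic task, then correlate the two across models. The numbers below are placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# One entry per trained model (hypothetical values for illustration):
demographic_auc = np.array([0.62, 0.71, 0.80, 0.88, 0.93])  # how well it predicts race/gender
fairness_gap    = np.array([0.03, 0.05, 0.08, 0.11, 0.14])  # spread in diagnostic performance

r, p = pearsonr(demographic_auc, fairness_gap)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A strongly positive r would mirror the study's finding: models that encode
# demographics more strongly tend to show larger diagnostic disparities.
```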
To combat these issues, the researchers retrained models in two ways: with methods designed to improve subgroup robustness, and with group-adversarial methods that strip demographic information from the model’s learned representations. Both approaches improved fairness, but only when the test data closely resembled the training data.
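For the second of these strategies, a common construction (an assumption here, since the article gives no implementation details) is adversarial training with a gradient-reversal layer: an auxiliary head tries to predict the demographic group from the model’s features, while reversed gradients push the encoder to discard whatever signal that head relies on. A minimal PyTorch sketch, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward
    pass, so the encoder is pushed to *remove* features the adversary uses."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder   = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 128), nn.ReLU())
diagnosis = nn.Linear(128, 1)   # disease-prediction head
adversary = nn.Linear(128, 2)   # demographic head (e.g. two groups)

def step(x, y_disease, y_group, lam=1.0):
    # y_disease: float tensor of 0/1 labels; y_group: long tensor of group ids
    z = encoder(x)
    loss_task = nn.functional.binary_cross_entropy_with_logits(
        diagnosis(z).squeeze(1), y_disease)
    # The adversary trains normally; only reversed gradients reach the encoder.
    loss_adv = nn.functional.cross_entropy(
        adversary(GradReverse.apply(z, lam)), y_group)
    return loss_task + loss_adv
```

The subgroup-robustness alternative (for example, group DRO, which upweights the worst-performing group’s loss) approaches the same goal from the opposite direction: rather than removing demographic signal, it directly penalises performance gaps between groups.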
The persistence of fairness gaps on other datasets points to a significant challenge: a model debiased in one context may not remain fair in another. This variability highlights the complexity of deploying AI in medicine and the need for continuous vigilance and adaptation to ensure these technologies serve all patients equitably.
Ghassemi’s team plans to continue exploring new methods to make AI predictions fair and accurate across diverse patient populations. The research reinforces the need for hospitals to evaluate AI models thoroughly against their own demographic data before deployment. Responsible development and deployment of AI in healthcare is a vital step towards ensuring unbiased medical outcomes.
More information: Yuzhe Yang et al, The limits of fair medical imaging AI in real-world generalization, Nature Medicine (2024). DOI: 10.1038/s41591-024-03113-4
Journal information: Nature Medicine
Provided by Massachusetts Institute of Technology
