In a series of experiments funded by Priscilla Chan and Mark Zuckerberg, researchers discovered that GPT-4 inadequately represented the demographic diversity of medical conditions and generated clinical vignettes that perpetuated demographic stereotypes, raising concerns about the use of GPT-4 for clinical decision support.
The researchers assessed whether GPT-4, accessed through the Azure OpenAI application programming interface (API), encoded racial and gender biases. They examined the impact of these biases on four potential clinical applications of large language models (LLMs): medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment.
Various prompts designed to simulate typical use of GPT-4 in clinical and medical-education applications were tested, and clinical vignettes were drawn from NEJM Healer and from published research on implicit bias in healthcare.
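As a rough illustration of how such prompts could be issued programmatically against GPT-4 through the Azure OpenAI API, the sketch below uses the openai Python client's AzureOpenAI interface. The endpoint, deployment name, API version, and prompt wording are placeholders for illustration only, not details taken from the study.

```python
import os
from openai import AzureOpenAI

# Placeholder Azure OpenAI configuration; not the study's actual deployment.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint="https://example-resource.openai.azure.com",
)

# A hypothetical medical-education prompt of the kind the study simulated:
# asking GPT-4 to generate a teaching vignette for a given condition, so the
# demographics it invents can later be tallied against real-world prevalence.
prompt = (
    "Write a one-paragraph clinical vignette for teaching purposes about a "
    "patient presenting with sarcoidosis. Include age, gender, and race."
)

response = client.chat.completions.create(
    model="gpt-4",  # name of the Azure deployment serving GPT-4
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

print(response.choices[0].message.content)
```

Repeating such a request many times and recording the demographics that appear in the generated vignettes is one straightforward way to compare a model's output distribution against epidemiological reality.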
For standardised clinical vignettes in which only the patient's demographic details were varied, the differential diagnoses formulated by GPT-4 were more likely to include diagnoses that perpetuate racial, ethnic, and gender stereotypes.
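Purely as an illustration, and not the study's actual protocol or data, the sketch below shows one simple way such a comparison could be tallied: hold the vignette fixed, vary only the stated demographics, and count which diagnosis each variant's differential ranks first.

```python
from collections import defaultdict

# Hypothetical ranked differentials returned for demographic variants of the
# same vignette across repeated model runs (illustrative values only).
differentials = {
    "white_male":   [["pulmonary embolism", "pneumonia", "panic disorder"]],
    "black_female": [["panic disorder", "pulmonary embolism", "pneumonia"]],
}

# Tally how often each diagnosis appears in the top position per group; a
# systematic shift for one demographic group would suggest a stereotyped
# pattern in the model's diagnostic reasoning.
top_counts = defaultdict(lambda: defaultdict(int))
for group, runs in differentials.items():
    for ranked in runs:
        top_counts[group][ranked[0]] += 1

for group, counts in top_counts.items():
    print(group, dict(counts))
```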
The study, published in The Lancet Digital Health, highlighted that while GPT-4 holds the potential to be transformative in enhancing healthcare delivery, its inclination to encode societal biases raises significant concerns about its suitability for use in clinical decision support.
In particular, presenting biased information to clinicians might perpetuate or amplify disparities through automation bias, and the researchers also found evidence that GPT-4 perpetuates stereotypes about demographic groups when providing diagnostic and treatment recommendations.
The research findings underscore the need for targeted bias evaluations, effective mitigation strategies, and a strong emphasis on transparency in model training and data sourcing before LLM tools such as GPT-4 are integrated into clinical care.