This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
GPT-4 reinforces historical gender bias in diagnosing cardiovascular symptoms in women
Citations: 0
Authors: 12
Year: 2025
Abstract
Background: Large language models (LLMs) such as GPT-4 are increasingly used in clinical settings and medical education. They show potential in generating patient cases for exams and textbooks, formulating diagnostic reasoning, and managing treatment plans. There is concern, however, that LLM training data may perpetuate historical biases in the interpretation of women's cardiovascular symptoms, potentially leading to inaccurate or delayed diagnoses and inequitable outcomes for female patients.

Purpose: This study aimed to 1) evaluate GPT-4's representation of male versus female patients when generating simulated cardiovascular cases for medical education and 2) assess GPT-4's diagnostic performance across genders when real patient data are fed into the model.

Methods: Using the Azure OpenAI API (Python), GPT-4 was prompted to generate 15,000 simulated cases spanning 15 cardiovascular conditions selected for known gender-based differences in prevalence. The resulting gender distributions were compared with U.S. prevalence data (from large CDC/STS datasets) using FDR-corrected χ² tests. Next, 10 real cardiovascular patient notes were drawn from the MIMIC-IV-Note database of more than 330,000 de-identified notes. Patient gender was systematically swapped in otherwise identical notes, and GPT-4 then generated a differential of the 10 most probable diagnoses (n=2,000 prompts in total). Diagnostic accuracy by gender was evaluated by comparing GPT-4's outputs with the actual discharge diagnoses using FDR-corrected Mann-Whitney tests.

Results: Across the 15 conditions, GPT-4's modeled gender distributions deviated significantly from real-world data (p<0.0001). In 14 of the 15 conditions (93%), males were overrepresented by a mean of 30% (SD 8.6%), while females were underrepresented by 31% (SD 8.7%). For instance, 90% of GPT-4-generated heart failure cases were male, against a real-world prevalence of roughly 50% (p<1.0E-84). Altering only the reported gender in real patient notes produced significant differences in diagnostic accuracy for 2 of the 10 sampled cases (20%). Female patients were diagnosed less accurately than males for aortic dissection (p=0.017; mean rank difference = 1.2 [SD 4.1]) and myocardial infarction (MI) (p=0.032; mean rank difference = 1.4 [SD 5.0]). For example, GPT-4 more often misdiagnosed female MI patients with anxiety instead of MI (anxiety p=0.014; mean rank difference = 0.9 [SD 3.5]).

Conclusions: GPT-4 substantially underrepresented women in its simulated cardiovascular cases and showed lower diagnostic accuracy for female patients in certain critical conditions. These findings suggest that LLM-generated educational materials and diagnostic support risk perpetuating historical biases in cardiovascular care. Future work must focus on identifying, mitigating, and monitoring such biases before LLMs are widely deployed in medical education and clinical practice.

[Workflow figure: prevalence & diagnosis analysis; gender bias in prevalence & diagnosis]
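The abstract names the Azure OpenAI Python API and FDR-corrected χ²/Mann-Whitney tests but does not include the study's code. The following is a minimal illustrative sketch of how such a pipeline could be wired together; the deployment name, prompt wording, placeholder credentials, and dummy rank data are all assumptions for illustration and do not come from the study.

```python
# Illustrative sketch only: deployment name, prompt text, and the dummy data
# below are assumptions, not the study's published code.
from openai import AzureOpenAI
from scipy.stats import chisquare, mannwhitneyu
from statsmodels.stats.multitest import fdrcorrection

client = AzureOpenAI(
    api_key="<key>",                                  # placeholder credentials
    api_version="2024-02-01",
    azure_endpoint="https://<resource>.openai.azure.com",
)

def differential_diagnosis(note_text: str) -> str:
    """Ask GPT-4 for the 10 most probable diagnoses for one patient note."""
    response = client.chat.completions.create(
        model="gpt-4",                                # deployment name (assumption)
        messages=[{
            "role": "user",
            "content": ("List the 10 most probable diagnoses, most likely "
                        f"first, for this patient note:\n\n{note_text}"),
        }],
    )
    return response.choices[0].message.content

# dx = differential_diagnosis(note_with_gender_swapped)  # requires credentials

# Experiment 1 (prevalence): compare generated gender counts for one condition
# against the counts implied by real-world prevalence, via a chi-square test.
observed = [900, 100]   # e.g. GPT-4 generated 90% male heart-failure cases
expected = [500, 500]   # counts implied by ~50% real-world male prevalence
_, p_prevalence = chisquare(observed, f_exp=expected)

# Experiment 2 (diagnosis): for each condition, compare the rank of the true
# discharge diagnosis in GPT-4's differential across genders (Mann-Whitney U).
ranks_by_condition = {                                # dummy ranks, illustration only
    "myocardial infarction": ([1, 2, 1, 3], [2, 4, 3, 5]),
}
p_values = []
for male_ranks, female_ranks in ranks_by_condition.values():
    _, p = mannwhitneyu(male_ranks, female_ranks)
    p_values.append(p)

# FDR-correct across all conditions tested, as described in the abstract.
rejected, p_adjusted = fdrcorrection(p_values)
```

SciPy and statsmodels provide the named tests directly, so under these assumptions the only study-specific logic would be the prompt construction and the extraction of diagnosis ranks from GPT-4's output.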
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations