This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
GPT-4 reinforces historical gender bias in diagnosing cardiovascular symptoms in women
Citations: 0
Authors: 12
Year: 2025
Abstract
Background: Large language models (LLMs) such as GPT-4 are increasingly used in clinical settings and medical education. They show potential in generating patient cases for exams and textbooks, formulating diagnostic reasoning, and managing treatment plans. There is concern, however, that LLM training data may perpetuate historical biases in the interpretation of women's cardiovascular symptoms, potentially leading to inaccurate or delayed diagnoses and inequitable outcomes for female patients.

Purpose: This study aimed to 1) evaluate GPT-4's representation of male versus female patients when generating simulated cardiovascular cases for medical education and 2) assess GPT-4's diagnostic performance across genders when real patient data are fed into the model.

Methods: Using the Azure OpenAI API (Python), GPT-4 was prompted to generate 15,000 simulated cases spanning 15 cardiovascular conditions selected for known gender-based differences in prevalence. The resulting gender distributions were compared with U.S. prevalence data (from large CDC/STS datasets) using FDR-corrected χ² tests. Next, 10 real cardiovascular patient notes were drawn from the MIMIC-IV-Note database of more than 330,000 de-identified notes. Patient gender was systematically swapped in otherwise identical notes, and GPT-4 then generated a differential of the 10 most probable diagnoses (n=2,000 prompts in total). Diagnostic accuracy by gender was evaluated by comparing GPT-4's outputs with the actual discharge diagnoses using FDR-corrected Mann-Whitney tests.

Results: Across the 15 conditions, GPT-4's modeled gender distributions deviated significantly from real-world data (p<0.0001). In 14 of the 15 conditions (93%), males were overrepresented by a mean of 30% (SD 8.6%), while females were underrepresented by 31% (SD 8.7%). For instance, 90% of GPT-4-generated heart failure cases were male, against a real-world prevalence of roughly 50% (p<1.0E-84). Altering only the reported gender in real patient notes produced significant differences in diagnostic accuracy for 2 of the 10 sampled cases (20%). Female patients were diagnosed less accurately than males for aortic dissection (p=0.017; mean rank difference = 1.2 [SD 4.1]) and myocardial infarction (MI) (p=0.032; mean rank difference = 1.4 [SD 5.0]). For example, GPT-4 more often misdiagnosed female MI patients with anxiety instead of MI (anxiety p=0.014; mean rank difference = 0.9 [SD 3.5]).

Conclusions: GPT-4 substantially underrepresented women in its simulated cardiovascular cases and showed lower diagnostic accuracy for female patients in certain critical conditions. These findings suggest that LLM-generated educational materials and diagnostic support risk perpetuating historical biases in cardiovascular care. Future work must focus on identifying, mitigating, and monitoring such biases before LLMs are widely deployed in medical education and clinical practice.

[Workflow figure: prevalence & diagnosis analysis; gender bias in prevalence & diagnosis]
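The abstract names the Azure OpenAI Python API and FDR-corrected χ²/Mann-Whitney tests but does not include the study's code. The following is a minimal illustrative sketch of how such a pipeline could be wired together; the deployment name, prompt wording, placeholder credentials, and dummy rank data are all assumptions for illustration and do not come from the study.

```python
# Illustrative sketch only: deployment name, prompt text, and the dummy data
# below are assumptions, not the study's published code.
from openai import AzureOpenAI
from scipy.stats import chisquare, mannwhitneyu
from statsmodels.stats.multitest import fdrcorrection

client = AzureOpenAI(
    api_key="<key>",                                  # placeholder credentials
    api_version="2024-02-01",
    azure_endpoint="https://<resource>.openai.azure.com",
)

def differential_diagnosis(note_text: str) -> str:
    """Ask GPT-4 for the 10 most probable diagnoses for one patient note."""
    response = client.chat.completions.create(
        model="gpt-4",                                # deployment name (assumption)
        messages=[{
            "role": "user",
            "content": ("List the 10 most probable diagnoses, most likely "
                        f"first, for this patient note:\n\n{note_text}"),
        }],
    )
    return response.choices[0].message.content

# dx = differential_diagnosis(note_with_gender_swapped)  # requires credentials

# Experiment 1 (prevalence): compare generated gender counts for one condition
# against the counts implied by real-world prevalence, via a chi-square test.
observed = [900, 100]   # e.g. GPT-4 generated 90% male heart-failure cases
expected = [500, 500]   # counts implied by ~50% real-world male prevalence
_, p_prevalence = chisquare(observed, f_exp=expected)

# Experiment 2 (diagnosis): for each condition, compare the rank of the true
# discharge diagnosis in GPT-4's differential across genders (Mann-Whitney U).
ranks_by_condition = {                                # dummy ranks, illustration only
    "myocardial infarction": ([1, 2, 1, 3], [2, 4, 3, 5]),
}
p_values = []
for male_ranks, female_ranks in ranks_by_condition.values():
    _, p = mannwhitneyu(male_ranks, female_ranks)
    p_values.append(p)

# FDR-correct across all conditions tested, as described in the abstract.
rejected, p_adjusted = fdrcorrection(p_values)
```

SciPy and statsmodels provide the named tests directly, so under these assumptions the only study-specific logic would be the prompt construction and the extraction of diagnosis ranks from GPT-4's output.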
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations