This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluation of algorithmic bias in large language models for retinal clinical recommendations
0 citations · 6 authors · 2026
Abstract
To assess whether contemporary large language models (LLMs) exhibit demographic or socioeconomic bias when making clinical recommendations for retinal disease and to characterize model-specific equity and reliability profiles.

Cross-sectional computational evaluation performed in July 2025. Ten expert-designed clinical vignettes representing common retinal conditions (e.g., age-related macular degeneration, diabetic macular edema) were combined with 1,440 systematically varied patient demographic and socioeconomic profiles (varying race, gender identity, insurance status, housing/clinic proximity, occupation, and age), yielding 14,400 unique simulated patient prompts. Each unique prompt was submitted via stateless API calls to four LLMs (ChatGPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, DeepSeek-V2), generating 57,600 recommendations. The primary outcome was concordance with a predefined expert reference decision. Mixed-effects logistic regression modeled associations between patient factors and concordance, adjusting for vignette as a random effect. Demographic instability scores were calculated to quantify recommendation changes driven solely by non-clinical factors.

Overall concordance differed significantly across models (p < 0.001): 56.8% for Claude 4 Sonnet, 53.5% for ChatGPT-4o, 50.5% for DeepSeek-V2, and 45.9% for Gemini 2.5 Flash. In the pooled model, lack of health insurance (OR 0.80; 95% CI 0.76-0.85; p < 0.001), unstable housing far from clinic (OR 0.84; 95% CI 0.79-0.91; p < 0.001), and low-income occupation (OR 0.93; 95% CI 0.87-0.98; p = 0.014) were associated with lower odds of a concordant recommendation. Black patients were associated with higher odds of concordance compared with White patients (OR 1.20; 95% CI 1.11-1.30; p < 0.001).

Major LLMs display moderate concordance with reference retinal management decisions but demonstrate potential for substantial, model-specific, and context-dependent demographic and socioeconomic biases.
These findings suggest that equity-focused evaluation beyond traditional accuracy metrics may be warranted before clinical deployment.
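The factorial design described in the abstract (10 vignettes × 1,440 profiles × 4 models) can be sketched as a simple prompt grid. This is a minimal illustrative reconstruction, not the study's actual code: the per-axis level counts below are assumptions chosen only so that their product matches the reported 1,440 profiles, since the abstract does not list them.

```python
from itertools import product

# Assumed level counts per demographic/socioeconomic axis; the abstract
# names the six axes but not their sizes. These values are hypothetical,
# chosen so that 5*3*2*2*4*6 = 1,440 as reported.
axes = {
    "race": 5,             # assumed
    "gender_identity": 3,  # assumed
    "insurance": 2,        # e.g., insured vs. uninsured (assumed)
    "housing": 2,          # e.g., stable/near vs. unstable/far (assumed)
    "occupation": 4,       # assumed
    "age": 6,              # assumed age bands
}

# Full factorial combination of all axis levels -> one profile per tuple.
profiles = list(product(*(range(n) for n in axes.values())))

n_vignettes, n_models = 10, 4
n_prompts = n_vignettes * len(profiles)  # unique simulated patient prompts
n_recs = n_prompts * n_models            # total model recommendations

print(len(profiles), n_prompts, n_recs)  # → 1440 14400 57600
```

Each of the 14,400 prompts would then be sent as an independent, stateless API call to each of the four models, reproducing the 57,600-recommendation total reported above.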
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations