This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluation of algorithmic bias in large language models for retinal clinical recommendations
0 citations · 6 authors · 2026
Abstract
To assess whether contemporary large language models (LLMs) exhibit demographic or socioeconomic bias when making clinical recommendations for retinal disease and to characterize model-specific equity and reliability profiles.

Cross-sectional computational evaluation performed in July 2025. Ten expert-designed clinical vignettes representing common retinal conditions (e.g., age-related macular degeneration, diabetic macular edema) were combined with 1,440 systematically varied patient demographic and socioeconomic profiles (varying race, gender identity, insurance status, housing/clinic proximity, occupation, and age), yielding 14,400 unique simulated patient prompts. Each unique prompt was submitted via stateless API calls to four LLMs (ChatGPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, DeepSeek-V2), generating 57,600 recommendations. The primary outcome was concordance with a predefined expert reference decision. Mixed-effects logistic regression modeled associations between patient factors and concordance, adjusting for vignette as a random effect. Demographic instability scores were calculated to quantify recommendation changes driven solely by non-clinical factors.

Overall concordance differed significantly across models (p < 0.001): 56.8% for Claude 4 Sonnet, 53.5% for ChatGPT-4o, 50.5% for DeepSeek-V2, and 45.9% for Gemini 2.5 Flash. In the pooled model, lack of health insurance (OR 0.80; 95% CI 0.76-0.85; p < 0.001), unstable housing far from clinic (OR 0.84; 95% CI 0.79-0.91; p < 0.001), and low-income occupation (OR 0.93; 95% CI 0.87-0.98; p = 0.014) were associated with lower odds of a concordant recommendation. Black patients were associated with higher odds of concordance compared with White patients (OR 1.20; 95% CI 1.11-1.30; p < 0.001).

Major LLMs display moderate concordance with reference retinal management decisions but demonstrate potential for substantial, model-specific, and context-dependent demographic and socioeconomic biases.
These findings suggest that equity-focused evaluation beyond traditional accuracy metrics may be warranted before clinical deployment.
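The factorial design described in the abstract (10 vignettes × 1,440 profiles × 4 models) can be sketched as a simple prompt grid. This is a minimal illustrative reconstruction, not the study's actual code: the per-axis level counts below are assumptions chosen only so that their product matches the reported 1,440 profiles, since the abstract does not list them.

```python
from itertools import product

# Assumed level counts per demographic/socioeconomic axis; the abstract
# names the six axes but not their sizes. These values are hypothetical,
# chosen so that 5*3*2*2*4*6 = 1,440 as reported.
axes = {
    "race": 5,             # assumed
    "gender_identity": 3,  # assumed
    "insurance": 2,        # e.g., insured vs. uninsured (assumed)
    "housing": 2,          # e.g., stable/near vs. unstable/far (assumed)
    "occupation": 4,       # assumed
    "age": 6,              # assumed age bands
}

# Full factorial combination of all axis levels -> one profile per tuple.
profiles = list(product(*(range(n) for n in axes.values())))

n_vignettes, n_models = 10, 4
n_prompts = n_vignettes * len(profiles)  # unique simulated patient prompts
n_recs = n_prompts * n_models            # total model recommendations

print(len(profiles), n_prompts, n_recs)  # → 1440 14400 57600
```

Each of the 14,400 prompts would then be sent as an independent, stateless API call to each of the four models, reproducing the 57,600-recommendation total reported above.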
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations