OpenAlex · Updated hourly · Last updated: 27.03.2026, 20:46

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Diagnosis and Triage Performance of Contemporary Large Language Models on Short Clinical Vignettes

2025 · 1 citation · Journal of Medical Systems · Open Access

Citations: 1 · Authors: 3 · Year: 2025

Abstract

General-purpose large language models (LLMs) are increasingly proposed for diagnostic and triage decision support, yet their reliability relative to humans remains unclear. We evaluated eight contemporary LLMs (ChatGPT-4, ChatGPT-o1, DeepSeek-V3, DeepSeek-R1, Gemini-2.0, Copilot, Grok-2, Llama-3.1) on 48 single-turn clinical vignettes spanning four triage levels (Emergent, 1-day, 1-week, Self-care). Models were tested without prompts and with structured prompts comprising exemplar cases. Primary outcomes were diagnostic and triage accuracy. Secondary measures included confusion matrices, over-triage, safety of advice, and the Capability Comparison Score (CCS). Structured prompting improved performance across models: mean diagnostic accuracy increased from 89.84% to 91.67%, and mean triage accuracy increased from 76.82% to 86.20%. The best diagnostic accuracy was 93.75% (ChatGPT-o1 and DeepSeek-R1; Grok-2 matched this when prompted). Prompting shifted models toward safety: safety of advice rose from 89.06% to 94.53%, accompanied by higher over-triage (from 53.15% to 65.62%). CCS values were numerically lower than accuracy but preserved rankings and conclusions (diagnosis CCS: from 49.54 to 50.46; triage CCS: from 47.66 to 52.34). Error analyses showed predominant over-triage, with rarer but clinically important under-triage. On concise, text-only vignettes, the diagnostic accuracy of advanced LLMs was high, in some cases nearing benchmarks set by physicians in prior studies, whereas triage remained a more significant challenge. Structured prompting provided a practical, training-free lever to enhance robustness. Future work should evaluate uncertainty-aware prompting and real-world, multi-turn/multi-modality cases to strengthen clinical reliability.
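The triage outcomes named in the abstract (accuracy, over-triage, under-triage, confusion matrices) can be illustrated with a minimal sketch. The abstract does not give the study's exact definitions, so this sketch assumes that over-triage means the model recommended more urgent care than the reference label, and that the over-triage rate is a share of triage errors rather than of all cases. The function name `triage_metrics` and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Triage levels from the abstract, ordered from most to least urgent.
LEVELS = ["Emergent", "1-day", "1-week", "Self-care"]
URGENCY = {name: rank for rank, name in enumerate(LEVELS)}  # 0 = most urgent


def triage_metrics(reference, predicted):
    """Return accuracy, over-/under-triage shares among errors, and a confusion matrix.

    Assumption (not confirmed by the abstract): the over-/under-triage shares
    are computed over misclassified cases only.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("reference and predicted must be non-empty and equal in length")
    # Confusion matrix as counts keyed by (reference level, predicted level).
    confusion = Counter(zip(reference, predicted))
    correct = over = under = 0
    for ref, pred in zip(reference, predicted):
        if URGENCY[pred] == URGENCY[ref]:
            correct += 1
        elif URGENCY[pred] < URGENCY[ref]:
            over += 1   # model advised more urgent care than the reference
        else:
            under += 1  # model advised less urgent care than the reference
    errors = over + under
    return {
        "accuracy": correct / len(reference),
        "over_triage_share": over / errors if errors else 0.0,
        "under_triage_share": under / errors if errors else 0.0,
        "confusion": confusion,
    }


# Made-up labels for illustration only; not data from the study.
reference = ["Emergent", "1-day", "1-week", "Self-care", "Self-care", "1-week"]
predicted = ["Emergent", "Emergent", "1-week", "1-day", "Self-care", "Self-care"]
result = triage_metrics(reference, predicted)
```

As a scale check, if per-model accuracy is the number of correct answers divided by the 48 vignettes, the reported best diagnostic accuracy of 93.75% corresponds to 45 correct diagnoses (45/48 = 0.9375).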

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Autopsy Techniques and Outcomes