This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of large language model in cross-specialty medical scenarios
Citations: 0
Authors: 9
Year: 2025
Abstract
Large language models (LLMs) demonstrate transformative potential in healthcare, yet their diagnostic and therapeutic accuracy across medical specialties remains inadequately characterized. This study aimed to compare the diagnostic and therapeutic capabilities of GPT-4o, GPT-3.5-Turbo, and Claude-3-Sonnet across 12 medical specialties using standardized clinical vignettes. Fifty PubMed-derived clinical cases from 2007 to 2024 were assessed. Two board-certified physicians independently evaluated LLM outputs, with a senior clinician adjudicating discrepancies. All LLMs received identical text-based case descriptions with or without images, generating free-text diagnostic and therapeutic recommendations for blinded, randomized evaluation. Among the three evaluated LLMs, GPT-4o demonstrated superior diagnostic accuracy (median 10; IQR, 7.5–10), outperforming Claude-3-Sonnet (median 8; IQR, 2.8–10; P = .02) and GPT-3.5-Turbo (median 4; IQR, 1–9.3; P < .0001). A narrow IQR and minimal variation (SD = 2.9; range = 5.0) reflected high consistency in diagnostic outputs across diverse medical fields. For therapeutic recommendations, GPT-4o (median 10; IQR, 0–10) outperformed GPT-3.5-Turbo (median 0; IQR, 0–6.3; P = .0005) but showed no significant advantage over Claude-3-Sonnet (median 5; IQR, 0–10; P = .45). This study demonstrates that advanced LLMs, particularly GPT-4o, have significant potential to support clinical diagnostics, showing high accuracy and consistency across specialties. However, their inconsistent performance in generating therapeutic recommendations presents a major barrier to clinical adoption.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,393 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,259 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,688 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,502 citations