This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Diagnostic performance of newly developed large language models in critical illness cases: A comparative study
Citations: 0
Authors: 3
Year: 2025
Abstract
• To the best of our knowledge, this is the first study to directly compare the diagnostic performance of multiple LLMs specifically for critically ill patients in ICU settings.
• This work is distinct in focusing on complex ICU cases, where diagnostic uncertainty peaks, and in examining how AI could assist frontline providers in managing undifferentiated critical illness before specialist consultation.
• These newly developed models, especially the reasoning models (ChatGPT-o3, DeepSeek-R1), show strong potential in supporting complex diagnostic tasks in critical illness and may serve as valuable clinical decision support tools. Notably, the open-source reasoning model DeepSeek-R1 performed competitively; its free availability and ability to be deployed locally make it particularly well suited to resource-limited ICU settings.
• Further prospective studies in real clinical settings are needed to evaluate the practical utility and cost-effectiveness of different LLMs for ICU diagnostic support, particularly when used by physicians with varying prompt-engineering expertise. These studies should incorporate more clinically relevant evaluation approaches, which may require further development to fully capture LLMs’ diagnostic capabilities, and should employ larger sample sizes with rigorous statistical analysis to properly assess their clinical value.

Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have shown promise, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.
In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated on 50 critical illness cases in ICU settings drawn from published literature. Diagnostic accuracy and response quality were compared across models.

A total of 50 critical illness cases were included. ChatGPT-o3 achieved the highest top-diagnosis accuracy at 72% (36/50; 95% CI 60.0%–84.0%), followed by DeepSeek-R1 at 68% (34/50; 95% CI 54.0%–80.0%), ChatGPT-4o at 64% (32/50; 95% CI 50.0%–76.0%), and DeepSeek-V3 at 32% (16/50; 95% CI 20.0%–46.0%). ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o all significantly outperformed DeepSeek-V3, with no significant differences among the three.

The median differential diagnosis quality score was 5.0 for ChatGPT-o3 (IQR 5.0–5.0; 95% CI 5.0–5.0), DeepSeek-R1 (IQR 5.0–5.0; 95% CI 5.0–5.0), and ChatGPT-4o (IQR 4.0–5.0; 95% CI 4.5–5.0), and 4.0 for DeepSeek-V3 (IQR 3.0–5.0; 95% CI 4.0–5.0). ChatGPT-o3 and DeepSeek-R1 scored significantly higher than DeepSeek-V3; ChatGPT-4o showed a non-significant trend toward a higher score than DeepSeek-V3.

All models received high Likert ratings for response completeness, clarity, and usefulness. ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o each showed a trend toward better response quality than DeepSeek-V3, although no significant differences were observed among the models.

The newly developed models, especially the reasoning models, demonstrated strong potential in supporting diagnosis of critical illness cases in ICU settings. Domain-specific fine-tuning could further improve their diagnostic accuracy. Notably, the open-source reasoning model DeepSeek-R1 performed competitively, suggesting strong potential for scalable deployment in resource-limited settings.
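To show how much uncertainty sits around accuracies measured on only 50 cases, the sketch below computes a 95% confidence interval for each reported count. It uses the Wilson score interval, which is an assumption made for illustration: the abstract does not state which interval method the authors used, so the output is not expected to match the reported intervals exactly. The function name `wilson_interval` and the dictionary `reported_counts` are hypothetical; only the counts themselves come from the abstract.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Correct top diagnoses out of 50 cases, as reported in the abstract.
reported_counts = {
    "ChatGPT-o3": 36,
    "DeepSeek-R1": 34,
    "ChatGPT-4o": 32,
    "DeepSeek-V3": 16,
}

N_CASES = 50

for model, correct in reported_counts.items():
    low, high = wilson_interval(correct, N_CASES)
    print(f"{model}: {correct / N_CASES:.0%} (Wilson 95% CI {low:.1%} to {high:.1%})")
```

For ChatGPT-o3 (36/50) this gives an interval of roughly 58% to 83%, similar in width to the interval reported in the abstract. Intervals this wide are consistent with the finding that the three best models could not be separated statistically.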