This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance evaluation of large language models in the diagnosis of emergency internal medicine diseases: a retrospective study
Citations: 0
Authors: 12
Year: 2026
Abstract
Objective: Medical-domain large language models (LLMs) have demonstrated clinical decision-support capabilities in simulated case analyses and standardized tests, yet their diagnostic efficacy in real-world emergency settings remains insufficiently explored. This study evaluates the diagnostic performance of five mainstream LLMs (ChatGPT-4o, Gemini-2.0, Grok3, DeepSeek-V3, Doubao) against emergency department junior physicians (EDJPs) on real-world emergency internal medicine cases.
Methods: A single-center retrospective analysis was conducted. A total of 154 anonymized emergency internal medicine cases from the Second Affiliated Hospital of Zhejiang University School of Medicine, dating from January to May 2025, were included, covering common acute diseases of multiple systems. Fifteen EDJPs and five LLMs were selected to diagnose the cases. Accuracy of the main diagnosis, comprehensiveness of the differential diagnosis, and response time served as evaluation indicators. Non-parametric tests were used for statistical analysis.
Results: (1) Main diagnostic accuracy: DeepSeek-V3 (90.0%), ChatGPT-4o (86.0%), and Grok3 (86.0%) were significantly more accurate than EDJPs (77.5%, p < 0.05); in the respiratory system disease subgroup, Gemini-2.0 and DeepSeek-V3 performed better (p < 0.05). (2) Comprehensiveness of differential diagnosis: all LLMs scored significantly higher than EDJPs (p < 0.05), and the median scores of DeepSeek-V3, Gemini-2.0, and Grok3 reached 5.0 points. (3) Response time: LLMs (6.3–14.0 s) responded significantly faster than EDJPs (360.2 s, p < 0.05), with Doubao the fastest. Inter-rater reliability was good (ICC: 0.617–0.899).
Conclusion: This retrospective study shows that LLMs outperformed EDJPs in diagnostic accuracy, comprehensiveness of differential diagnosis, and response efficiency for emergency internal medicine diseases, demonstrating significant potential for clinical decision support. Subsequent efforts will focus on how to integrate LLMs effectively into physician-led collaborative workflows to enhance emergency care quality and efficiency.