This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Large language models in Chinese anesthesiology residency examinations: a comparative analysis of performance, reliability and clinical reasoning
Citations: 0 · Authors: 10 · Year: 2026
Abstract
Although large language models (LLMs) show potential in medical education, their effectiveness in Chinese anesthesiology Standardized Residency Training Program (SRTP) exams remains unexplored. This study aimed to assess the performance, consistency, and clinical reasoning capabilities of LLMs in this specific context. We conducted a multidimensional evaluation of three LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) using a 210-question Chinese mock SRTP exam. The exam encompassed four types of questions with increasing complexity: A1 (knowledge-recall), A2 (knowledge-application), A3/A4 (multi-step clinical scenarios), and case-analysis (complex reasoning with partial-credit scoring). Each model was tested across 30 repeated iterations to evaluate both accuracy and consistency. Their performance was compared to anonymized results from 32 human SRTP trainees. All LLMs exceeded the passing threshold (400/650 points). Among them, Claude 3.5 Sonnet achieved the highest mean score (495.4 ± 9.5), followed by Gemini 1.5 Pro (493.8 ± 4.3) and GPT-4o (482.4 ± 10.9). Gemini 1.5 Pro outperformed the others on A1 questions (82.5% median accuracy, P < 0.05 vs. both models) and image-based questions (60% median accuracy, P < 0.001 vs. both models), whereas Claude 3.5 Sonnet excelled in A2 questions (78.3% accuracy, P < 0.01 vs. both models). Performance decreased with increasing question complexity, with median accuracies of 55% for A3/A4 questions and 28.6–33.3% for case-analysis questions. In terms of consistency, Gemini 1.5 Pro answered 64.3% of its questions correctly in all 30 attempts, compared to 51.2% for Claude 3.5 Sonnet (P > 0.05 vs. Gemini 1.5 Pro) and 42.7% for GPT-4o (P = 0.04 vs. Gemini 1.5 Pro). Notably, all LLMs outperformed human trainees, who had an average score of 426.8 ± 25.9 (all P values < 0.001), although the gap narrowed for the most complex questions. 
State-of-the-art LLMs demonstrate high proficiency on Chinese anesthesiology SRTP exams, surpassing human performance in structured assessments. Their strengths in knowledge recall and potential for scalable feedback make them promising adjuncts for mitigating training disparities. However, the diminished performance in complex clinical reasoning tasks suggests that these models should complement rather than replace human-centric education.