OpenAlex · Updated hourly · Last updated: 28.03.2026, 07:46

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large language models in Chinese anesthesiology residency examinations: a comparative analysis of performance, reliability and clinical reasoning

2026 · 0 citations · BMC Medical Education · Open Access

Citations: 0 · Authors: 10 · Year: 2026

Abstract

Although large language models (LLMs) show potential in medical education, their effectiveness in Chinese anesthesiology Standardized Residency Training Program (SRTP) exams remains unexplored. This study aimed to assess the performance, consistency, and clinical reasoning capabilities of LLMs in this specific context.

We conducted a multidimensional evaluation of three LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) using a 210-question Chinese mock SRTP exam. The exam encompassed four types of questions with increasing complexity: A1 (knowledge-recall), A2 (knowledge-application), A3/A4 (multi-step clinical scenarios), and case-analysis (complex reasoning with partial-credit scoring). Each model was tested across 30 repeated iterations to evaluate both accuracy and consistency. Their performance was compared to anonymized results from 32 human SRTP trainees.

All LLMs exceeded the passing threshold (400/650 points). Among them, Claude 3.5 Sonnet achieved the highest mean score (495.4 ± 9.5), followed by Gemini 1.5 Pro (493.8 ± 4.3) and GPT-4o (482.4 ± 10.9). Gemini 1.5 Pro outperformed the others on A1 questions (82.5% median accuracy, P < 0.05 vs. both models) and image-based questions (60% median accuracy, P < 0.001 vs. both models), whereas Claude 3.5 Sonnet excelled in A2 questions (78.3% accuracy, P < 0.01 vs. both models). Performance decreased with increasing question complexity, with median accuracies of 55% for A3/A4 questions and 28.6–33.3% for case-analysis questions. In terms of consistency, Gemini 1.5 Pro answered 64.3% of its questions correctly in all 30 attempts, compared to 51.2% for Claude 3.5 Sonnet (P > 0.05 vs. Gemini 1.5 Pro) and 42.7% for GPT-4o (P = 0.04 vs. Gemini 1.5 Pro). Notably, all LLMs outperformed human trainees, who had an average score of 426.8 ± 25.9 (all P values < 0.001), although the gap narrowed for the most complex questions.
State-of-the-art LLMs demonstrate high proficiency on Chinese anesthesiology SRTP exams, surpassing human performance in structured assessments. Their strengths in knowledge recall and potential for scalable feedback make them promising adjuncts for mitigating training disparities. However, the diminished performance in complex clinical reasoning tasks suggests that these models should complement rather than replace human-centric education.
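The consistency metric reported in the abstract (the share of questions a model answered correctly in all 30 repeated attempts) can be sketched as follows. This is a minimal illustration assuming per-iteration, per-question correctness flags; the function name, data layout, and toy numbers are hypothetical and not taken from the paper.

```python
from typing import List

def all_correct_fraction(runs: List[List[bool]]) -> float:
    """Fraction of questions answered correctly in every repeated iteration.

    runs: one list of per-question correctness booleans per iteration
    (in the study's setup, 30 iterations over 210 questions).
    """
    n_questions = len(runs[0])
    # A question counts only if it was answered correctly in all iterations.
    always_correct = sum(
        all(run[q] for run in runs) for q in range(n_questions)
    )
    return always_correct / n_questions

# Toy example: 3 iterations over 4 questions.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
print(all_correct_fraction(runs))  # questions 0 and 3 are always correct → 0.5
```

With real data, running this per model over its 30 iterations would yield the reported 64.3%, 51.2%, and 42.7% figures.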
