This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Reasoning‐optimised large language models reach near‐expert accuracy on board‐style orthopaedic exams: A multi‐model comparison on 702 multiple‐choice questions

2025 · 1 citation · Knee Surgery, Sports Traumatology, Arthroscopy · Open Access

Abstract

Purpose: This study compared the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer reasoning‐optimised models, on text‐only orthopaedic multiple‐choice questions (MCQs) and quantified gains over GPT‐4.
Methods: From Orthobullets, 702 unique, non‐image MCQs were extracted, drawn from the AAOS Self‐Assessment Examinations, Self‐Assessment‐Based Questions and Orthopaedic In Training Examination‐Based Questions banks. Each question was submitted to four reasoning‐optimised LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with/without ‘Extended Thinking’) and Google Gemini 2.5 Pro. OpenAI's GPT‐4, GPT‐4o and the open‐weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic‐ and difficulty‐stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar, Cochran Q, ordinal logistic regression and Fleiss κ (Bonferroni‐adjusted α = 0.05).
Results: GPT‐4 achieved 69.7% accuracy (95% CI = 66.2–72.9). All four reasoning‐optimised models scored ≥14 percentage points higher (p < 3.3 × 10⁻¹⁵); OpenAI o3 led with 93.6% (95% CI = 91.5–95.2), which represents a 34% relative error reduction. Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT‐4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non‐zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro), and cost ranged from 0 to 29.9 USD per 1000 queries.
Conclusions: Reasoning‐optimised LLMs now answer text‐based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency‐cost disparities may limit clinical deployment.
Level of Evidence: N/A.
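
The abstract reports calibration as ECE and Brier score without stating how they were computed. The following is a minimal sketch of the standard definitions, not the authors' implementation. It assumes each answer comes with a self-reported confidence in [0, 1] for the chosen option, uses 10 equal‐width confidence bins for ECE, and uses the binary form of the Brier score; the function names and the sample data are hypothetical.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    n = len(confidences)
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / n


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins; the paper's binning is not stated
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: six answers with stated confidence and correctness.
sample_conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
sample_correct = [True, True, True, False, True, False]
sample_ece = expected_calibration_error(sample_conf, sample_correct)
sample_brier = brier_score(sample_conf, sample_correct)
```

Under these definitions, a model that states 90% confidence but is right only 75% of the time in that bin is overconfident and contributes a gap of 0.15 to the ECE, weighted by the share of answers in the bin. Lower values are better for both metrics.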

Topics

Clinical Reasoning and Diagnostic Skills; Artificial Intelligence in Healthcare and Education; Reliability and Agreement in Measurement
