This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Reasoning‐optimised large language models reach near‐expert accuracy on board‐style orthopaedic exams: A multi‐model comparison on 702 multiple‐choice questions
Citations: 1
Authors: 6
Year: 2025
Abstract
Purpose To compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer reasoning‐optimised models, on text‐only orthopaedic multiple‐choice questions (MCQs), and to quantify gains over GPT‐4.

Methods From Orthobullets, 702 unique, non‐image MCQs were extracted (drawn from the AAOS Self‐Assessment Examinations, Self‐Assessment‐Based Questions and Orthopaedic In Training Examination‐Based Questions banks). Each question was submitted to four reasoning‐optimised LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with/without ‘Extended Thinking’) and Google Gemini 2.5 Pro. OpenAI's GPT‐4 and GPT‐4o and the open‐weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic‐ and difficulty‐stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar, Cochran Q, ordinal logistic regression and Fleiss κ (Bonferroni‐adjusted α = 0.05).

Results GPT‐4 achieved 69.7% accuracy (95% CI = 66.2–72.9). All four reasoning‐optimised models scored ≥14 percentage points higher (p < 3.3 × 10⁻¹⁵); OpenAI o3 led with 93.6% (95% CI = 91.5–95.2), a 34% relative gain in accuracy over GPT‐4 (the error rate fell from 30.3% to 6.4%). Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT‐4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non‐zero flip rates. Median query time ranged from 0.9 s (Gemma 3 27B) to 15.9 s (Gemini 2.5 Pro), and cost ranged from 0 to 29.9 USD per 1000 queries.
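The headline accuracy figures can be checked with a little arithmetic. The sketch below is illustrative only: the raw counts (489 and 657 correct out of 702) are assumptions inferred from the rounded percentages, and the abstract does not say which confidence-interval method was used. A Wilson score interval happens to reproduce the reported bounds. The code also separates two quantities that are easy to confuse: going from 69.7% to 93.6% is roughly a 34% relative gain in accuracy, but roughly a 79% relative reduction in error rate.

```python
import math

N_QUESTIONS = 702      # number of MCQs, from the abstract
# Assumed raw counts, inferred from the rounded percentages (not reported in the abstract).
GPT4_CORRECT = 489     # 489 / 702 is about 69.7%
O3_CORRECT = 657       # 657 / 702 is about 93.6%


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def relative_accuracy_gain(acc_base, acc_new):
    """Accuracy improvement as a fraction of the baseline accuracy."""
    return (acc_new - acc_base) / acc_base


def relative_error_reduction(acc_base, acc_new):
    """Drop in error rate as a fraction of the baseline error rate."""
    err_base, err_new = 1 - acc_base, 1 - acc_new
    return (err_base - err_new) / err_base


if __name__ == "__main__":
    for name, k in [("GPT-4", GPT4_CORRECT), ("OpenAI o3", O3_CORRECT)]:
        lo, hi = wilson_interval(k, N_QUESTIONS)
        print(f"{name}: {k / N_QUESTIONS:.1%} (95% CI {lo:.1%} to {hi:.1%})")
    print(f"relative accuracy gain:   {relative_accuracy_gain(0.697, 0.936):.1%}")
    print(f"relative error reduction: {relative_error_reduction(0.697, 0.936):.1%}")
```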
Conclusions Reasoning‐optimised LLMs now answer text‐based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity and large latency‐cost disparities may limit clinical deployment. Level of Evidence N/A.
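For readers unfamiliar with the calibration and reproducibility measures named in the abstract, the following is a minimal sketch of how ECE, the Brier score and a flip rate are commonly computed. It is not the authors' pipeline. The abstract does not state the number of bins, how model confidence was elicited, or the exact flip-rate definition, so the 10 equal-width bins, the function names and the toy data are all assumptions made for illustration.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (bin count is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    total = len(correct)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of items.
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def flip_rate(first_answers, retest_answers):
    """Share of retested questions whose answer changed (assumed definition)."""
    flips = sum(a != b for a, b in zip(first_answers, retest_answers))
    return flips / len(first_answers)


if __name__ == "__main__":
    # Toy data, invented for illustration; not taken from the study.
    conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
    hit = [1, 1, 1, 0, 1, 0]
    print("ECE:", round(expected_calibration_error(conf, hit), 3))
    print("Brier:", round(brier_score(conf, hit), 3))
    print("Flip rate:", flip_rate(["A", "B", "C", "D"], ["A", "C", "C", "D"]))
```

A lower ECE means stated confidence tracks observed accuracy more closely. An overconfident model, as GPT‐4 is described in the abstract, reports confidences above its per-bin accuracy.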
Author institutions
- University of Lisbon (PT)
- Erasmus Hospital (BE)
- Centre Hospitalier Universitaire Brugmann (BE)
- Institute for Biotechnology and Bioengineering (PT)
- University of Miyazaki (JP)
- Universitätsklinik Balgrist (CH)
- Centro Hospitalar Póvoa de Varzim Vila do Conde EPE
- University of Minho (PT)
- Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento (PT)
- University of Gothenburg (SE)