OpenAlex · Updated hourly · Last updated: 24.05.2026, 13:27

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Deep research capabilities in GPT‐5 thinking and Gemini 2.5 Pro improve citation integrity and concordance with American Academy of Orthopaedic Surgeons anterior cruciate ligament and rotator cuff guidelines

2026 · 1 citation · Knee Surgery Sports Traumatology Arthroscopy

Citations: 1 · Authors: 4 · Year: 2026

Abstract

PURPOSE: To assess whether large language models (LLMs) with advanced reasoning and live web search (LWS) provide recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS) for anterior cruciate ligament (ACL) and rotator cuff (RC) injury management.

METHODS: Recommendations from CPGs were extracted and developed into a total of 46 questions (n = 15 for ACL, n = 31 for RC). Four configurations were evaluated: GPT-5 Thinking, GPT-5 Thinking Deep Research, Gemini 2.5 Pro, Gemini 2.5 Pro Deep Research. Concordance with CPGs, the primary endpoint, was independently evaluated by two orthopaedic surgeons. Citation integrity, the secondary endpoint, was evaluated against four criteria: 1-relevance, ensuring the citation was congruent with the response; 2-accuracy, confirming the citation metadata were correct; 3-existence, to rule out hallucinations; and 4-source quality, ensuring the cited source is from a peer-reviewed journal. Blinding was performed by a third investigator, by anonymously randomising the order of LLM-generated responses for each CPG recommendation.

RESULTS: All LLMs answered ACL questions concordantly (100% [15/15]; 95% confidence interval [CI]: 78.2%-100%). For RC questions, GPT-5 Thinking and Gemini 2.5 Pro Deep Research each had one discordant answer (96.8% [30/31]; 95% CI: 83.3%-99.9%), whereas the other two configurations were fully concordant (100% [31/31]; 95% CI: 88.7%-100%). GPT-5 Thinking achieved 96.8% (231/239; 95% CI: 93.6%-98.6%) citation integrity, improving to 100% (176/176; 95% CI: 97.9%-100%) with Deep Research. Gemini 2.5 Pro showed substantially lower baseline performance (64.6% [173/268]; 95% CI: 58.5%-70.3%) but improved to 98.6% (274/278; 95% CI: 96.4%-99.6%) with Deep Research. Inter-rater agreement was perfect (κ = 1.0) across all domains, except for citation relevance, which maintained strong agreement (κ = 0.88).

CONCLUSIONS: Contemporary LLMs with agentic capabilities can deliver clinically aligned answers concordant with CPGs on ACL and RC injuries, recovering from previous hallucinations. Built-in LWS functions are particularly helpful in ensuring citation reliability. Although expert oversight remains imperative, Deep Research allows LLMs to be considered as a first-pass clinical reasoning companion.

LEVEL OF EVIDENCE: NA.
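
The abstract reports 95% confidence intervals for each proportion but does not say how they were computed. As an illustration only, the sketch below assumes an exact (Clopper-Pearson) binomial interval, which is a common choice for small samples and for proportions at or near 100%. The function names `binom_cdf` and `clopper_pearson` are hypothetical helpers written for this example, not taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n trials.

    Assumption: this is one plausible method; the abstract does not name
    the interval method the authors actually used.
    """

    def bisect(g):
        # g must be increasing in p with a sign change on (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


# Example: the ACL concordance result (15 of 15 questions concordant).
low, high = clopper_pearson(15, 15)
print(f"15/15: {low:.1%} to {high:.1%}")
```

Under this assumption, the 15/15 case gives a lower bound of about 78.2% and an upper bound of 100%, in line with the interval reported for the ACL questions. When every answer is concordant, the lower bound reduces to the closed form (alpha/2)^(1/n), which is why a perfect score on 15 questions still leaves a wide interval.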

Topics

Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Radiology practices and education