This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Deep research capabilities in GPT‐5 thinking and Gemini 2.5 Pro improve citation integrity and concordance with American Academy of Orthopaedic Surgeons anterior cruciate ligament and rotator cuff guidelines
Citations: 1
Authors: 4
Year: 2026
Abstract
PURPOSE: To assess whether large language models (LLMs) with advanced reasoning and live web search (LWS) provide recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS) for anterior cruciate ligament (ACL) and rotator cuff (RC) injury management.
METHODS: Recommendations from CPGs were extracted and developed into a total of 46 questions (n = 15 for ACL, n = 31 for RC). Four configurations were evaluated: GPT-5 Thinking, GPT-5 Thinking Deep Research, Gemini 2.5 Pro, and Gemini 2.5 Pro Deep Research. Concordance with CPGs, the primary endpoint, was independently evaluated by two orthopaedic surgeons. Citation integrity, the secondary endpoint, was evaluated against four criteria: (1) relevance, ensuring the citation was congruent with the response; (2) accuracy, confirming the citation metadata were correct; (3) existence, to rule out hallucinations; and (4) source quality, ensuring the cited source was from a peer-reviewed journal. Blinding was performed by a third investigator, who anonymised and randomised the order of LLM-generated responses for each CPG recommendation.
RESULTS: All LLMs answered ACL questions concordantly (100% [15/15]; 95% confidence interval [CI]: 78.2%-100%). For RC questions, GPT-5 Thinking and Gemini 2.5 Pro Deep Research each had one discordant answer (96.8% [30/31]; 95% CI: 83.3%-99.9%), whereas the other two configurations were fully concordant (100% [31/31]; 95% CI: 88.7%-100%). GPT-5 Thinking achieved 96.8% (231/239; 95% CI: 93.6%-98.6%) citation integrity, improving to 100% (176/176; 95% CI: 97.9%-100%) with Deep Research. Gemini 2.5 Pro showed substantially lower baseline performance (64.6% [173/268]; 95% CI: 58.5%-70.3%) but improved to 98.6% (274/278; 95% CI: 96.4%-99.6%) with Deep Research. Inter-rater agreement was perfect (κ = 1.0) across all domains, except for citation relevance, which maintained strong agreement (κ = 0.88).
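The abstract does not state how the 95% confidence intervals were computed. The reported bounds look consistent with exact (Clopper-Pearson) binomial intervals, so the following is a minimal sketch under that assumption, not the authors' analysis code. The function names `binom_cdf`, `bisect_decreasing`, and `clopper_pearson` are illustrative, and only the Python standard library is used.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_decreasing(f, lo: float = 0.0, hi: float = 1.0, iters: int = 60) -> float:
    """Approximate root of a function that falls from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: this is the interval type used in the abstract; the source
    does not name the method.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2 (0 when k == 0).
    lower = 0.0 if k == 0 else bisect_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2 (1 when k == n).
    upper = 1.0 if k == n else bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    # Counts taken from the RESULTS paragraph above.
    examples = [
        ("ACL, all configurations", 15, 15),
        ("RC, one discordant answer", 30, 31),
        ("Gemini 2.5 Pro citation integrity", 173, 268),
    ]
    for label, k, n in examples:
        lo, hi = clopper_pearson(k, n)
        print(f"{label}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

When every answer is concordant (k = n), the lower bound has the closed form (alpha/2)^(1/n), which for 15/15 gives about 78.2%. This shows why a perfect score on 15 questions still leaves a wide interval. Small differences in the last digit relative to the reported values may come from rounding conventions.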
CONCLUSIONS: Contemporary LLMs with agentic capabilities can deliver clinically aligned answers concordant with CPGs on ACL and RC injuries, recovering from previous hallucinations. Built-in LWS functions are particularly helpful in ensuring citation reliability. Although expert oversight remains imperative, Deep Research allows LLMs to be considered as a first-pass clinical reasoning companion.
LEVEL OF EVIDENCE: NA.