This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Augmenting Oncology Guideline Maintenance with Large Language Models: A Prospective Evaluation (Preprint)
Citations: 0
Authors: 4
Year: 2026
Abstract
BACKGROUND: Maintenance of oncology clinical practice guidelines (CPGs) is increasingly challenged by the rapid growth of trial data and therapeutic complexity. While large language models (LLMs) have shown promise in information retrieval, their utility in the rigorous, end-to-end workflow of guideline maintenance remains underexplored.

OBJECTIVE: This study aimed to systematically evaluate the performance of frontier LLMs in supporting oncology guideline maintenance. We sought to determine their reliability in predicting necessary guideline updates based on new evidence, their accuracy in extracting data from clinical trials, and their effectiveness as automated auditors for detecting errors in established guidelines.

METHODS: Using the Onkopedia peripheral T-cell lymphoma (PTCL) guideline as a natural experiment, we tasked frontier models with deep-research modes (Gemini 2.5 Pro, GPT o4-mini-high) to predict a guideline update in August 2025 based on the 2021 version. Predictions were validated against the official 2025 revision published in October 2025. Next, we benchmarked evidence-extraction accuracy across 80 pivotal trials using models of varying scale (27B–671B parameters vs. frontier). Finally, we deployed a stacked LLM workflow to audit 28 recently updated Onkopedia guidelines for linguistic and content-related errors.

RESULTS: In the predictive task, models captured 36.7–40% of substantive updates, often identifying landmark approvals but frequently overstating evidence. While frontier models demonstrated high accuracy (up to 99.2%) in extracting data from individual studies, substantially outperforming smaller open-source models, this precision declined during multi-source synthesis. As automated auditors of existing CPGs, the models identified a median of 16.5 formal errors per document and detected several clinically relevant inconsistencies (e.g., invalid scoring formulas, incorrect staging definitions).

CONCLUSIONS: LLMs currently lack the reasoning stability for autonomous guideline authoring due to deficits in complex synthesis. However, they are effective tools for high-fidelity evidence extraction and automated quality assurance, supporting a human-led, AI-augmented workflow for efficient guideline maintenance.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,316 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,177 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,575 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,468 citations