This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation
0
Citations
15
Authors
2026
Year
Abstract
Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain. Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers. Design: This expert-validated study evaluated six configurations of three leading LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive and non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer. Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before the LLM responses were evaluated. Each response was independently rated by 9–11 uro-oncology specialists on a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), yielding 10,500 evaluations. Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder cancer (56.7% optimal, 93% optimal/acceptable), penile cancer (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable overall performance (68%–70% optimal/acceptable). Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously given recency bias and the rapid evolution of LLM technology.
Conclusion: This study demonstrates the value of rigorous plain language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology.
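The abstract reports two aggregate metrics per model and tumor type: the share of ratings at Likert 5 ("optimal") and the share at 4–5 ("optimal/acceptable"). A minimal sketch of that aggregation is shown below; the function name and the sample ratings are illustrative, not taken from the study's data.

```python
from collections import Counter

def likert_summary(ratings):
    """Summarize a list of 1-5 Likert ratings into the two rates
    reported in the abstract: optimal (5) and optimal/acceptable (4-5)."""
    counts = Counter(ratings)
    n = len(ratings)
    optimal = counts[5] / n
    acceptable = (counts[4] + counts[5]) / n
    return optimal, acceptable

# Hypothetical ratings from expert evaluators (for illustration only)
ratings = [5, 4, 5, 3, 5, 4, 2, 5, 5, 4]
opt, acc = likert_summary(ratings)
print(f"optimal: {opt:.0%}, optimal/acceptable: {acc:.0%}")
# → optimal: 50%, optimal/acceptable: 80%
```

In the study itself these proportions would be computed per model configuration and per cancer type over the pooled expert evaluations.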
Related works
Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement
2009 · 63,009 citations
Cochrane Handbook for Systematic Reviews of Interventions
2008 · 25,031 citations
GRADE: an emerging consensus on rating quality of evidence and strength of recommendations
2008 · 21,182 citations
The National Comprehensive Cancer Network
1998 · 16,869 citations
Evidence based medicine: what it is and what it isn't
1996 · 15,534 citations
Authors
Institutions
- Universidad de Zaragoza (ES)
- Hospital Universitario Miguel Servet (ES)
- Instituto de Investigación Sanitaria Aragón (ES)
- Universidade de Santiago de Compostela (ES)
- Complejo Hospitalario Universitario de Santiago (ES)
- Hospital Universitario La Paz (ES)
- Instituto de Biomedicina de Sevilla (ES)
- Hospital Universitario Virgen del Rocío (ES)
- Puigvert Foundation (ES)
- Research Institute Hospital 12 de Octubre (ES)
- Hospital Universitario De Cabueñes (ES)
- Universidad de Oviedo (ES)
- Hospital Universitario Central de Asturias (ES)
- Hospital General Universitario Morales Meseguer (ES)
- Universidad San Pablo CEU (ES)
- Hospital Universitario 12 De Octubre (ES)
- Hospital Universitario HM Madrid (ES)
- Hospital Clínic de Barcelona (ES)
- Consorci Institut D'Investigacions Biomediques August Pi I Sunyer (ES)
- Universitat de Barcelona (ES)
- Asociación Española de Urología (ES)
- Hospital Universitario Puerta del Mar (ES)