This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks
2025 · 11 citations · 4 authors
Abstract
BACKGROUND: The evaluation of large language models (LLMs) in medicine has shifted from knowledge-based testing to practice-based assessment, changing how artificial intelligence readiness for clinical deployment is measured. While LLMs now routinely exceed human performance on medical licensing examinations, how that performance translates to clinical practice remains poorly characterized.

OBJECTIVE: This systematic review aims to categorize and analyze medical LLM benchmarks, examine performance patterns across evaluation paradigms, and identify gaps in current assessment methodologies.

METHODS: The protocol was registered at PROSPERO (CRD420251139729). Four databases (MEDLINE/PubMed, Embase/Ovid, Cochrane Library, and arXiv) were searched from inception to August 31, 2025, using keywords related to clinical medicine benchmarks in LLMs. Studies were included if they (1) investigated clinical medicine benchmarks in LLMs, (2) were published in English, and (3) were available in full text. Studies were excluded if they evaluated nonmedical domains or lacked benchmark validation. Methodological quality was assessed using the Mixed Methods Appraisal Tool (version 2018) by 2 independent reviewers (κ=0.91). Because heterogeneity in evaluation metrics precluded meta-analysis, a narrative synthesis was conducted using structured categorization of benchmark types.

RESULTS: From 3917 screened records, 39 medical LLM benchmarks were identified and categorized into 21 (54%) knowledge-based, 15 (38%) practice-based, and 3 (8%) hybrid frameworks. These benchmarks collectively encompass over 2.3 million questions across 45 languages and 172 medical specialties. Traditional knowledge-based benchmarks show saturation, with leading models achieving 84%-90% accuracy on United States Medical Licensing Examination (USMLE)-style examinations, approaching or exceeding average physician performance. However, practice-based assessments reveal lower and more varied success rates: DiagnosisArena 45.82% (95% CI 42.9%-48.8%), MedAgentBench 69.67% (95% CI 64.2%-74.6%), and HealthBench 60% (95% CI 58.6%-61.3%). Overall, practice-based benchmarks showed lower performance (45%-69%) than knowledge-based benchmarks (84%-90%). Task-specific analysis revealed differential performance patterns: factual retrieval maintained 85%-93% accuracy, clinical reasoning dropped to 50%-60%, diagnostic tasks achieved 45%-55% success, and safety assessment showed significant gaps at 40%-50% accuracy despite being life-critical. Geographic representation spans 6 continents, with 18 (46%) benchmarks incorporating non-English content. Quality assessment found that 26% (10/39) of benchmarks had insufficient methodological reporting for complete evaluation.

CONCLUSIONS: This systematic review provides the first comprehensive analysis quantifying the significant "knowledge-practice gap" in medical artificial intelligence: high performance on knowledge-based examinations (84%-90%) does not translate to clinical competence (45%-69%), with safety assessments at 40%-50%. Our findings provide quantitative evidence for regulators and health systems that examination scores are insufficient and misleading proxies for clinical readiness. This review concludes that autonomous deployment is not currently justifiable and that all evidence-based implementation strategies must mandate practice-oriented validation and robust human-in-the-loop oversight to ensure patient safety.

TRIAL REGISTRATION: PROSPERO CRD420251139729; https://www.crd.york.ac.uk/PROSPERO/view/CRD420251139729.
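The category shares and the size of the knowledge-practice gap quoted in the abstract can be recomputed from the reported counts and ranges. The following is a minimal sketch, not the authors' analysis code: the variable names and the `share` helper are illustrative assumptions, and the only inputs are figures stated in the abstract.

```python
# Counts reported in the abstract (39 benchmarks in total).
counts = {"knowledge-based": 21, "practice-based": 15, "hybrid": 3}
total = sum(counts.values())


def share(numerator, denominator):
    """Return a percentage rounded to the nearest whole number (hypothetical helper)."""
    return round(100 * numerator / denominator)


# Share of each benchmark category among all included benchmarks.
shares = {name: share(n, total) for name, n in counts.items()}

# Benchmarks with non-English content and with insufficient reporting.
non_english = share(18, total)
insufficient_reporting = share(10, total)

# Accuracy ranges in percent, as reported in the abstract.
knowledge_range = (84, 90)
practice_range = (45, 69)

# Smallest and largest gap, in percentage points, implied by the two ranges.
gap_min = knowledge_range[0] - practice_range[1]
gap_max = knowledge_range[1] - practice_range[0]

print(shares, non_english, insufficient_reporting, gap_min, gap_max)
```

Under these inputs the rounded shares match the percentages given in the abstract, and the two reported ranges imply a gap of between 15 and 45 percentage points. The gap bounds are a simple range subtraction, not a statistic reported by the review.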
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,700 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,605 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,133 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,873 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations