This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks

2025 · 11 citations · Journal of Medical Internet Research · Open Access

Abstract

BACKGROUND: The evaluation of large language models (LLMs) in medicine has shifted from knowledge-based testing to practice-based assessment, representing an evolution in how we measure artificial intelligence readiness for clinical deployment. While LLMs now routinely exceed human performance on medical licensing examinations, their translation to clinical practice remains poorly characterized.

OBJECTIVE: This systematic review aims to categorize and analyze medical LLM benchmarks, examining performance patterns across different evaluation paradigms and identifying gaps in current assessment methodologies.

METHODS: The protocol was registered at PROSPERO (CRD420251139729). Four databases (MEDLINE/PubMed, Embase/Ovid, Cochrane Library, and arXiv) were searched from inception to August 31, 2025, using keywords related to clinical medicine benchmarks in LLMs. Studies were included if they (1) investigated clinical medicine benchmarks in LLMs, (2) were published in English, and (3) were available in full text. Studies were excluded if they evaluated nonmedical domains or lacked benchmark validation. Methodological quality was assessed using the Mixed Methods Appraisal Tool (version 2018) by 2 independent reviewers (κ=0.91). Because heterogeneity in evaluation metrics prevented meta-analysis, a narrative synthesis was conducted using structured categorization of benchmark types.

RESULTS: From 3917 screened records, 39 medical LLM benchmarks were identified and categorized into 21 (54%) knowledge-based, 15 (38%) practice-based, and 3 (8%) hybrid frameworks. These benchmarks collectively encompass over 2.3 million questions across 45 languages and 172 medical specialties. Traditional knowledge-based benchmarks show saturation, with leading models achieving 84%-90% accuracy on USMLE (United States Medical Licensing Examination)-style examinations, approaching or exceeding average physician performance. Practice-based assessments, however, show lower and more varied success rates: DiagnosisArena 45.82% (95% CI 42.9%-48.8%), MedAgentBench 69.67% (95% CI 64.2%-74.6%), and HealthBench 60% (95% CI 58.6%-61.3%). Overall, practice-based benchmarks scored 45%-69%, compared with 84%-90% for knowledge-based benchmarks. Task-specific analysis revealed differential performance patterns: factual retrieval maintained 85%-93% accuracy, clinical reasoning dropped to 50%-60%, diagnostic tasks achieved 45%-55% success, and safety assessment showed significant gaps at 40%-50% accuracy despite being life-critical. Geographic representation spans 6 continents, with 18 (46%) benchmarks incorporating non-English content. Quality assessment found that 26% (10/39) of benchmarks had insufficient methodological reporting for complete evaluation.

CONCLUSIONS: This systematic review provides the first comprehensive analysis quantifying the significant "knowledge-practice gap" in medical artificial intelligence: high performance on knowledge-based examinations (84%-90%) does not translate to clinical competence (45%-69%), with safety assessments at 40%-50%. Our findings provide quantitative evidence for regulators and health systems that examination scores are insufficient and misleading proxies for clinical readiness. This review concludes that autonomous deployment is not currently justifiable and that all evidence-based implementation strategies must mandate practice-oriented validation and robust human-in-the-loop oversight to ensure patient safety.

TRIAL REGISTRATION: PROSPERO CRD420251139729; https://www.crd.york.ac.uk/PROSPERO/view/CRD420251139729.
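The arithmetic behind the headline figures can be checked with a short sketch. The Python below is illustrative only and is not taken from the article: it recomputes the category shares from the counts reported in the abstract and derives the smallest and largest gap implied by the two quoted accuracy ranges. The names `share` and `gap_bounds` are hypothetical helpers, and expressing the gap as bounds between ranges is an assumption of this sketch, not a measure the authors report.

```python
# Counts reported in the abstract (39 benchmarks in total).
TOTAL = 39
counts = {
    "knowledge-based": 21,
    "practice-based": 15,
    "hybrid": 3,
    "non-English content": 18,
    "insufficient reporting": 10,
}


def share(count, total=TOTAL):
    """Return the percentage share, rounded to the nearest integer."""
    return round(100 * count / total)


shares = {name: share(n) for name, n in counts.items()}

# Accuracy ranges (in percent) quoted in the abstract.
knowledge_range = (84, 90)
practice_range = (45, 69)


def gap_bounds(high, low):
    """Smallest and largest gap, in percentage points, between two ranges.

    Illustrative assumption: the abstract gives only the ranges,
    not a single gap figure.
    """
    return high[0] - low[1], high[1] - low[0]


min_gap, max_gap = gap_bounds(knowledge_range, practice_range)

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {counts[name]}/{TOTAL} = {pct}%")
    print(f"knowledge-practice gap: {min_gap} to {max_gap} percentage points")
```

Under these assumptions the rounded shares match the percentages in the abstract (54%, 38%, 8%, 46%, and 26%), and the two ranges are separated by 15 to 45 percentage points.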

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
