This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Large‐language‐models for pediatric diagnosis: Performance evaluation using real‐world clinical notes from common and rare cases

2026 · 1 citation · Pediatric Investigation · Open Access

Abstract

Importance: Rigorous evaluation of large language models (LLMs) in pediatric diagnosis using authentic clinical presentations remains limited, particularly regarding response consistency and rare disease recognition.

Objective: To evaluate the diagnostic accuracy, consistency, and clinical usability of LLMs as diagnostic support tools in pediatric medicine compared with human clinicians using real‐world cases.

Methods: This cross‐sectional study at Sant Joan de Déu Barcelona Children's Hospital evaluated four LLMs [DxGPT/GPT‐4 (0613), Claude‐3.5 Sonnet, GPT‐4o (0513), and o1‐preview] against 78 pediatric clinicians using 50 real clinical cases (25 rare diseases, 25 common conditions) from a single tertiary pediatric center. All cases were presented as Spanish intake‐style clinical summaries. Each case was queried three times per LLM and evaluated by clinicians with different experience levels. Performance was assessed using Top‐1 and Top‐5 diagnostic accuracy, response consistency (intraclass correlation coefficient), and qualitative evaluation. Extended clinical information was provided for 20 cases to assess diagnostic efficiency.

Results: Advanced LLMs significantly outperformed the clinicians in diagnostic accuracy. o1‐preview and Claude‐3.5 Sonnet achieved mean Top‐1 accuracies of 60.0% and 59.0%, respectively, compared with 48.2% for clinicians (odds ratios [ORs]: 2.99 and 2.75, both P < 0.001). Performance advantages were most pronounced for rare diseases, where o1‐preview demonstrated 6‐fold higher Top‐5 diagnostic odds than clinicians (OR: 6.00, P < 0.001). Extended clinical information improved the accuracy of both groups, particularly for rare diseases. Human‐artificial intelligence complementarity analysis revealed 94.3% union accuracy with o1‐preview, a 10‐percentage‐point uplift over clinicians alone. Clinicians rated DxGPT favorably (mean, 3.9/5), particularly for rare case support (4.1/5).

Interpretation: In this proof‐of‐concept study at a reference care center, newer LLMs outperformed previous models and human clinicians in complex pediatric diagnostics, particularly for rare diseases. These findings support further evaluation of LLMs as augmentative diagnostic tools in similar settings, with appropriate legal, ethical, and clinical oversight frameworks.
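The abstract reports Top‐1 and Top‐5 accuracy and a "union accuracy" for clinicians and an LLM combined, without giving formulas. The following is a minimal sketch of how such metrics are commonly computed, not the study's actual scoring procedure. The function names, the toy cases, and the use of exact string matching to decide whether a diagnosis is correct are all assumptions made for illustration; the study may well have relied on expert adjudication instead.

```python
from typing import List, Sequence


def top_k_hits(ranked: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> List[bool]:
    """Per case: is the reference diagnosis among the first k ranked candidates?"""
    # Assumption: correctness is exact string equality (hypothetical simplification).
    return [truth in candidates[:k] for candidates, truth in zip(ranked, truths)]


def accuracy(hits: Sequence[bool]) -> float:
    """Fraction of cases with a hit."""
    return sum(hits) / len(hits)


def union_accuracy(hits_a: Sequence[bool], hits_b: Sequence[bool]) -> float:
    """Fraction of cases solved by at least one of two raters."""
    return sum(a or b for a, b in zip(hits_a, hits_b)) / len(hits_a)


# Hypothetical toy data, NOT taken from the study.
truths = ["kawasaki disease", "asthma", "rett syndrome", "appendicitis"]
llm_ranked = [
    ["kawasaki disease", "scarlet fever"],
    ["bronchiolitis", "asthma"],
    ["angelman syndrome", "cerebral palsy"],
    ["appendicitis", "gastroenteritis"],
]
clinician_ranked = [
    ["scarlet fever", "measles"],
    ["asthma"],
    ["rett syndrome"],
    ["gastroenteritis", "mesenteric adenitis"],
]

llm_top1 = top_k_hits(llm_ranked, truths, k=1)
llm_top5 = top_k_hits(llm_ranked, truths, k=5)
clin_top1 = top_k_hits(clinician_ranked, truths, k=1)

print("LLM Top-1:", accuracy(llm_top1))
print("LLM Top-5:", accuracy(llm_top5))
print("Clinician Top-1:", accuracy(clin_top1))
print("Union Top-1:", union_accuracy(llm_top1, clin_top1))
```

In this toy example the two raters miss different cases, so the union is higher than either alone, which is the kind of complementarity the abstract describes.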

Topics

Genomics and Rare Diseases; Artificial Intelligence in Healthcare and Education; Clinical Reasoning and Diagnostic Skills
