This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Large‐language‐models for pediatric diagnosis: Performance evaluation using real‐world clinical notes from common and rare cases
Citations: 0
Authors: 13
Year: 2026
Abstract
Importance Rigorous evaluation of large language models (LLMs) in pediatric diagnosis using authentic clinical presentations remains limited, particularly regarding response consistency and rare disease recognition.
Objective To evaluate the diagnostic accuracy, consistency, and clinical usability of LLMs as diagnostic support tools in pediatric medicine compared with human clinicians using real‐world cases.
Methods This cross‐sectional study at Sant Joan de Déu Barcelona Children's Hospital evaluated four LLMs [DxGPT/GPT‐4 (0613), Claude‐3.5 Sonnet, GPT‐4o (0513), and o1‐preview] against 78 pediatric clinicians using 50 real clinical cases (25 rare diseases, 25 common conditions) from a single tertiary pediatric center. All cases were presented using Spanish intake‐style clinical summaries. Each case was queried three times per LLM and evaluated by clinicians with different experience levels. Performance was assessed using Top‐1 and Top‐5 diagnostic accuracy, response consistency (intraclass correlation coefficient), and qualitative evaluation. Extended clinical information was provided for 20 cases to assess diagnostic efficiency.
Results Advanced LLMs significantly outperformed the clinicians in terms of diagnostic accuracy. o1‐preview and Claude‐3.5 Sonnet achieved mean Top‐1 accuracies of 60.0% and 59.0%, respectively, compared to clinicians’ 48.2% (odds ratios [ORs]: 2.99 and 2.75, both P < 0.001). Performance advantages were most pronounced for rare diseases, where o1‐preview demonstrated 6‐fold higher Top‐5 diagnostic odds compared to clinicians (OR: 6.00, P < 0.001). Extended clinical information improved the accuracy of both groups, particularly for rare diseases. Human‐artificial intelligence complementarity analysis revealed 94.3% union accuracy with o1‐preview, representing a 10‐percentage‐point uplift over clinicians alone. Clinicians rated DxGPT favorably (mean, 3.9/5), particularly for rare case support (4.1/5).
Interpretation In this proof‐of‐concept study of a reference care center, newer LLMs outperformed previous models and human clinicians in complex pediatric diagnostics, particularly for rare diseases. These findings support further evaluation as augmentative diagnostic tools in similar settings, with appropriate legal, ethical, and clinical oversight frameworks.
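The Top‐1/Top‐5 accuracy and union accuracy metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the study's evaluation code: the function names, the exact string matching, and the toy cases below are all assumptions made for illustration, and the abstract does not describe how a match between a proposed and a reference diagnosis was judged.

```python
from typing import List, Sequence, Tuple

# One case = (ranked candidate diagnoses, reference diagnosis).
Case = Tuple[Sequence[str], str]


def top_k_hit(ranked: Sequence[str], truth: str, k: int) -> bool:
    """True if the reference diagnosis is among the first k candidates.

    Exact string matching is a simplification assumed for this sketch.
    """
    return truth in ranked[:k]


def top_k_accuracy(cases: Sequence[Case], k: int) -> float:
    """Fraction of cases whose reference diagnosis appears in the top k."""
    hits = [top_k_hit(ranked, truth, k) for ranked, truth in cases]
    return sum(hits) / len(hits)


def union_accuracy(model_hits: Sequence[bool], clinician_hits: Sequence[bool]) -> float:
    """Fraction of cases solved by the model, the clinician, or both."""
    either = [m or c for m, c in zip(model_hits, clinician_hits)]
    return sum(either) / len(either)


# Hypothetical toy data, NOT taken from the study.
toy_cases: List[Case] = [
    (["asthma", "bronchiolitis", "pneumonia"], "asthma"),        # Top-1 hit
    (["gastroenteritis", "appendicitis"], "appendicitis"),       # Top-5 hit only
    (["migraine", "tension headache"], "brain tumor"),           # miss
    (["kawasaki disease", "scarlet fever"], "kawasaki disease"), # Top-1 hit
]

top1 = top_k_accuracy(toy_cases, k=1)
top5 = top_k_accuracy(toy_cases, k=5)

model_top5_hits = [top_k_hit(r, t, 5) for r, t in toy_cases]
clinician_hits = [False, False, True, True]  # hypothetical clinician outcomes
union = union_accuracy(model_top5_hits, clinician_hits)

print(f"Top-1: {top1:.2f}, Top-5: {top5:.2f}, union: {union:.2f}")
```

In this toy setup the union exceeds either accuracy alone because the model and the clinician miss different cases, which is the kind of complementarity the abstract reports.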