This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Quantitative Synthesis of Large Language Model Performance in Medical Reasoning Tasks
Citations: 0
Authors: 8
Year: 2026
Abstract
This study presents a comprehensive evaluation of AI diagnostic models across diverse clinical cases sourced from Thieme (77 cases) and Elsevier (48 cases). The dataset spans frequent (42 cases), less frequent (44 cases), and rare (39 cases) conditions, ensuring balanced assessment. Diagnostic performance, treatment recommendation accuracy, and linguistic reliability were compared across multiple state-of-the-art AI systems. Results highlight strong overall diagnostic capabilities, with notable variations across specialties and disease frequencies. While Pediatrics consistently demonstrated the highest performance, Surgery emerged as the most challenging specialty. Among models, GPT-4o achieved superior diagnostic consensus, treatment recommendation accuracy, and linguistic precision, underscoring its clinical utility. The findings provide empirical benchmarks for advancing AI-based medical decision support systems.