This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination
Citations: 0
Authors: 8
Year: 2026
Abstract
The accuracy of the latest reasoning-enhanced large language models on national medical licensing examinations remains unknown, which is crucial for determining how close they are to serving as effective knowledge sources for medical education. This study aimed to evaluate the performance of four reasoning-enhanced large language models (LLMs)—GPT-5, Grok-4, Claude Opus 4.1, and Gemini 2.5 Pro—on the Japanese National Medical Examination (JNME), providing insights into their potential as educational resources and their future applicability in medical practice. We evaluated LLM performance using the 2019 and 2025 JNME (n = 793). Questions were entered into each model with chain-of-thought prompting enabled. Accuracy was assessed overall and by question type. Incorrect responses were qualitatively reviewed by a licensed physician and a medical student. From highest to lowest, the overall accuracies of the four LLMs were 97.2% for Gemini 2.5 Pro, 96.3% for GPT-5, 96.1% for Claude Opus 4.1, and 95.6% for Grok-4, with no significant pairwise differences. For image-based and non-image-based items, Gemini 2.5 Pro achieved the highest accuracies (96.1% and 97.6%, respectively), with no significant difference between the two, whereas the other three LLMs showed significantly lower accuracy on image-based items. Across difficulty levels, Gemini 2.5 Pro again achieved the highest accuracy (98.4% for easy, 97.3% for moderate, and 93.2% for difficult items). Within each LLM, accuracy on difficult questions was significantly lower than on easy questions. Common error patterns included providing unnecessary additional options in single-choice questions, misdiagnosis of X-ray or computed tomography images (primarily due to confusion regarding left–right laterality), and difficulties in prioritizing appropriate actions in clinical questions with complex contextual information.
Four LLMs released in 2025 surpassed the 95% benchmark on the JNME, and their near-perfect (approximately 99%) performance on basic medical knowledge questions highlights substantial potential for use as learning resources in foundational medical education. Gemini 2.5 Pro demonstrated the most consistent performance across question types, while Grok-4 showed greater variability. The concentration of errors in clinical questions indicates that LLMs still require substantial refinement and validation before their use can be extended to clinical reasoning or patient care.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,336 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,207 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,607 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,476 citations