This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination
Citations: 0
Authors: 8
Year: 2026
Abstract
The accuracy of the latest reasoning-enhanced large language models on national medical licensing examinations remains unknown, which is crucial for determining how close they are to serving as effective knowledge sources for medical education. This study aimed to evaluate the performance of four reasoning-enhanced large language models (LLMs)—GPT-5, Grok-4, Claude Opus 4.1, and Gemini 2.5 Pro—on the Japanese National Medical Examination (JNME), providing insights into their potential as educational resources and their future applicability in medical practice. We evaluated LLM performance using the 2019 and 2025 JNME (n = 793). Questions were entered into each model with chain-of-thought prompting enabled. Accuracy was assessed overall and by question type. Incorrect responses were qualitatively reviewed by a licensed physician and a medical student. From highest to lowest, the overall accuracies of the four LLMs were 97.2% for Gemini 2.5 Pro, 96.3% for GPT-5, 96.1% for Claude Opus 4.1, and 95.6% for Grok-4, with no significant pairwise differences. For image-based and non-image-based items, Gemini 2.5 Pro achieved the highest accuracies (96.1% and 97.6%, respectively), with no significant difference between the two, whereas the other three LLMs showed significantly lower accuracy on image-based items. Across difficulty levels, Gemini 2.5 Pro again achieved the highest accuracy (98.4% for easy, 97.3% for moderate, and 93.2% for difficult items). Within each LLM, accuracy on difficult questions was significantly lower than on easy questions. Common error patterns included providing unnecessary additional options in single-choice questions, misdiagnosis of X-ray or computed tomography images (primarily due to confusion regarding left–right laterality), and difficulties in prioritizing appropriate actions in clinical questions with complex contextual information.
Four LLMs released in 2025 surpassed the 95% benchmark on the JNME, and their near-perfect (approximately 99%) performance on basic medical knowledge questions highlights substantial potential for use as learning resources in foundational medical education. Gemini 2.5 Pro demonstrated the most consistent performance across question types, while Grok-4 showed greater variability. The concentration of errors in clinical questions indicates that LLMs still require substantial refinement and validation before their use can be extended to clinical reasoning or patient care.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,336 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,207 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,607 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,476 citations