This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparing Artificial Intelligence Large Language Models in Medical Training: A Performance Analysis of ChatGPT and DeepSeek on United States Medical Licensing Examination (USMLE) Style Questions
Citations: 2
Authors: 5
Year: 2025
Abstract
Introduction
The integration of artificial intelligence (AI) into medical education is reshaping how students prepare for standardized examinations. Prior studies have shown that AI models can achieve high accuracy on United States Medical Licensing Examination (USMLE) questions, highlighting their potential for examination preparation. ChatGPT (GPT), especially the 4o model, is one of the most widely used AI models; however, its accessibility is limited by subscription costs and regional censorship. DeepSeek (DS), a newer AI model, offers free access and has demonstrated comparable performance in general tasks. In this study, we compared the performance of GPT-4o and DS DeepThink R1 on the AMBOSS medical board preparation question bank to evaluate their potential and limitations as supplementary tools in medical education.

Methods
We extracted 1,079 USMLE-style multiple-choice questions from the AMBOSS question bank. Questions were categorized by USMLE Step 1 and Step 2 examinations and further grouped by topic, resulting in 36 categories. Each question was assigned a difficulty level (easy, intermediate, or hard) based on AMBOSS grading criteria. To ensure balanced representation, we randomly selected 10 questions per difficulty level per category. Questions and answer choices were copied verbatim from the AMBOSS website and entered into GPT-4o and DS R1 without modification. Model responses were scored as correct or incorrect, and correctness rates were compared across GPT-4o, DS R1, and AMBOSS user performance.

Results
Both GPT and DS outperformed AMBOSS users, with overall accuracies of 88.79%, 78.68%, and 56.98%, respectively. GPT performed significantly better than DS overall (t = 7.90, p < 0.0001). When stratified by examination type, GPT achieved significantly higher accuracy than DS on both Step 1 (0.89 vs. 0.78, p < 0.0001) and Step 2 (0.88 vs. 0.80, p < 0.0001). GPT consistently showed higher accuracy than DS at all three difficulty levels. However, when further stratified by examination type, statistical significance was observed only for intermediate (p = 0.0002) and hard (p = 0.0021) questions in both Step 1 and Step 2.

Conclusion
Our findings demonstrate that both AI models outperformed human learners, with GPT-4o showing superior accuracy, particularly on intermediate and hard questions. While DS underperformed relative to GPT, its free accessibility and competitive accuracy on easy questions suggest that it may serve as a viable alternative, particularly in resource-limited settings.
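The balanced selection described in the Methods (10 randomly chosen questions per difficulty level within each category) amounts to stratified random sampling. A minimal sketch of that step, assuming questions are represented as dicts with hypothetical `category` and `difficulty` fields (the paper does not specify its data format):

```python
import random
from collections import defaultdict

def stratified_sample(questions, per_stratum=10, seed=42):
    """Randomly select up to `per_stratum` questions for each
    (category, difficulty) stratum, mirroring the balanced
    selection described in the Methods."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[(q["category"], q["difficulty"])].append(q)
    sample = []
    for _, pool in sorted(strata.items()):
        k = min(per_stratum, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample

# Hypothetical toy data: 2 categories x 3 difficulty levels,
# 15 candidate questions per stratum.
questions = [
    {"category": c, "difficulty": d, "id": i}
    for i, (c, d) in enumerate(
        (c, d)
        for c in ["Cardiology", "Neurology"]
        for d in ["easy", "intermediate", "hard"]
        for _ in range(15)
    )
]
picked = stratified_sample(questions, per_stratum=10)
print(len(picked))  # 2 categories * 3 levels * 10 = 60
```

Sampling within each (category, difficulty) stratum rather than from the full pool keeps every topic and difficulty equally represented, which is what allows the accuracy comparisons to be stratified later.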
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations