This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparing Artificial Intelligence Large Language Models in Medical Training: A Performance Analysis of ChatGPT and DeepSeek on United States Medical Licensing Examination (USMLE) Style Questions
Citations: 2
Authors: 5
Year: 2025
Abstract
Introduction
The integration of artificial intelligence (AI) into medical education is reshaping how students prepare for standardized examinations. Prior studies have shown that AI models can achieve high accuracy on United States Medical Licensing Examination (USMLE) questions, highlighting their potential for examination preparation. ChatGPT (GPT), especially the 4o model, is one of the most widely used AI models; however, its accessibility is limited by subscription costs and regional censorship. DeepSeek (DS), a newer AI model, offers free access and has demonstrated comparable performance in general tasks. In this study, we compared the performance of GPT-4o and DS DeepThink R1 on the AMBOSS medical board preparation question bank to evaluate their potential and limitations as supplementary tools in medical education.

Methods
We extracted 1,079 USMLE-style multiple-choice questions from the AMBOSS question bank. Questions were categorized by USMLE Step 1 and Step 2 examinations and further grouped by topic, resulting in 36 categories. Each question was assigned a difficulty level (easy, intermediate, or hard) based on AMBOSS grading criteria. To ensure balanced representation, we randomly selected 10 questions per difficulty level per category. Questions and answer choices were copied verbatim from the AMBOSS website and entered into GPT-4o and DS R1 without modification. Model responses were scored as correct or incorrect, and correctness rates were compared across GPT-4o, DS R1, and AMBOSS user performance.

Results
Both GPT and DS outperformed AMBOSS users, with overall accuracies of 88.79%, 78.68%, and 56.98%, respectively. GPT performed significantly better than DS overall (t = 7.90, p < 0.0001). When stratified by examination type, GPT achieved significantly higher accuracy than DS on both Step 1 (0.89 vs. 0.78, p < 0.0001) and Step 2 (0.88 vs. 0.80, p < 0.0001). GPT consistently showed higher accuracy than DS at all three difficulty levels. However, when further stratified by examination type, statistical significance was observed only for intermediate (p = 0.0002) and hard (p = 0.0021) questions in both Step 1 and Step 2.

Conclusion
Our findings demonstrate that both AI models outperformed human learners, with GPT-4o showing superior accuracy, particularly on intermediate and hard questions. While DS underperformed relative to GPT, its free accessibility and competitive accuracy on easy questions suggest that it may serve as a viable alternative, particularly in resource-limited settings.
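The balanced selection described in the Methods (10 randomly chosen questions per difficulty level within each category) amounts to stratified random sampling. A minimal sketch of that step, assuming questions are represented as dicts with hypothetical `category` and `difficulty` fields (the paper does not specify its data format):

```python
import random
from collections import defaultdict

def stratified_sample(questions, per_stratum=10, seed=42):
    """Randomly select up to `per_stratum` questions for each
    (category, difficulty) stratum, mirroring the balanced
    selection described in the Methods."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[(q["category"], q["difficulty"])].append(q)
    sample = []
    for _, pool in sorted(strata.items()):
        k = min(per_stratum, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample

# Hypothetical toy data: 2 categories x 3 difficulty levels,
# 15 candidate questions per stratum.
questions = [
    {"category": c, "difficulty": d, "id": i}
    for i, (c, d) in enumerate(
        (c, d)
        for c in ["Cardiology", "Neurology"]
        for d in ["easy", "intermediate", "hard"]
        for _ in range(15)
    )
]
picked = stratified_sample(questions, per_stratum=10)
print(len(picked))  # 2 categories * 3 levels * 10 = 60
```

Sampling within each (category, difficulty) stratum rather than from the full pool keeps every topic and difficulty equally represented, which is what allows the accuracy comparisons to be stratified later.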
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations