This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of 5 AI Models on United States Medical Licensing Examination Step 1 Questions: Comparative Observational Study (Preprint)
Citations: 0
Authors: 6
Year: 2025
Abstract
BACKGROUND: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on United States Medical Licensing Examination (USMLE)–style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.

OBJECTIVE: This study aimed to evaluate and compare the performance of 5 publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 free 120-question set, assessing their accuracy and consistency across question types and medical subjects.

METHODS: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding 1 audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question 3 times to assess confidence and consistency. Questions were categorized as text-based or image-based and as case-based or information-based. Statistical analysis was performed using chi-square and Fisher exact tests, with Bonferroni adjustment for pairwise comparisons.

RESULTS: Grok achieved the highest score (109/119, 91.6%), followed by Copilot (101/119, 84.9%), Gemini (100/119, 84.0%), ChatGPT-4 (95/119, 79.8%), and DeepSeek (86/119, 72.3%). DeepSeek's lower score was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n=96), DeepSeek's accuracy increased to 89.6% (86/96), matching Copilot. Grok showed the highest accuracy on image-based (21/23, 91.3%) and case-based questions (70/78, 89.7%), with a statistically significant difference between Grok and DeepSeek on case-based items (P=.01). The models performed best in biostatistics and epidemiology (mean 5.8/6, 96.7%) and worst in musculoskeletal, skin, and connective tissue (mean 4.4/7, 62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (112/119, 94.1% consistency), improving its accuracy to 89.9% (107/119) on the third attempt.

CONCLUSIONS: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.
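The abstract does not include the authors' analysis code. As a rough illustration of the pairwise comparison described in METHODS (Fisher exact tests with Bonferroni adjustment), the Python sketch below compares the models' overall correct counts reported in RESULTS. The all-pairs comparison scheme and the resulting adjusted alpha of 0.005 are assumptions for illustration, not details reported by the authors.

from itertools import combinations
from scipy.stats import fisher_exact

# Overall correct answers per model out of 119 scored questions, as reported in the abstract.
overall_correct = {
    "Grok": 109,
    "Copilot": 101,
    "Gemini": 100,
    "ChatGPT-4": 95,
    "DeepSeek": 86,
}
TOTAL_QUESTIONS = 119

pairs = list(combinations(overall_correct, 2))   # 10 model pairs among 5 models
alpha_adjusted = 0.05 / len(pairs)               # Bonferroni-adjusted threshold (assumed): 0.005

for model_a, model_b in pairs:
    # 2x2 contingency table of correct vs. incorrect counts for the two models.
    table = [
        [overall_correct[model_a], TOTAL_QUESTIONS - overall_correct[model_a]],
        [overall_correct[model_b], TOTAL_QUESTIONS - overall_correct[model_b]],
    ]
    _, p_value = fisher_exact(table)
    verdict = "significant" if p_value < alpha_adjusted else "not significant"
    print(f"{model_a} vs {model_b}: P={p_value:.4f} ({verdict} at alpha={alpha_adjusted:.3f})")

The paper's actual analysis also stratified by question type (image-based vs. text-based, case-based vs. information-based); the same 2x2 construction would apply there with the per-category denominators.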
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations