This is an overview page with metadata for this scientific article. The full article is available from the publisher.
When AI models take the exam: large language models vs medical students on multiple-choice course exams
Citations: 4
Authors: 10
Year: 2025
Abstract
= 442) were summarized as mean ± SD or median (IQR). Pairwise differences between models were explored with McNemar's test; student–LLM contrasts were descriptive. Across courses, LLMs consistently exceeded the student median and, in several instances, the highest student score. Mean LLM course scores ranged from 7.46 to 9.88, versus student means of 4.28 to 7.32. OpenAI o1 achieved the highest mean in three courses; Copilot led in Cardiovascular Medicine (text-only subset due to image limitations). All LLMs answered every MCQ, and short-term test–retest agreement was high (AC1 0.79–1.00). Aggregated across courses, LLMs averaged 8.75 compared with 5.76 for students. On department-set Spanish MCQ exams with negative marking, LLMs outperformed enrolled medical students, answered every item, and showed high short-term reproducibility. These findings support cautious, faculty-supervised use of LLMs as adjuncts to MCQ assessment (e.g., automated pretesting, feedback). Confirmation across institutions, languages, and image-rich formats, and evaluation of educational impact beyond accuracy, are needed.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,674 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,583 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,105 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,862 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations