This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparison of Large Language Models’ Performance on Neurosurgical Board Examination Questions
Citations: 2
Authors: 2
Year: 2025
Abstract
Background: Multiple-choice board examinations are a primary objective measure of competency in medicine. Large language models (LLMs) have demonstrated rapid improvements in performance on medical board examinations in the past two years. We evaluated five leading LLMs on neurosurgical board exam questions.

Methods: We evaluated five LLMs (OpenAI o1, OpenEvidence, Claude 3.5 Sonnet, Gemini 2.0, and xAI Grok2) on 500 multiple-choice questions from the Self-Assessment in Neurological Surgery (SANS) American Board of Neurological Surgery (ABNS) Primary Board Examination Review. Performance was analyzed across 12 subspecialty categories and compared to established passing thresholds.

Results: All models exceeded the threshold for passing, with OpenAI o1 achieving the highest accuracy (87.6%), followed by OpenEvidence (84.2%), Claude 3.5 Sonnet (83.2%), Gemini 2.0 (81.0%), and xAI Grok2 (79.0%). Performance was strongest in the Other General (97.4%) and Peripheral Nerve (97.1%) categories, while Neuroradiology showed the lowest accuracy (57.4%) across all models.

Conclusions: State-of-the-art LLMs continue to improve, and all models demonstrated strong performance on neurosurgical board examination questions. Medical image analysis continues to be a limitation of current LLMs. The current level of LLM performance challenges the relevance of written board examinations in trainee evaluation and suggests that LLMs are ready for implementation in clinical medicine and medical education.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,549 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,941 cit.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 cit.