This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Performance of Large Language Models on Board-Style Obstetrics and Gynecology Questions: A Cross-Sectional Study (Preprint)
Citations: 0
Authors: 11
Year: 2026
Abstract
<sec> <title>BACKGROUND</title> Large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated strong performance on general medical knowledge assessments, but their accuracy in specialty-specific domains such as Obstetrics and Gynecology (OBGYN) is less well characterized. Prior studies suggest high overall performance, but topic-specific proficiency across OBGYN subspecialties has not yet been evaluated, highlighting the need to assess performance in order to inform safe integration into resident education and use. </sec> <sec> <title>OBJECTIVE</title> To benchmark the accuracy of contemporary LLMs on OBGYN knowledge using board-style question stems across subspecialty domains. </sec> <sec> <title>METHODS</title> We selected 50 questions from each of six Personal Review of Learning in Obstetrics and Gynecology (PROLOG) volumes, covering core OBGYN topics (300 questions total). Three LLMs (ChatGPT-4, Claude 3.5, and Llama 3.1) were prompted to answer the full set of 300 questions in topic-based blocks of 50. This was repeated over six independent sessions, totaling 1,800 question entries per model, to obtain an average performance measure and minimize memory bias. Model responses to each question were graded against the answer key provided in the PROLOG volumes using a binary scoring system at the individual question level: a response was 'correct' only if it matched the single best answer as defined by the PROLOG volume, and 'incorrect' otherwise. Average performance across sessions was compared against the 2024 national Council on Resident Education in Obstetrics and Gynecology (CREOG) resident exam average as a contextual benchmark. Kruskal-Wallis tests, pairwise comparisons, and effect size comparisons using Cohen's d were used to assess differences in performance across models and topics. </sec> <sec> <title>RESULTS</title> Overall accuracies were 76% (Claude 3.5), 70% (ChatGPT-4), and 67% (Llama 3.1).
Claude 3.5 outperformed the other models overall and in most topic areas, with the largest differences observed in Obstetrics and Reproductive Endocrinology. Accuracy was highest in Patient Management in the Office (84–86% across models) and lowest in Urogynecology and Pelvic Reconstructive Surgery (59–69%). Although comparisons are limited because PROLOG and CREOG questions are not identical, the reported national CREOG average serves as an indirect contextual benchmark. Within this context, average LLM performance on PROLOG questions (67–76%) exceeded the reported national CREOG average across all resident levels (66%), but ChatGPT-4 (70%) and Llama 3.1 (67%) did not reach the average performance level of a PGY-4 resident (71%). </sec> <sec> <title>CONCLUSIONS</title> LLM accuracy overlapped with reported national CREOG averages. Claude 3.5 outperformed ChatGPT-4 and Llama 3.1, exceeding PGY-4 accuracy. While promising as educational adjuncts, LLMs currently operate at a trainee level and should complement, not replace, traditional clinical training. </sec>
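The binary scoring and session-averaging scheme described in the Methods can be sketched as follows. This is a minimal illustration with hypothetical toy data, not the study's actual questions, responses, or answer keys; function names are assumptions for illustration only.

```python
# Sketch of the abstract's grading scheme: each response is scored 1 if it
# matches the keyed single best answer and 0 otherwise, then accuracy is
# averaged over independent sessions (the study used 6 sessions x 300
# questions = 1,800 entries per model).

def session_accuracy(responses, answer_key):
    """Binary scoring: fraction of responses matching the keyed best answer."""
    return sum(r == k for r, k in zip(responses, answer_key)) / len(answer_key)

def average_accuracy(sessions, answer_key):
    """Mean accuracy across independent sessions."""
    return sum(session_accuracy(s, answer_key) for s in sessions) / len(sessions)

# Toy data (hypothetical, not from the study):
key = ["B", "D", "A", "C"]
sessions = [
    ["B", "D", "A", "A"],  # 3/4 correct
    ["B", "C", "A", "C"],  # 3/4 correct
]
print(average_accuracy(sessions, key))  # 0.75
```

Averaging over repeated independent sessions, as the authors do, smooths out run-to-run variability in model outputs before comparing against the CREOG benchmark.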