This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Performance of Large Language Models on Board-Style Obstetrics and Gynecology Questions: A Cross-Sectional Study (Preprint)
Citations: 0
Authors: 11
Year: 2026
Abstract
<sec> <title>BACKGROUND</title> Large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated strong performance on general medical knowledge assessments, but their accuracy in specialty-specific domains such as Obstetrics and Gynecology (OBGYN) is less well characterized. Prior studies suggest high overall performance, but topic-specific proficiency across OBGYN subspecialties has not yet been evaluated, highlighting the need to assess performance in order to inform safe integration into resident education and use. </sec> <sec> <title>OBJECTIVE</title> To benchmark the accuracy of contemporary LLMs on OBGYN knowledge using board-style question stems across subspecialty domains. </sec> <sec> <title>METHODS</title> We selected 50 questions from each of six Personal Review of Learning in Obstetrics and Gynecology (PROLOG) volumes, covering core OBGYN topics (300 questions total). Three LLMs (ChatGPT-4, Claude 3.5, and Llama 3.1) were prompted to answer the full set of 300 questions in topic-based blocks of 50. This was repeated over six independent sessions, totaling 1,800 question entries per model, to obtain an average performance measure and minimize memory bias. Model responses to each question were graded against the answer key provided in the PROLOG volumes using a binary scoring system at the individual question level: a response was 'correct' only if it matched the single best answer as defined by the PROLOG volume, and 'incorrect' otherwise. Average performance across sessions was compared against the 2024 national Council on Resident Education in Obstetrics and Gynecology (CREOG) resident exam average as a contextual benchmark. Kruskal-Wallis tests, pairwise comparisons, and effect size comparisons using Cohen's d were used to assess differences in performance across models and topics. </sec> <sec> <title>RESULTS</title> Overall accuracies were 76% (Claude 3.5), 70% (ChatGPT-4), and 67% (Llama 3.1).
Claude 3.5 outperformed the other models overall and in most topic areas, with the largest differences observed in Obstetrics and Reproductive Endocrinology. Accuracy was highest in Patient Management in the Office (84–86% across models) and lowest in Urogynecology and Pelvic Reconstructive Surgery (59–69%). Although comparisons are limited because PROLOG and CREOG questions are not identical, the reported national CREOG average serves as an indirect contextual benchmark. Within this context, average LLM performance on PROLOG questions (67–76%) exceeded the reported national CREOG average across all resident levels (66%), but ChatGPT-4 (70%) and Llama 3.1 (67%) did not reach the average performance level of a PGY-4 resident (71%). </sec> <sec> <title>CONCLUSIONS</title> LLM accuracy overlapped with reported national CREOG averages. Claude 3.5 outperformed ChatGPT-4 and Llama 3.1, exceeding PGY-4 accuracy. While promising as educational adjuncts, LLMs currently operate at a trainee level and should complement, not replace, traditional clinical training. </sec>
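The binary scoring and session-averaging scheme described in the Methods can be sketched as follows. This is a minimal illustration with hypothetical toy data, not the study's actual questions, responses, or answer keys; function names are assumptions for illustration only.

```python
# Sketch of the abstract's grading scheme: each response is scored 1 if it
# matches the keyed single best answer and 0 otherwise, then accuracy is
# averaged over independent sessions (the study used 6 sessions x 300
# questions = 1,800 entries per model).

def session_accuracy(responses, answer_key):
    """Binary scoring: fraction of responses matching the keyed best answer."""
    return sum(r == k for r, k in zip(responses, answer_key)) / len(answer_key)

def average_accuracy(sessions, answer_key):
    """Mean accuracy across independent sessions."""
    return sum(session_accuracy(s, answer_key) for s in sessions) / len(sessions)

# Toy data (hypothetical, not from the study):
key = ["B", "D", "A", "C"]
sessions = [
    ["B", "D", "A", "A"],  # 3/4 correct
    ["B", "C", "A", "C"],  # 3/4 correct
]
print(average_accuracy(sessions, key))  # 0.75
```

Averaging over repeated independent sessions, as the authors do, smooths out run-to-run variability in model outputs before comparing against the CREOG benchmark.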