OpenAlex · Updated hourly · Last updated: 17.05.2026, 09:58

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparison of GPT-5 and GPT-4o in Solving the Polish Centre for Medical Examinations (CEM) Gastroenterology Examination

2026 · 0 citations · 15 authors · Cureus · Open Access

Abstract

INTRODUCTION: Large language models (LLMs) are increasingly explored as tools for medical education and assessment. While prior studies have demonstrated strong performance of LLMs on undergraduate and general medical examinations, their reliability and calibration on specialty-level certification exams remain insufficiently characterized. In particular, little is known about how model-reported confidence aligns with correctness in high-stakes medical testing.

OBJECTIVE: This study compared the response accuracy and the calibration of self-reported confidence of GPT-4o and GPT-5 on a national specialty-level gastroenterology examination administered by the Polish Centre for Medical Examinations (CEM). The CEM gastroenterology exam was selected as a standardized, high-stakes certification assessment that evaluates advanced specialist knowledge and complex clinical problem-solving within a narrowly defined medical domain. Although previous studies have examined LLM performance on other national and international medical examinations, the Polish specialty examination in gastroenterology has not previously been analyzed as a distinct domain. The study therefore assesses how contemporary LLMs perform in national postgraduate specialty certification and provides a reference point for comparison with results from other specialties and examination systems.

METHODS: Both models were administered the 120 multiple-choice questions of the official CEM gastroenterology State Specialization Examination (PES). Accuracy was assessed against the official answer key, with 95% CIs calculated using the Wilson method. Paired differences in accuracy were analyzed with McNemar's test. Self-reported confidence was recorded on a 10-point scale, and point-biserial correlations were used to evaluate the relationship between confidence and correctness, with Bonferroni correction applied for multiple testing.

RESULTS: GPT-4o achieved an accuracy of 85.0% (102/120; 95% CI: 77.6-90.3) and GPT-5 an accuracy of 86.7% (104/120; 95% CI: 79.5-91.6). The difference in accuracy was not statistically significant (χ² = 1.0, p = 0.625). Mean confidence levels were similarly high for both models. The confidence-accuracy correlation was weak and non-significant for GPT-4o (r = 0.14), whereas GPT-5 showed a statistically significant positive correlation (r = 0.28) that remained significant after correction for multiple testing.

CONCLUSIONS: Both GPT-4o and GPT-5 exceeded the passing threshold of the CEM gastroenterology examination, demonstrating strong performance on a specialty-level medical assessment. Although overall accuracy was comparable, GPT-5 showed superior alignment between confidence and correctness, suggesting improved metacognitive reliability rather than a substantial gain in raw accuracy. These findings highlight the potential educational value of newer LLMs while underscoring important limitations, including the restricted sample size, the exam-specific context, and the lack of assessment of real-world clinical reasoning. Ethical concerns such as hallucinations, overconfidence, and inappropriate clinical reliance remain critical barriers to direct clinical deployment. Future research should address broader exam representativeness, task-difficulty stratification, and controlled integration of LLMs into postgraduate medical education.
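The headline statistics can be recomputed from the counts reported above. The following is a minimal Python sketch, not code from the study: the Wilson intervals and the McNemar statistic follow from the published totals, while the discordant-pair counts (1 and 3) are not published and are inferred here as the only pair consistent with both the two-question gap and the uncorrected χ² = 1.0.

    from math import comb, sqrt

    def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a binomial proportion k/n."""
        p = k / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return center - half, center + half

    def mcnemar(b: int, c: int) -> tuple[float, float]:
        """Uncorrected McNemar chi-square and exact two-sided binomial
        p-value for discordant-pair counts b and c."""
        n = b + c
        chi2 = (b - c) ** 2 / n
        p_exact = min(1.0, 2 * sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n)
        return chi2, p_exact

    # Accuracies and Wilson CIs from the published totals (102/120, 104/120).
    for name, correct in [("GPT-4o", 102), ("GPT-5", 104)]:
        lo, hi = wilson_ci(correct, 120)
        print(f"{name}: {correct / 120:.1%} (95% CI {lo:.1%}-{hi:.1%})")

    # b = questions only GPT-4o answered correctly, c = only GPT-5.
    # These counts are inferred, not published: b=1, c=3 is the unique pair
    # yielding both the reported chi-square of 1.0 and the two-question gap.
    chi2, p = mcnemar(1, 3)
    print(f"McNemar: chi2 = {chi2:.1f}, exact p = {p:.3f}")

The recomputed intervals agree with those in the abstract to within rounding of the last digit, and mcnemar(1, 3) reproduces χ² = 1.0 and p = 0.625 exactly. The point-biserial correlations used for the confidence analysis are equivalent to a Pearson correlation between the 10-point confidence ratings and the binary correctness indicator (available as scipy.stats.pointbiserialr).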
