OpenAlex · Updated hourly · Last updated: 23 May 2026, 19:07

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

The Effectiveness of the Multimodal Language Model, Google Gemini 2.5 Pro, in Solving the Specialization Exam in Gynecology and Obstetrics

2025 · 1 citation · Cureus · Open Access

1 citation · 14 authors · 2025

Abstract

BACKGROUND: Artificial intelligence (AI) models are developing rapidly, with a growing ability to process extensive and up-to-date medical knowledge. This makes them increasingly important as didactic tools, widely used by residents preparing for the Państwowy Egzamin Specjalizacyjny (PES), the Polish State Specialization Examination. However, concerns remain about the accuracy and reliability of AI-generated answers. Systematic validation is therefore essential, particularly in the context of exam questions, to ensure safe and effective use in medical education. The aim of this study was to assess the educational potential of Gemini 2.5 Pro in solving the PES in gynecology and obstetrics by comparing its responses with the official answer key and analyzing the declared confidence level for each answer.

MATERIALS AND METHODS: This study was designed to empirically verify the ability of the multimodal language model Gemini 2.5 Pro to solve examination tasks at a specialist level. The research material was the PES paper in obstetrics and gynecology (spring session 2025), provided by the Center for Medical Examinations in Łódź. The final analysis included 119 questions, after one question was invalidated by the examination committee. During the exam simulation, two parameters were recorded for each question: the agreement of the model's answer with the official key and the model's self-reported confidence on a 5-point scale. The collected data were used to check whether the model's effectiveness depended on the nature of the question (clinical vs. theoretical) using the chi-squared test, and to assess the association between declared confidence and answer correctness using the Mann-Whitney U test.

RESULTS: Gemini 2.5 Pro passed the exam with a score of 96.63%, achieving 115 of 119 points. The model's effectiveness was similar regardless of whether the question concerned a clinical case or theoretical knowledge (p=0.313). No statistically significant association between the declared confidence level and the correctness of the answer could be demonstrated (p=0.064). The model's confidence level also did not differ significantly between questions concerning a clinical case (12 questions) and those concerning theoretical knowledge (107 questions).

CONCLUSIONS: The results for the PES in gynecology and obstetrics show that Gemini 2.5 Pro achieved high effectiveness. A key observation was that no correlation could be demonstrated between the model's confidence level and the correctness of its answers. Additionally, no statistically significant difference in the model's confidence level was found between theoretical and clinical questions. The analysis indicates that AI has considerable potential to support specialized education. Despite these strong results, further in-depth research and constant substantive supervision by specialists are necessary to integrate AI safely into teaching programs.
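The score and the clinical-versus-theoretical comparison described in the abstract can be illustrated with a minimal sketch. The function names below are hypothetical, and the 2x2 table is an assumption: the abstract reports 12 clinical and 107 theoretical questions and 115 correct answers, but not how the 4 incorrect answers were distributed between the two categories. The split used here (1 clinical, 3 theoretical) was picked for illustration because an uncorrected Pearson chi-squared test on it gives a p-value close to the reported 0.313. The authors' actual table, software, and any continuity correction are not stated.

```python
import math


def score_percent(correct: int, total: int) -> float:
    """Share of correct answers, expressed as a percentage."""
    return 100.0 * correct / total


def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.

    No continuity correction is applied. Returns (statistic, p_value)
    for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) for chi-squared equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Reported in the abstract: 115 correct answers out of 119 scored questions.
print(f"score = {score_percent(115, 119):.4f}%")

# HYPOTHETICAL split of the 4 incorrect answers (not reported in the abstract).
# Rows: clinical, theoretical. Columns: correct, incorrect.
hypothetical_table = [[11, 1], [104, 3]]
stat, p = chi2_independence_2x2(hypothetical_table)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

Note that 115/119 evaluates to roughly 96.6387%, so the reported 96.63% matches truncation to two decimals rather than rounding. With expected cell counts below 5, as in this assumed table, Fisher's exact test is often preferred over the chi-squared approximation. The abstract names only the chi-squared test.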

Similar works