
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing consistency of AI chatbot responses in ophthalmology medical exams

2025 · 1 citation · 6 authors · AJO International · Open Access

Abstract

• Gemini 2.0 Experimental Advanced showed the highest accuracy and lowest variability.
• GPT-4 showed significantly lower test accuracy in the evening.
• Different user credentials did not produce a significant difference in performance.
• All models performed significantly better on text-based questions than on image-based ones.

Purpose: Large language models (LLMs) are increasingly evaluated in ophthalmology, often with single test iterations that overlook whether responses remain consistent under repeated conditions. We aimed to assess commonly used AI models across multiple testing iterations under varying conditions, including time of day, user credentials, and input type, to determine their stability in ophthalmology contexts.

Design: Comparative analysis study.

Methods: We tested GPT-4o, GPT-4, and Gemini 2.0 Experimental Advanced on 111 multiple-choice questions from the “fundamentals” section of the Israeli ophthalmology residency exams. Each model underwent 12 testing iterations, alternating daily testing time and user accounts to assess potential biases. Iteration-level consistency was assessed by the variation in accuracy across runs, while question-level consistency measured agreement in answers per question. Mixed-effects logistic regression estimated the effects of time of day, user account, and question modality. Question-level agreement was further analysed with Fleiss’ κ and response-pattern distributions.

Results: Gemini achieved the highest overall accuracy with the smallest variation (84.5%, SD 1.54), followed by GPT-4o (81.2%, SD 1.75) and GPT-4 (72.4%, SD 2.98). Mixed-effects models showed a significant evening performance decline in GPT-4 (OR 1.61, p=0.0045). No account-related differences were observed. All models performed markedly worse on image-based items than on text-based items (p<0.001). Question-level analysis revealed high raw consistency but lower chance-corrected consistency, especially for GPT-4.

Conclusions: The LLMs tested demonstrated stable outputs across repeated questioning, though with notable model-specific variability and consistent difficulty with image-based items. Future consistency testing should complement accuracy assessments when evaluating LLMs for potential integration into ophthalmology education and practice.
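As a rough illustration of the two consistency measures described in the Methods, the sketch below computes Fleiss' κ and the spread of per-run accuracy for a simulated answer matrix. This is not the paper's code: the data are invented, and only the shapes mirror the study design (111 questions × 12 iterations).

```python
# A minimal sketch (not the paper's code) of the two consistency measures:
# Fleiss' kappa across iterations and the spread of per-run accuracy.
# All data here are simulated; shapes mirror the study (111 questions x 12 runs).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Simulated answer matrix: one row per question, one column per iteration,
# answers coded 0-3 for a four-option multiple-choice item.
answers = rng.integers(0, 4, size=(111, 12))

# aggregate_raters turns raw (subject x rater) choices into per-question
# counts over answer categories, the table format fleiss_kappa expects.
table, _ = aggregate_raters(answers)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Question-level agreement (Fleiss' kappa): {kappa:.3f}")

# Iteration-level consistency: accuracy of each of the 12 runs against a
# (hypothetical) answer key, summarised as mean and standard deviation.
key = rng.integers(0, 4, size=111)
per_run_acc = (answers == key[:, None]).mean(axis=0)
print(f"Accuracy: mean {per_run_acc.mean():.1%}, SD {per_run_acc.std(ddof=1):.1%}")
```

With real data, a κ near 1 would indicate near-identical answers across the 12 runs, while the standard deviation of per-run accuracy corresponds to the iteration-level variation reported in the Results (e.g., SD 1.54 for Gemini).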
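The mixed-effects logistic regression in the Methods could be set up in several ways, and the paper's actual tooling is not stated on this page. A minimal stand-in using statsmodels' Bayesian mixed GLM with a per-question random intercept, with the file name and all column names assumed for illustration, might look like this:

```python
# A hedged stand-in for the mixed-effects logistic regression described in
# the abstract; the paper's software and column names are not given here,
# so everything below (file name, columns) is assumed for illustration.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical long-format data: one row per question x iteration, with a
# binary 'correct' outcome and indicators for evening runs, the second
# user account, and image-based items.
df = pd.read_csv("iterations.csv")

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ evening + account_b + image",   # fixed effects
    {"question": "0 + C(question)"},           # random intercept per question
    df,
)
result = model.fit_vb()  # variational Bayes fit
print(result.summary())
```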
