This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Assessing consistency of AI chatbot responses in ophthalmology medical exams
Citations: 1
Authors: 6
Year: 2025
Abstract
Highlights

• Gemini 2.0 Experimental Advanced showed the highest accuracy and lowest variability.
• GPT-4 showed significantly lower test accuracy in the evening.
• Different user credentials did not show a significant difference in performance.
• All models performed significantly better on text-based questions than on image-based ones.

Large language models (LLMs) are increasingly evaluated in ophthalmology, often using single test iterations that overlook whether responses remain consistent under repeated conditions. We aim to assess commonly used AI models under multiple testing iterations with varying conditions, including time of day, user credentials, and input type, to determine their stability in ophthalmology contexts.

Comparative analysis study. We tested GPT-4o, GPT-4, and Gemini 2.0 Experimental Advanced on 111 multiple-choice questions from the "fundamentals" section of the Israeli ophthalmology residency exams. Each model underwent 12 testing iterations, alternating daily testing time and user accounts to assess for potential biases. Iteration-level consistency was assessed by variation in accuracy across runs, while question-level consistency measured agreement in answers per question. Mixed-effects logistic regression estimated the effects of time of day, user account, and question modality. Question-level agreement was further analysed with Fleiss' κ and response-pattern distributions.

Gemini achieved the highest overall accuracy with the smallest variation (84.5%, SD 1.54), followed by GPT-4o (81.2%, SD 1.75) and GPT-4 (72.4%, SD 2.98). Mixed-effects models showed a significant evening performance decline in GPT-4 (OR 1.61, p=0.0045). No account-related differences were observed. All models performed markedly worse on image-based items than on text-based items (p<0.001). Question-level analysis revealed high raw consistency but lower chance-corrected consistency, especially for GPT-4.
The LLMs tested demonstrated stable outputs across repeated questioning, though with notable model-specific variability and consistent challenges on image-based items. Future consistency testing should complement accuracy assessments when evaluating LLMs for potential integration into ophthalmology education and practice.
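The question-level agreement statistic named in the abstract, Fleiss' κ, corrects raw agreement for chance. A minimal sketch of how it would be computed over repeated runs is below; the counts matrix is illustrative, not the paper's data:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Chance-corrected agreement for counts[question, answer_option],
    where each cell is how many iterations chose that option."""
    n = counts.sum(axis=1)[0]  # iterations (raters) per question
    # Mean per-question observed agreement
    p_bar = ((counts ** 2).sum(axis=1) - n).mean() / (n * (n - 1))
    # Expected agreement from the marginal option frequencies
    p_e = ((counts.sum(axis=0) / counts.sum()) ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 3 questions, 4 options (A-D), 12 iterations each
counts = np.array([
    [12, 0, 0, 0],  # perfectly consistent question
    [10, 2, 0, 0],  # mostly consistent
    [6, 4, 2, 0],   # unstable question
])
print(round(fleiss_kappa(counts), 3))  # → 0.112
```

Note how the third row drags κ down far more than raw per-question agreement would suggest, which mirrors the abstract's finding of high raw but lower corrected consistency.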
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,357 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,221 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,640 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,482 citations