This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance of large language models on ophthalmology clinical vignettes: A comparative evaluation of Indian indigenous artificial intelligence tools
Citations: 0
Authors: 3
Year: 2026
Abstract
The use of large language models (LLMs; e.g., ChatGPT), a subtype of generative artificial intelligence tools, in ophthalmology has evoked great interest, given that ophthalmology is a medical field that integrates multiple inputs, namely clinical history and imaging.[1] However, research on LLM applications in ophthalmology has largely focused on OpenAI’s ChatGPT since its public release in November 2022.[2] Little attention has been paid to the performance of indigenously developed Indian LLMs (e.g., Sarvam’s language model and Krutrim AI’s Kruti).[3,4]

From July to August 2025, we performed an experiment using OpenAI’s ChatGPT (version GPT-5) along with two indigenous Indian LLMs, Sarvam’s language model and Krutrim AI’s Kruti. We used standardized clinical cases and the accompanying multiple choice questions (MCQs) published in the Indian Journal of Ophthalmology’s “One Minute Ophthalmology” section as inputs for the LLMs [Figs. 1 and 2].[5] Images, when available, were used as inputs only for Kruti and ChatGPT, as Sarvam’s language model accepts only text inputs. LLM performance was benchmarked solely on the final selected MCQ option, which was compared against the published reference answers; explanatory text generated by the LLM was not evaluated. We also evaluated hallucination rates (i.e., the rate at which the LLM generated an answer that was not listed among the options provided in the prompt).

Figure 1: Representative screenshot illustrating standardized prompt entry and the final MCQ response generated by Sarvam’s language model, using the One Minute Ophthalmology case “The Great Masquerader” by Garg et al.

Figure 2: Representative screenshot illustrating standardized prompt entry and the final MCQ response generated by Krutrim AI’s Kruti, using the One Minute Ophthalmology case “The Great Masquerader” by Garg et al.

When tested on 95 clinical vignettes, ChatGPT, Sarvam’s language model, and Kruti achieved accuracies of 83%, 67%, and 71%, respectively (P = 0.01; Cochran’s Q test), with hallucination rates of 1%, 0%, and 2% (P = 0.4; Cochran’s Q test). In post-hoc analysis using McNemar’s test, ChatGPT outperformed Sarvam’s language model (P = 0.01) and Kruti (P = 0.02), while Sarvam’s language model and Kruti did not differ in performance (P = 0.70). For vignettes with images, the accuracies of ChatGPT, Sarvam’s language model, and Kruti were 83%, 71%, and 74%, respectively (P = 0.06; Cochran’s Q test). In multivariable logistic regression analysis, Kruti demonstrated lower odds of a correct response than ChatGPT (OR = 0.48; P = 0.04), while the presence of a clinical image in the case vignette (OR = 2.06; P = 0.1) and the corresponding author’s affiliation with an Indian institute (OR = 0.64; P = 0.2) were not significant predictors of model performance.
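For readers interested in how such an analysis can be implemented, the sketch below illustrates, on simulated data, one way to score per-vignette correctness and run the tests reported above (Cochran’s Q across the three models, pairwise McNemar post-hoc tests, and a multivariable logistic regression with ChatGPT as the reference category) in Python with pandas and statsmodels. The simulated responses, column names, and simple pooled logit model are illustrative assumptions only; this is not the authors’ actual pipeline and does not reproduce the reported results.

```python
"""Illustrative sketch (not the authors' code) of scoring and comparing
per-vignette MCQ results; all data below are simulated assumptions."""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
n = 95  # number of clinical vignettes in the study

# Hypothetical wide table: 1 = final MCQ option matched the reference answer.
# A hallucination would be flagged when the final answer is not among the
# listed options; an analogous 0/1 column could be summarized the same way.
wide = pd.DataFrame({
    "ChatGPT": rng.binomial(1, 0.83, n),
    "Sarvam":  rng.binomial(1, 0.67, n),
    "Kruti":   rng.binomial(1, 0.71, n),
    "has_image": rng.binomial(1, 0.5, n),           # vignette includes an image
    "indian_affiliation": rng.binomial(1, 0.5, n),  # corresponding author in India
})
models = ["ChatGPT", "Sarvam", "Kruti"]
print("Accuracy:", wide[models].mean().round(2).to_dict())

# Overall comparison of the three paired proportions (Cochran's Q test).
q = cochrans_q(wide[models].to_numpy())
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

# Post-hoc pairwise comparisons on the same vignettes (McNemar's exact test).
for a, b in [("ChatGPT", "Sarvam"), ("ChatGPT", "Kruti"), ("Sarvam", "Kruti")]:
    table = (pd.crosstab(wide[a], wide[b])
               .reindex(index=[0, 1], columns=[0, 1], fill_value=0))
    m = mcnemar(table.to_numpy(), exact=True)
    print(f"McNemar {a} vs {b}: p = {m.pvalue:.3f}")

# Multivariable logistic regression: odds of a correct response by model,
# image presence, and author affiliation, with ChatGPT as the reference.
long = wide.melt(id_vars=["has_image", "indian_affiliation"],
                 value_vars=models, var_name="model", value_name="correct")
fit = smf.logit(
    "correct ~ C(model, Treatment('ChatGPT')) + has_image + indian_affiliation",
    data=long,
).fit(disp=False)
print(np.exp(fit.params).round(2))  # odds ratios
```

Cochran’s Q and McNemar’s test suit this design because the same 95 vignettes were presented to all three models, yielding paired binary outcomes rather than independent samples; the pooled logit shown here ignores within-vignette correlation and is only a simplification for illustration.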
The superior performance of ChatGPT can be attributed to differences in the quality, quantity, and diversity of training data, in model size and scale, and in exposure to domain-specific fine-tuning (i.e., adapting a model by training it on ophthalmology-specific data to improve its performance and relevance in the field).[6,7] Likewise, the finding that images were not a predictor of LLM performance is consistent with prior work suggesting that LLMs often remain limited to processing unimodal data, typically text, and are often unable to effectively integrate diverse modalities such as images that are routinely encountered in clinical practice.[8] The LLMs likely prioritized the textual information over the images, suggesting reliance on text-dominant reasoning.[9] The corresponding author’s affiliation with an Indian institution was also not a predictor of LLM performance. This finding suggests that the models interpreted ophthalmic vignettes authored at Indian institutions as effectively as those from other regions, demonstrating future potential for wider LLM use in Indian clinics. It also implies low cultural or linguistic bias in the comprehension of standardized medical English, a reassuring finding given concerns about regional inequities in AI performance.[10] It further indicates that differences in model accuracy are more likely attributable to intrinsic factors such as training scale and domain specialization than to language or contextual variation in the input text.

The experiment was not without limitations. MCQs from a single journal do not capture the full spectrum of diagnostic reasoning and management employed in real-world practice; real-world clinical decision-making rarely involves such constrained, discrete choices. Only the final MCQ option selected by the LLM was evaluated for correctness against the journal’s published reference answer; the underlying reasoning or explanatory text was not assessed. Also, as new scientific evidence may render previously published answers incorrect and as LLMs are periodically updated, the findings may not be reproducible.

Although this experiment did not directly assess the applicability of the LLMs in routine clinical care, the findings provide insights relevant to the future integration of LLMs into routine ophthalmology practice. The consistently low hallucination rates among the LLMs are encouraging, as they represent a lower likelihood of clinical misinformation, one of the major barriers to the safe deployment of LLMs in clinical settings.[11,12] However, the widespread adoption of LLMs in clinical settings also relies on a low error rate.[12] While there is no established standard for what constitutes an acceptable error rate, we consider the range of 17% to 29% errors to be high for an ophthalmological clinical setting. It is important to recognize that these error rates reflect incorrect responses to discrete MCQ prompts; real-world clinical decision-making is a far more complex process. Additionally, the limited ability of LLMs to leverage visual data to sharpen clinical reasoning is a major barrier to wide-scale clinical adoption in image-heavy medical specialties such as ophthalmology.

In conclusion, ChatGPT outperformed indigenously developed Indian LLMs in answering MCQ-based ophthalmology clinical vignettes sourced from a single journal. The presence of clinical images or an author affiliation with an Indian institute did not affect model performance.
Despite low hallucination rates, the observed error rates and the reliance on text-dominant processing underscore the need for domain-specific optimization, improved multimodal integration, and human oversight before LLMs can be reliably adopted in ophthalmic clinical practice.

Financial support and sponsorship: Nil.

Conflicts of interest: There are no conflicts of interest.