This is an overview page with metadata for this scientific article. The full article is available from the publisher.
ChatGPT-5 versus other mainstream large language models in core diabetic retinopathy patient queries
Citations: 0
Authors: 8
Year: 2026
Abstract
Background: Diabetic retinopathy is a leading cause of preventable vision loss, and patients increasingly seek disease-related information through online consultations. Large language models may support patient education, but their reliability and usability vary across systems, particularly in disease-specific settings.

Methods: Thirty common patient questions about diabetic retinopathy were developed from guidelines and organized into five domains: disease overview, screening and diagnosis, treatment and follow-up, lifestyle and prevention, and prognosis and complication management. From November 10 to 15, 2025, two researchers independently submitted all questions to five models (ChatGPT-5, DeepSeek-V3.1, Doubao, Wenxinyiyan 4.5 Turbo, and Kimi) on public platforms under identical conditions without system prompts. Chat histories were reset before each question. Response time, response length, structural metrics, and table outputs were extracted. Two retinal specialists rated each answer on a 1-to-5 Likert scale across accuracy, logical consistency, coherence, safety, and content accessibility. Inter-rater agreement was assessed with the intraclass correlation coefficient. Group differences were analyzed using analysis of variance or the Kruskal–Wallis H test with Bonferroni-corrected pairwise comparisons.

Results: Significant between-model differences were observed in output efficiency and textual characteristics (all P < 0.001). ChatGPT-5 responded fastest (15.92 ± 4.48 s), whereas Wenxinyiyan 4.5 Turbo and DeepSeek-V3.1 were slowest (41.89 ± 5.09 s and 38.20 ± 2.96 s). DeepSeek-V3.1 generated the longest answers (1396.37 ± 189.23 words), while Kimi produced the shortest (579.40 ± 182.96 words). Only ChatGPT-5 consistently generated structured tables (median 2.00, IQR 1.00–2.00). Content quality differed significantly across all five dimensions (H = 15.34–37.19, all P ≤ 0.004). ChatGPT-5 achieved the highest median scores for accuracy (5.00, IQR 4.00–5.00) and logical consistency (4.50, IQR 4.00–5.00), whereas Kimi showed the lowest accuracy (3.50, IQR 3.00–4.00). The intraclass correlation coefficient indicated good inter-rater reliability (0.87).

Conclusion: Performance of large language models in diabetic retinopathy patient consultations is model-dependent. ChatGPT-5 demonstrated the best overall usability, combining faster responses, clearer structure, and higher factual accuracy. Other Chinese-optimized models provided comparable professional information coverage but require improved accessibility and stability for safe patient-facing use.
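The comparison procedure described in the Methods (an omnibus Kruskal–Wallis H test followed by Bonferroni-corrected pairwise comparisons) can be sketched as follows. The rating data below are hypothetical placeholders, not the study's actual scores, and the pairwise test choice (Mann–Whitney U) is an assumption, since the abstract does not name the post-hoc test.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical 1-to-5 Likert accuracy ratings for three of the five models;
# the real per-question ratings are not reported in the abstract.
scores = {
    "ChatGPT-5": [5, 5, 4, 5, 4, 5],
    "DeepSeek-V3.1": [4, 4, 5, 3, 4, 4],
    "Kimi": [3, 4, 3, 3, 4, 3],
}

# Omnibus Kruskal-Wallis H test across all groups.
h_stat, p_value = kruskal(*scores.values())
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")

# Bonferroni-corrected pairwise comparisons (here: Mann-Whitney U tests,
# with the significance threshold divided by the number of pairs).
pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-corrected threshold
for a, b in pairs:
    _, p = mannwhitneyu(scores[a], scores[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at alpha = {alpha:.4f})")
```

With three groups there are three pairwise comparisons, so the corrected threshold is 0.05 / 3 ≈ 0.0167; in the study, five models yield ten pairwise comparisons and a stricter threshold of 0.005.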
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,357 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,221 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,640 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,482 citations