This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Benchmark analysis of myopia-related issues using large language models: a comparison of ChatGPT-4o and DeepSeek
Citations: 1
Authors: 5
Year: 2025
Abstract
OBJECTIVE: This study evaluated the accuracy and comprehensiveness of responses generated by ChatGPT-4o and DeepSeek regarding commonly asked questions about myopia.
METHODS: Thirty myopia-related questions spanning six clinical domains were submitted to both chatbots. Three medical professionals independently rated each response for accuracy and comprehensiveness. Inter-rater reliability was assessed using Fleiss' Kappa, and Shapiro-Wilk tests were conducted to examine normality in rating distributions. Statistical comparisons were performed using the Chi-square test, with significance set at p < 0.05.
RESULTS: DeepSeek outperformed ChatGPT-4o in overall accuracy, with significantly more responses rated as "Good" (p < 0.0001). Both models demonstrated high comprehensiveness scores when accuracy was rated "Good," though performance declined in treatment-related queries, particularly regarding commercial products like DIMS lenses. Fleiss' Kappa values indicated poor inter-rater agreement (DeepSeek: κ = 0.106; ChatGPT-4o: κ = -0.0221), and normality tests showed non-normal score distributions (p < 0.0001 across domains).
CONCLUSION: Both ChatGPT-4o and DeepSeek can deliver useful responses to myopia-related questions, though limitations remain in areas requiring up-to-date, region-specific treatment information. DeepSeek's stronger performance suggests that localized LLMs may offer competitive advantages. Ongoing refinement, regular data updates, and domain-specific fine-tuning are essential for improving the reliability of AI chatbots in clinical communication.
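The abstract reports inter-rater agreement via Fleiss' Kappa. The study's underlying rating table is not included here, but the statistic itself is straightforward to compute from a subjects × categories count table; the sketch below is a generic implementation with hypothetical illustrative data (three raters, two rating categories), not the paper's actual data.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects x categories count table.

    table[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(table)          # number of subjects (here: rated responses)
    n = sum(table[0])       # number of raters per subject
    k = len(table[0])       # number of rating categories

    # Per-subject observed agreement P_i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N

    # Chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)


# Hypothetical example: 4 responses, 3 raters, categories ("Good", "Not good")
ratings = [
    [3, 0],  # all three raters said "Good"
    [2, 1],
    [1, 2],
    [0, 3],  # all three raters said "Not good"
]
print(fleiss_kappa(ratings))  # → 0.333... (moderate chance-corrected agreement)
```

Values near 0, like the κ = 0.106 and κ = -0.0221 reported in the abstract, indicate agreement barely above (or below) what chance alone would produce.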
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,693 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,598 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,124 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,871 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations