Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis
13
Zitationen
3
Autoren
2024
Jahr
Abstract
PURPOSE: To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions. DESIGN: Meta-analysis. METHODS: Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed. RESULTS: Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95 % CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95 % CI: 0.48-0.56]). CONCLUSIONS: The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.764 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.674 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8.234 Zit.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6.898 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.781 Zit.