Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Comparative Assessment of Otolaryngology Knowledge Among Large Language Models
15
Zitationen
8
Autoren
2024
Jahr
Abstract
OBJECTIVE: The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and an open source model from Meta (Llama3:70b) in answering clinical test multiple choice questions in the field of otolaryngology-head and neck surgery. METHODS: A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers. RESULTS: GPT4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while llama3:70b, GPT3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions. Three hundred and sixty-nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT4 changed from incorrect to correct answer 31% of the time, while GPT3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively. CONCLUSION: Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding in this field makes it well-suited to serve in roles related to head and neck surgery education provided that the appropriate precautions are taken and potential limitations are understood. LEVEL OF EVIDENCE: NA Laryngoscope, 135:629-634, 2025.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.773 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.682 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8.242 Zit.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6.898 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.781 Zit.