This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative Evaluation of Large Language Model Chatbots in Endodontic Diagnosis Using Clinical Case Vignettes
Citations: 0
Authors: 5
Year: 2026
Abstract
Introduction: Large language models (LLMs) are increasingly applied in healthcare, yet their diagnostic accuracy in endodontics remains underexplored. This study evaluated the performance of four chatbots, OpenAI GPT-4o, Microsoft Copilot, Google Gemini, and Gemini Advanced (GeminiA), on endodontic case vignettes.
Methods: Seven clinical cases from the American Association of Endodontists newsletter were presented to each chatbot at two time points (June–July 2024). Fifty-six responses were collected and independently scored by two board-certified endodontists. Diagnostic accuracy (pulpal, apical, and overall) was recorded as a binary outcome. Response quality was assessed using a modified Global Quality Score (mGQS, 5-point), reasoning accuracy (6-point), and completeness (3-point). Secondary outcomes included readability (Flesch–Kincaid grade level) and word count. Mixed-effects models evaluated differences among chatbots.
Results: Overall diagnostic accuracy was 69.6% (39/56), with significant differences across chatbots (p = 0.006). GeminiA achieved the highest scores across all qualitative measures (mGQS 4.93 ± 0.27; reasoning 5.93 ± 0.27; completeness 3.0 ± 0.00). GPT-4o also demonstrated high performance (mGQS 4.71 ± 0.47; reasoning 5.64 ± 0.50; completeness 2.79 ± 0.43). Copilot consistently underperformed. Readability exceeded college level across chatbots, and word counts varied, with Copilot having the shortest and Gemini having the longest responses.
Conclusions: In this exploratory study, advanced LLMs, particularly GeminiA and GPT-4o, outperformed Copilot and Gemini in endodontic diagnosis and reasoning quality. However, these findings should be interpreted with caution, given the limited number of cases and use of publicly available datasets that may have been included in model training. Further validation using larger, de novo case sets is warranted before these tools can be recommended as adjuncts for education or clinical decision support.
Clinical significance: Large language model chatbots show promise as adjunctive tools for endodontic diagnosis. Understanding their strengths and limitations may help clinicians and students critically interpret AI-generated recommendations and support clinical decision-making.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,740 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,649 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,202 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,886 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations