This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Comparative Evaluation of Large Language Model Chatbots in Endodontic Diagnosis Using Clinical Case Vignettes

2026 · 0 citations · Applied Sciences · Open Access

Abstract

Introduction: Large language models (LLMs) are increasingly applied in healthcare, yet their diagnostic accuracy in endodontics remains underexplored. This study evaluated the performance of four chatbots, namely OpenAI GPT-4o, Microsoft Copilot, Google Gemini, and Gemini Advanced (GeminiA), on endodontic case vignettes.

Methods: Seven clinical cases from the American Association of Endodontists newsletter were presented to each chatbot at two time points (June–July 2024). Fifty-six responses were collected and independently scored by two board-certified endodontists. Diagnostic accuracy (pulpal, apical, and overall) was recorded as a binary outcome. Response quality was assessed using a modified Global Quality Score (mGQS, 5-point), reasoning accuracy (6-point), and completeness (3-point). Secondary outcomes included readability (Flesch–Kincaid grade level) and word count. Mixed-effects models evaluated differences among chatbots.

Results: Overall diagnostic accuracy was 69.6% (39/56), with significant differences across chatbots (p = 0.006). GeminiA achieved the highest scores across all qualitative measures (mGQS 4.93 ± 0.27; reasoning 5.93 ± 0.27; completeness 3.0 ± 0.00). GPT-4o also demonstrated high performance (mGQS 4.71 ± 0.47; reasoning 5.64 ± 0.50; completeness 2.79 ± 0.43). Copilot consistently underperformed. Readability exceeded college level across chatbots, and word counts varied, with Copilot having the shortest and Gemini having the longest responses.

Conclusions: In this exploratory study, advanced LLMs, particularly GeminiA and GPT-4o, outperformed Copilot and Gemini in endodontic diagnosis and reasoning quality. However, these findings should be interpreted with caution, given the limited number of cases and use of publicly available datasets that may have been included in model training. Further validation using larger, de novo case sets is warranted before these tools can be recommended as adjuncts for education or clinical decision support.

Clinical significance: Large language model chatbots show promise as adjunctive tools for endodontic diagnosis. Understanding their strengths and limitations may help clinicians and students critically interpret AI-generated recommendations and support clinical decision-making.
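
The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the function names are illustrative assumptions, the design numbers (7 cases, 4 chatbots, 2 time points, 39 correct responses) are taken from the abstract, and the readability function uses the standard Flesch-Kincaid grade level formula with hypothetical word, sentence, and syllable counts.

```python
# Design figures as stated in the abstract.
N_CASES = 7
CHATBOTS = ["GPT-4o", "Copilot", "Gemini", "GeminiA"]
N_TIME_POINTS = 2
N_CORRECT = 39  # responses with a correct overall diagnosis


def total_responses(n_cases: int, n_chatbots: int, n_time_points: int) -> int:
    """Every case is shown to every chatbot at every time point."""
    return n_cases * n_chatbots * n_time_points


def accuracy_percent(correct: int, total: int) -> float:
    """Share of correct responses, as a percentage with one decimal."""
    return round(100 * correct / total, 1)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level computed from raw text counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


responses = total_responses(N_CASES, len(CHATBOTS), N_TIME_POINTS)
per_chatbot = N_CASES * N_TIME_POINTS

print(responses)                               # 56
print(per_chatbot)                             # 14
print(accuracy_percent(N_CORRECT, responses))  # 69.6

# Hypothetical counts for one response, for illustration only.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))
```

Each chatbot therefore contributes 14 of the 56 responses, which is a small sample for per-chatbot comparisons and is consistent with the caution expressed in the conclusions.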

Topics

Artificial Intelligence in Healthcare and Education
Clinical Reasoning and Diagnostic Skills
Dental Research and COVID-19
