This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability
Citations: 0
Authors: 5
Year: 2025
Abstract
Aim: With the advent of the internet, patients are increasingly turning to online resources and AI-powered chatbots for answers to their questions, particularly regarding endodontic treatment. The aim of this study was to compare three large language models (LLMs) (ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1) in terms of the accuracy, understandability, and readability of their answers to frequently asked endodontic questions.

Methodology: Thirty open-ended frequently asked questions were generated from selected topics using the AlsoAsked and AnswerThePublic websites. The questions were administered on September 2, 2025, in separate reset sessions for each model, with a single response allowed per question. Two experienced endodontists independently scored the accuracy of the responses in a double-blind manner using a 5-point Likert scale. Understandability was analyzed using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC), and group comparisons were performed with ANOVA or Kruskal-Wallis tests depending on normality, followed by post-hoc Dunn-Bonferroni tests.

Results: Inter-rater agreement was excellent (accuracy ICC: 0.908–0.917; reliability ICC: 0.992–0.995; all p<0.001). A significant difference was found between the models in accuracy (p<0.001): DeepSeek-V3.1 (4.63±0.81) scored highest, performing significantly better than ChatGPT-5 (3.93±0.79) and Gemini 2.5 Flash (3.67±0.76), while ChatGPT-5 and Gemini 2.5 Flash did not differ (p>0.05). Understandability (PEMAT-P) scores were similar (p=0.683), and all models scored above 70% (ChatGPT-5, 77.46%; Gemini 2.5 Flash, 76.04%; DeepSeek-V3.1, 77.57%). The readability metrics differed: DeepSeek-V3.1 scored higher than ChatGPT-5 on FRES (p=0.044); Gemini 2.5 Flash scored higher than DeepSeek-V3.1 on FKGL (p=0.001); on GFI, Gemini 2.5 Flash scored higher than both ChatGPT-5 (p=0.036) and DeepSeek-V3.1 (p<0.001); on SMOG, Gemini 2.5 Flash scored higher than DeepSeek-V3.1 (p=0.003); and on CLI, ChatGPT-5 scored higher than DeepSeek-V3.1 (p=0.004). No significant correlation was found between readability and understandability (p>0.05).

Conclusions: For patient questions related to endodontics, DeepSeek-V3.1 outperformed ChatGPT-5 and Gemini 2.5 Flash in accuracy. While all models scored similarly above the PEMAT-P understandability threshold (70%), readability metrics differed significantly, and no model consistently reached the recommended 6th–8th grade reading level. LLMs can support patient education and communication; however, dentist oversight is required to confirm the clinical accuracy and individual suitability of responses, and chatbots should not be used as sole decision-makers.

How to cite this article: Taşyürek M, Adıgüzel Ö, Gündoğar M, Goncharuk-Khomyn M, Ortaç H. Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. Int Dent Res. 2025;15(Advanced Online). doi:10.5577/indentres.662
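The five readability indices named in the methodology are standard closed-form formulas over sentence, word, syllable, and letter counts. As a minimal illustrative sketch (the abstract does not specify which tool the authors used, and the vowel-group syllable counter here is a crude heuristic, so scores will only approximate those of dedicated readability software), the following Python function computes all five from raw response text:

```python
import math
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, discount a trailing silent 'e'."""
    word = word.lower()
    groups = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and groups > 1:
        groups -= 1
    return max(1, groups)

def readability_scores(text: str) -> dict:
    """Compute FRES, FKGL, GFI, SMOG, and CLI from raw response text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("no words found")
    w = len(words)
    letters = sum(1 for token in words for ch in token if ch.isalpha())
    syllables = sum(count_syllables(token) for token in words)
    complex_words = sum(1 for token in words if count_syllables(token) >= 3)

    wps = w / sentences        # average words per sentence
    spw = syllables / w        # average syllables per word
    L = 100.0 * letters / w    # letters per 100 words (for CLI)
    S = 100.0 * sentences / w  # sentences per 100 words (for CLI)

    return {
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        "GFI": 0.4 * (wps + 100.0 * complex_words / w),
        "SMOG": 1.0430 * math.sqrt(complex_words * 30.0 / sentences) + 3.1291,
        "CLI": 0.0588 * L - 0.296 * S - 15.8,
    }

print(readability_scores("Root canal treatment removes infected pulp. "
                         "The tooth is then cleaned, shaped, and sealed."))
```

Note that FRES is higher-is-easier, while FKGL, GFI, SMOG, and CLI estimate a school grade level, so higher values indicate harder text; the "higher" scores reported in the results therefore correspond to less readable output on the grade-level indices.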
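The statistical workflow described in the abstract (normality-dependent choice between ANOVA and Kruskal-Wallis, followed by Dunn-Bonferroni post-hoc comparisons) also maps onto standard Python libraries. A minimal sketch follows, assuming a hypothetical long-format table with "model" and "accuracy" columns and synthetic demo scores; the authors' actual analysis code and software are not part of this page. (The inter-rater ICCs, not shown here, could be computed with, e.g., pingouin.intraclass_corr.)

```python
import numpy as np
import pandas as pd
import scikit_posthocs as sp
from scipy import stats

def compare_models(df: pd.DataFrame, value: str = "accuracy") -> None:
    """Normality check, then ANOVA or Kruskal-Wallis, then Dunn-Bonferroni."""
    groups = [g[value].to_numpy() for _, g in df.groupby("model")]

    # Shapiro-Wilk per group decides between parametric and rank-based tests.
    normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
    if normal:
        stat, p = stats.f_oneway(*groups)
        print(f"ANOVA: F={stat:.3f}, p={p:.4f}")
    else:
        stat, p = stats.kruskal(*groups)
        print(f"Kruskal-Wallis: H={stat:.3f}, p={p:.4f}")

    # Pairwise post-hoc Dunn tests with Bonferroni correction.
    if p < 0.05:
        print(sp.posthoc_dunn(df, val_col=value, group_col="model",
                              p_adjust="bonferroni"))

# Illustrative synthetic scores only -- not the study's data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "model": ["ChatGPT-5"] * 30 + ["Gemini 2.5 Flash"] * 30
             + ["DeepSeek-V3.1"] * 30,
    "accuracy": np.concatenate([rng.normal(3.9, 0.8, 30),
                                rng.normal(3.7, 0.8, 30),
                                rng.normal(4.6, 0.8, 30)]),
})
compare_models(demo)
```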
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,513 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,407 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,882 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,571 citations