This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Feasibility and exploratory assessment of large language models for pediatric dentistry queries: a comparative study
Citations: 0
Authors: 8
Year: 2026
Abstract
Background: Large Language Models (LLMs) are increasingly used by caregivers to obtain pediatric health information. However, concerns persist regarding the accuracy, reliability, and readability of AI-generated content, especially in pediatric dentistry, where caregiver comprehension is crucial. Objective: To conduct an exploratory feasibility assessment of the accuracy, quality, reliability, and readability of responses generated by ChatGPT-4, Google Gemini, and DeepSeek to common pediatric dentistry queries. Methods: This exploratory comparative cross-sectional feasibility study used 15 patient-oriented pediatric dentistry questions identified through structured searches and expert screening. Each question was submitted verbatim to ChatGPT-4, Gemini, and DeepSeek under standardized conditions. Responses were independently evaluated by three calibrated pediatric dentistry experts using the Global Quality Scale (GQS), a modified DISCERN tool, and the Accuracy of Information Index (AOI). Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Inter-examiner reliability was assessed using intraclass correlation coefficients (ICC). Statistical comparisons between LLMs were performed using a linear mixed-effects model with post-hoc pairwise analysis, and inter-examiner agreement was further evaluated using Bland–Altman analysis. A p-value of <0.05 was considered statistically significant. Results: Overall scoring was consistent across examiners, with minor variability observed across domains. A linear mixed-effects model fitted separately for each domain showed that LLM type significantly influenced GQS scores (F = 7.90, p < 0.01), with Gemini and DeepSeek outperforming ChatGPT. No significant differences were observed for AOI (p = 0.44) or DISCERN (p = 0.06).
Bland–Altman analysis indicated minimal inter-examiner bias; however, the limits of agreement were relatively wide given the scale range, reflecting variability between individual ratings. The single-measure ICC indicated poor agreement (ICC = 0.26), whereas reliability was substantially higher when scores were averaged across examiners (ICC = 0.90). Conclusion: This study offers an exploratory feasibility assessment of LLM evaluation in pediatric dentistry. While the models generally produced high-quality outputs, variations in accuracy and readability, together with significant inter-examiner variability, highlight important methodological challenges. These findings represent preliminary groundwork and require validation in larger, clinically diverse, real-world settings. LLMs may serve as supportive informational tools; however, their outputs should be interpreted cautiously and used to complement, not replace, professional clinical judgment.
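The readability metrics named in the abstract (FRES and FKGL) are standard closed-form formulas over word, sentence, and syllable counts. As a rough illustration of how such scores are computed (the paper does not describe its tooling; the vowel-group syllable counter below is a naive assumption, and published tools use more elaborate rules):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups, minimum one per word.
    # This is an approximation; dedicated readability tools handle silent
    # 'e', diphthongs, and exceptions more carefully.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a plain-text English passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    # Standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas.
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl
```

Higher FRES indicates easier text (patient-facing material is often targeted at roughly 60 or above), while FKGL maps the same inputs onto an approximate US school grade level.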
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations