This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Feasibility and exploratory assessment of large language models for pediatric dentistry queries: a comparative study
Citations: 0
Authors: 8
Year: 2026
Abstract
Background: Large Language Models (LLMs) are increasingly used by caregivers to obtain pediatric health information. However, concerns persist regarding the accuracy, reliability, and readability of AI-generated content, especially in pediatric dentistry, where caregiver comprehension is crucial. Objective: To conduct an exploratory feasibility assessment of the accuracy, quality, reliability, and readability of responses generated by ChatGPT-4, Google Gemini, and DeepSeek to common pediatric dentistry queries. Methods: This exploratory comparative cross-sectional feasibility study used 15 patient-oriented pediatric dentistry questions identified through structured searches and expert screening. Each question was submitted verbatim to ChatGPT-4, Gemini, and DeepSeek under standardized conditions. Responses were independently evaluated by three calibrated pediatric dentistry experts using the Global Quality Scale (GQS), a modified DISCERN tool, and the Accuracy of Information Index (AOI). Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Inter-examiner reliability was assessed using intraclass correlation coefficients (ICC). Statistical comparisons between LLMs were performed using a linear mixed-effects model with post-hoc pairwise analysis, and inter-examiner agreement was further evaluated using Bland–Altman analysis. A p-value of <0.05 was considered statistically significant. Results: Overall scoring was consistent across examiners, with minor variability observed across domains. A linear mixed-effects model fitted separately for each domain showed that LLM type significantly influenced GQS scores (F = 7.90, p < 0.01), with Gemini and DeepSeek outperforming ChatGPT. No significant differences were observed for AOI (p = 0.44) or DISCERN (p = 0.06).
Bland–Altman analysis indicated minimal inter-examiner bias; however, the limits of agreement were relatively wide given the scale range, reflecting variability between individual ratings. The single-measure ICC indicated poor agreement (ICC = 0.26), whereas reliability was substantially higher when scores were averaged across examiners (ICC = 0.90). Conclusion: This study offers an exploratory feasibility assessment of LLM evaluation in pediatric dentistry. While the models generally produced high-quality outputs, variations in accuracy and readability, together with significant inter-examiner variability, highlight important methodological challenges. These findings represent preliminary groundwork and require validation in larger, clinically diverse, real-world settings. LLMs may serve as supportive informational tools; however, their outputs should be interpreted cautiously and used to complement, not replace, professional clinical judgment.
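The readability metrics named in the abstract (FRES and FKGL) are standard closed-form formulas over word, sentence, and syllable counts. As a rough illustration of how such scores are computed (the paper does not describe its tooling; the vowel-group syllable counter below is a naive assumption, and published tools use more elaborate rules):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups, minimum one per word.
    # This is an approximation; dedicated readability tools handle silent
    # 'e', diphthongs, and exceptions more carefully.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a plain-text English passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    # Standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas.
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl
```

Higher FRES indicates easier text (patient-facing material is often targeted at roughly 60 or above), while FKGL maps the same inputs onto an approximate US school grade level.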
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations