
This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Incorporating information retrieval into AI chatbots for patient education on thyroid eye disease

2025 · 2 citations · International Journal of Medical Informatics · Open Access

Citations: 2 · Authors: 7 · Year: 2025

Abstract

PURPOSE: To evaluate the performance of general-purpose, retrieval-augmented, and medicine-specific AI chatbots in answering common thyroid eye disease (TED) patient questions.

DESIGN: Cross-sectional comparative evaluation. Online TED forum discussion posts were collected and synthesized into 15 representative patient questions across five groups spanning clinical (treatment, diagnosis, management, epidemiology) and non-clinical topics, grouped into three difficulty levels. Responses generated by three different large language models (LLMs) were randomized and anonymized for blinded assessment.

SUBJECTS, PARTICIPANTS, AND/OR CONTROLS: Three oculoplastic surgeons evaluated clinical metrics; three medical students assessed non-clinical metrics.

METHODS: Three AI models generated responses: GPT-4o-mini (ChatGPT), a retrieval-augmented generation model grounded in TED literature (ChatGPT-RAG), and a specially trained LLM for healthcare professionals (OpenEvidence). Blinded raters assessed the randomized responses. Statistical analysis used paired Wilcoxon signed-rank tests, with Hedges' g for effect sizes.

MAIN OUTCOMES MEASURED: Clinical evaluation of responses was conducted using a 7-point Likert scale for relevance, accuracy, balance, and scope. Non-clinical metrics of empathy, understandability, and readability were also assessed using validated tools.

RESULTS: OpenEvidence significantly outperformed both ChatGPT (mean clinical score 5.96 vs 4.94; Hedges' g = 1.21, P < 0.001) and ChatGPT-RAG (5.96 vs 5.55; g = 0.53, P < 0.001) in clinical rankings and across most clinical metrics, including accuracy and relevance. However, performance patterns reversed for non-clinical metrics, with ChatGPT consistently outperforming the specialized models in empathy, understandability, and actionability (18.4 vs 14.96 for OpenEvidence; g = 1.25, P < 0.001). Across both domains, ChatGPT-RAG achieved intermediate performance, trailing OpenEvidence more closely on clinical metrics (g = 0.53) and ChatGPT more closely on non-clinical metrics (g = 0.44). Limitations include a modest sample of raters and the use of questions synthesized from online forums, both of which may affect generalizability.

CONCLUSIONS: Specialized medical AI models may have better clinical accuracy, while general-purpose models may perform better in patient communication and accessibility. The development of retrieval-augmented generation-based approaches that combine clinical precision with effective communication represents a promising direction for AI-powered patient education in TED and, potentially, other complex conditions.
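To make the statistical approach named in the abstract concrete, the following is a minimal sketch of a paired Wilcoxon signed-rank test and a Hedges' g effect size. It is not the authors' analysis code. The scores are invented for illustration and are not study data. The abstract does not say which Hedges' g variant or which p-value method was used, so the pooled-SD form with small-sample correction and the normal approximation (no tie correction, zero differences dropped) are assumptions.

```python
import math
from statistics import mean, stdev


def hedges_g(a, b):
    """Hedges' g from the pooled SD of two samples (assumed variant)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(
        ((n1 - 1) * stdev(a) ** 2 + (n2 - 1) * stdev(b) ** 2) / df
    )
    cohens_d = (mean(a) - mean(b)) / pooled_sd
    # Small-sample bias correction that turns Cohen's d into Hedges' g.
    return cohens_d * (1 - 3 / (4 * df - 1))


def wilcoxon_signed_rank(a, b):
    """Return (W, two-sided p) for paired samples.

    Zero differences are dropped, tied |differences| get average ranks,
    and p comes from a normal approximation without tie correction.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Hypothetical 7-point Likert scores for the same 10 responses under two
# models. These numbers are made up and do not come from the study.
model_a = [6, 6, 5, 7, 6, 6, 5, 6, 7, 6]
model_b = [5, 4, 5, 5, 4, 6, 4, 5, 5, 5]

w, p = wilcoxon_signed_rank(model_a, model_b)
g = hedges_g(model_a, model_b)
print(f"W = {w}, p = {p:.4f}, Hedges' g = {g:.2f}")
```

For small samples or heavily tied Likert data, the normal approximation is crude; a library routine such as `scipy.stats.wilcoxon` would normally be preferred over this hand-rolled version.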

Similar works