This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Incorporating information retrieval into AI chatbots for patient education on thyroid eye disease
2
Citations
7
Authors
2025
Year
Abstract
PURPOSE: To evaluate the performance of general-purpose, retrieval-augmented, and medicine-specific AI chatbots in answering common thyroid eye disease (TED) patient questions.
DESIGN: Cross-sectional comparative evaluation. Online TED forum discussion posts were collected and synthesized into 15 representative patient questions across five groups spanning clinical (treatment, diagnosis, management, epidemiology) and non-clinical topics, grouped into three difficulty levels. Responses generated by three different large language models (LLMs) were randomized and anonymized for blinded assessment.
SUBJECTS, PARTICIPANTS, AND/OR CONTROLS: Three oculoplastic surgeons evaluated clinical metrics; three medical students assessed non-clinical metrics.
METHODS: Three AI models generated responses: GPT-4o-mini (ChatGPT), a retrieval-augmented generation model grounded in TED literature (ChatGPT-RAG), and a specially trained LLM for healthcare professionals (OpenEvidence). Blinded raters assessed the randomized responses. Statistical analysis used paired Wilcoxon signed-rank tests, with Hedges' g for effect sizes.
MAIN OUTCOMES MEASURED: Clinical evaluation of responses used a 7-point Likert scale for relevance, accuracy, balance, and scope. Non-clinical metrics of empathy, understandability, and readability were also assessed using validated tools.
RESULTS: OpenEvidence significantly outperformed both ChatGPT (mean clinical score 5.96 vs 4.94; Hedges' g = 1.21, P < 0.001) and ChatGPT-RAG (5.96 vs 5.55; g = 0.53, P < 0.001) in clinical rankings and across most clinical metrics, including accuracy and relevance. However, performance patterns reversed for non-clinical metrics, with ChatGPT consistently outperforming the specialized models in empathy, understandability, and actionability (18.4 vs 14.96 for OpenEvidence; g = 1.25, P < 0.001). Across both domains, ChatGPT-RAG achieved intermediate performance: it trailed OpenEvidence on clinical metrics (g = 0.53) and trailed ChatGPT on non-clinical metrics (g = 0.44). Limitations include a modest sample of raters and the use of questions synthesized from online forums, both of which may affect generalizability.
CONCLUSIONS: Specialized medical AI models may have better clinical accuracy, while general-purpose models may perform better in patient communication and accessibility. The development of retrieval-augmented generation-based approaches that combine clinical precision with effective communication represents a promising direction for AI-powered patient education in TED and, potentially, other complex conditions.
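The abstract reports effect sizes as Hedges' g but does not say which variant was computed. As a rough illustration only, the sketch below uses the common pooled-standard-deviation form with the small-sample correction factor 1 - 3/(4·df - 1), where df = n1 + n2 - 2. The function name `hedges_g` and the example ratings are hypothetical and are not taken from the study; the paired Wilcoxon signed-rank test is not shown.

```python
import math
from statistics import mean, variance


def hedges_g(a, b):
    """Hedges' g for two samples (assumed variant: pooled SD, small-sample correction)."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    df = n1 + n2 - 2
    # Pooled variance from the two unbiased sample variances.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    cohens_d = (mean(a) - mean(b)) / math.sqrt(pooled_var)
    # Approximate bias correction that shrinks d slightly for small samples.
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d * correction


if __name__ == "__main__":
    # Hypothetical 7-point Likert ratings for two models (not study data).
    model_a = [6, 6, 5, 7, 6]
    model_b = [5, 4, 5, 6, 5]
    print(f"Hedges' g = {hedges_g(model_a, model_b):.3f}")
```

A positive value means the first sample has the higher mean. Paired designs are sometimes summarized with a different standardizer (for example, the standard deviation of the differences), so values from this sketch need not match the ones reported above.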
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,764 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,674 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,234 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,898 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations