Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Assessing the accuracy and explainability of using ChatGPT to evaluate the quality of health news

2025·4 Zitationen·BMC Public HealthOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

With the growing prevalence of health misinformation online, there is an urgent need for tools that can reliably assist the public in evaluating the quality of health information. This study investigates the performance of GPT-3.5-Turbo, a representative and widely used large language model (LLM), in rating the quality of health news and providing explanatory justification for the rating assessment. We evaluated GPT-3.5-Turbo’s performance on 3222 health news articles from an expert-annotated dataset compiled by HealthNewsReview.org, which assesses the quality of health news across nine criteria. GPT-3.5-Turbo was prompted with standardized queries tailored to each criterion. We measured its rating performance using 95% confidence intervals for precision, recall, and F1 scores in binary classification (satisfactory/not satisfactory). Additionally, linguistic complexity, readability, and the quality of GPT-3.5-Turbo’s explainability were assessed through both quantitative linguistic analysis and qualitative evaluation of consistency and contextual relevance. GPT-3.5-Turbo’s rating performance varied across criteria, with the highest accuracy for the Cost criterion (F1 = 0.824) but lower accuracy for Benefit, Conflict, and Quality criteria (F1 < 0.5), underperforming compared to traditional supervised machine learning models. However, its explanations were clear, with readability suited to late high school or early college levels and scored highly for consistency (average score: 2.90/3) and contextual relevance (average score: 2.73/3). These findings highlight GPT-3.5-Turbo’s strength in providing understandable and contextually relevant explanations, despite that its rating accuracy is limited. While GPT-3.5-Turbo’s rating accuracy requires improvement, its strength in offering comprehensible and contextually relevant explanations presents a valuable opportunity to enhance public understanding of health news quality. Leveraging LLMs as complementary tools for health literacy initiatives could help mitigate misinformation by facilitating non-expert audiences to interpret and assess health information.

Autoren

Institutionen

Themen

Artificial Intelligence in Healthcare and EducationCOVID-19 diagnosis using AIMisinformation and Its Impacts

Volltext beim Verlag öffnen

Assessing the accuracy and explainability of using ChatGPT to evaluate the quality of health news

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen