OpenAlex · Updated hourly · Last updated: 02 May 2026, 08:43

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

How DISCERNing is ChatGPT? An Evaluation of Models and Prompt Engineering in Assessing Patient Education Materials

2026 · 0 citations · Applied Clinical Informatics
Open full text at publisher

Citations: 0 · Authors: 4 · Year: 2026

Abstract

Objectives: The objective of this study is to evaluate whether ChatGPT models can reliably apply the DISCERN instrument, a 16-question human-scored rubric developed in 1999 to evaluate consumer health information, and to assess the impact of prompting strategies, model choice, and scoring repeatability on agreement with human-derived DISCERN scores.

Methods: A PubMed search of "DISCERN" identified English-language studies since 2019 reporting exact webpage URLs with corresponding human-derived DISCERN scores. Archived versions of 42 webpages were retrieved. Three ChatGPT models (GPT-5.2, GPT-4o, and o3) were evaluated using four prompting strategies: "Naïve" zero-shot, item-level "Split" scoring, "Augmented" prompting with DISCERN guidance, and a "Combined" split-plus-augmented approach. Agreement with human scores was assessed using correlations and absolute differences. Repeatability was examined using 10 repeated scoring runs across 9 webpages.

Results: Agreement between ChatGPT-generated and human DISCERN scores was weak to moderate. All models demonstrated systematic score compression, overestimating low-quality webpages and underestimating high-quality webpages. Combined prompting modestly improved agreement and reduced absolute error, particularly for the o3 model, which consistently outperformed GPT-5.2 and GPT-4o. Substantial run-to-run variability was observed, with a mean score range of 17.5 points and ranges up to 43 points for the same webpage. Averaging scores across runs did not improve agreement with human ratings. ChatGPT's DISCERN scoring reflects systematic attenuation consistent with prediction under noisy subjective measurement. Prompt engineering did not correct calibration bias or reproducibility limitations.

Conclusion: Under the prompting strategies evaluated, ChatGPT models were insufficient for reliable automated DISCERN scoring. Persistent attenuation bias and poor repeatability significantly limit clinical or research applicability.
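The agreement analysis described above (correlation plus absolute difference between model-generated and human DISCERN totals) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the score values below are invented, chosen only to mimic the "score compression" pattern the abstract reports.

```python
# Minimal sketch of the agreement metrics named in the abstract:
# Pearson correlation and mean absolute difference between
# human-derived and model-generated DISCERN totals.
# All score values here are hypothetical.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

# Hypothetical total DISCERN scores (16 items, 1-5 each, so 16-80).
human = [28, 35, 42, 55, 63, 71]
model = [41, 44, 46, 52, 55, 58]  # compressed toward the middle of the scale

r = pearson(human, model)
mad = mean(abs(h - m) for h, m in zip(human, model))
print(f"r = {r:.2f}, mean absolute difference = {mad:.1f}")
```

Note how the two metrics can disagree: the compressed model scores track the human ranking almost perfectly (high correlation) while still missing the true totals by several points on average, which is exactly why the study reports absolute differences alongside correlations.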

Topics

Artificial Intelligence in Healthcare and Education · Health Literacy and Information Accessibility · Social Media in Health Education