This is an overview page with metadata for this scientific article. The full article is available from the publisher.
How DISCERNing is ChatGPT? An Evaluation of Models and Prompt Engineering in Assessing Patient Education Materials
Citations: 0
Authors: 4
Year: 2026
Abstract
Objectives: To evaluate whether ChatGPT models can reliably apply the DISCERN instrument, a 16-question human-scored rubric developed in 1999 to evaluate consumer health information, and to assess the impact of prompting strategy, model choice, and scoring repeatability on agreement with human-derived DISCERN scores.

Methods: A PubMed search for "DISCERN" identified English-language studies published since 2019 that reported exact webpage URLs with corresponding human-derived DISCERN scores. Archived versions of 42 webpages were retrieved. Three ChatGPT models (GPT-5.2, GPT-4o, and o3) were evaluated using four prompting strategies: "Naïve" zero-shot prompting, item-level "Split" scoring, "Augmented" prompting with DISCERN guidance, and a "Combined" split-plus-augmented approach. Agreement with human scores was assessed using correlations and absolute differences. Repeatability was examined across 10 repeated scoring runs on 9 webpages.

Results: Agreement between ChatGPT-generated and human DISCERN scores was weak to moderate. All models showed systematic score compression, overestimating low-quality webpages and underestimating high-quality ones. Combined prompting modestly improved agreement and reduced absolute error, particularly for the o3 model, which consistently outperformed GPT-5.2 and GPT-4o. Substantial run-to-run variability was observed, with a mean score range of 17.5 points and ranges of up to 43 points for the same webpage. Averaging scores across runs did not improve agreement with human ratings. ChatGPT's DISCERN scoring reflects systematic attenuation consistent with prediction under noisy subjective measurement, and prompt engineering did not correct the calibration bias or reproducibility limitations.

Conclusion: Under the prompting strategies evaluated, ChatGPT models were insufficient for reliable automated DISCERN scoring. Persistent attenuation bias and poor repeatability substantially limit clinical and research applicability.
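The agreement analysis described above (correlation plus absolute differences between model and human scores) can be sketched as follows. This is a minimal illustration, not the study's actual code; the score values are invented placeholders chosen to mimic the reported compression pattern, not data from the paper.

```python
# Sketch: compare model-generated DISCERN totals with human-derived totals
# using a Pearson correlation and the mean absolute difference.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative (invented) DISCERN totals; the model scores are deliberately
# compressed toward the middle, as the abstract reports.
human = [30, 45, 52, 61, 70]
model = [41, 48, 50, 55, 58]

r = pearson_r(human, model)
mad = mean(abs(h - m) for h, m in zip(human, model))
print(f"Pearson r = {r:.2f}, mean absolute difference = {mad:.1f}")
```

Note that a high correlation can coexist with large absolute error, which is why the study reports both: compression preserves rank order while still misestimating low- and high-quality pages.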
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,553 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,444 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,943 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations