OpenAlex · Updated hourly · Last updated: Apr 6, 2026, 01:51

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessment of ChatGPT-4.0 versus ChatGPT-Mini in Generating Guideline-Based Hypertension Content

2026 · 0 citations · 15 authors · Arquivos Brasileiros de Cardiologia · Open Access
Open full text at publisher

Abstract

Background: Artificial intelligence (AI) language models are increasingly used to generate patient education materials. However, their accuracy, completeness, and adherence to clinical guidelines remain uncertain.

Objectives: To compare ChatGPT-Mini and ChatGPT-4.0 in the generation of hypertension education content with respect to accuracy, completeness, structural quality (using the Ensuring Quality Information for Patients [EQIP] tool), response consistency, and alignment with established guidelines.

Methods: A standardized set of 31 hypertension-related questions was submitted to both models. Outputs were independently evaluated by 10 blinded clinicians using a modified EQIP score, a 5-point accuracy scale, and a 3-point completeness scale. Response consistency was assessed using BERTScore. Between-model comparisons were performed using the two-sided Wilcoxon rank-sum test (p < 0.05). Effect sizes were reported as Hodges–Lehmann (HL) median differences and Cliff’s delta (δ), both with 95% CIs. Inter-rater reliability was estimated using the intraclass correlation coefficient (ICC; two-way random-effects model, absolute agreement).

Results: Central-tendency measures favored ChatGPT-4.0, although differences were small. Median scores were as follows: accuracy, 4.10 (3.70-4.20) versus 3.73 (3.60-4.05); completeness, 1.26 (1.17-1.41) versus 1.10 (0.96-1.23); and total EQIP score, 19.5 (18.0-25.0) versus 18.5 (16.0-23.0) for ChatGPT-4.0 and ChatGPT-Mini, respectively. HL median differences were small, with 95% CIs crossing zero (accuracy: +0.37, −0.25 to +0.50; completeness: +0.16, −0.06 to +0.36; EQIP: +1.0, −1.0 to +6.0). Cliff’s δ values were consistently small and positive across primary outcomes, indicating only modest stochastic dominance of ChatGPT-4.0. Identification clarity tended to be higher with ChatGPT-4.0, whereas response consistency measured by BERTScore F1 was generally higher for ChatGPT-Mini (> 0.92 versus 0.885-0.932). Inter-rater reliability was good to excellent across all measures (ICC > 0.80).

Conclusions: ChatGPT-4.0 demonstrated small, non-significant improvements in accuracy, completeness, and structural quality compared with ChatGPT-Mini. Effect sizes were modest, and all 95% CIs included zero. ChatGPT-Mini produced more consistent responses. These findings underscore the importance of routinely reporting effect sizes with 95% CIs and support the use of standardized evaluation methods and real-time validation frameworks for AI-generated medical education content.
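For readers who want to see how the between-model statistics named in the Methods fit together, the sketch below is a minimal, illustrative Python example (not from the paper): a two-sided Wilcoxon rank-sum test, the Hodges–Lehmann median difference, and Cliff's delta. The score arrays are synthetic stand-ins, and variable names such as gpt4 and mini are hypothetical.

```python
# Illustrative sketch of the abstract's between-model comparisons,
# run on synthetic per-question accuracy scores (31 questions per model).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gpt4 = rng.normal(4.0, 0.3, 31)  # hypothetical ChatGPT-4.0 scores
mini = rng.normal(3.8, 0.3, 31)  # hypothetical ChatGPT-Mini scores

# Two-sided Wilcoxon rank-sum test, as described in the Methods.
statistic, p_value = stats.ranksums(gpt4, mini, alternative="two-sided")

# All pairwise differences between the two samples.
pairwise = gpt4[:, None] - mini[None, :]

# Hodges-Lehmann estimator: median of all pairwise differences.
hl = np.median(pairwise)

# Cliff's delta: P(X > Y) - P(X < Y) over all pairs.
delta = ((pairwise > 0).sum() - (pairwise < 0).sum()) / pairwise.size

print(f"p = {p_value:.3f}, HL = {hl:+.2f}, Cliff's delta = {delta:+.2f}")
```

The 95% CIs reported in the Results could be obtained by bootstrapping these pairwise statistics; that step is omitted here for brevity.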

Similar works