This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Assessment of ChatGPT-4.0 versus ChatGPT-Mini in Generating Guideline-Based Hypertension Content
Citations: 0
Authors: 15
Year: 2026
Abstract

Background: Artificial intelligence (AI) language models are increasingly used to generate patient education materials. However, their accuracy, completeness, and adherence to clinical guidelines remain uncertain.

Objectives: To compare ChatGPT-Mini and ChatGPT-4.0 in generating hypertension education content with respect to accuracy, completeness, structural quality measured with the Ensuring Quality Information for Patients (EQIP) tool, response consistency, and alignment with established guidelines.

Methods: A standardized set of 31 hypertension-related questions was submitted to both models. Outputs were independently evaluated by 10 blinded clinicians using a modified EQIP score, a 5-point accuracy scale, and a 3-point completeness scale. Response consistency was assessed using BERTScore. Between-model comparisons were performed with the two-sided Wilcoxon rank-sum test (p < 0.05). Effect sizes were reported as Hodges–Lehmann (HL) median differences and Cliff's delta (δ), both with 95% CIs. Inter-rater reliability was estimated with the intraclass correlation coefficient (ICC; two-way random-effects model, absolute agreement).

Results: Central tendency measures favored ChatGPT-4.0, although differences were small. Median scores were as follows: accuracy, 4.10 (3.70-4.20) versus 3.73 (3.60-4.05); completeness, 1.26 (1.17-1.41) versus 1.10 (0.96-1.23); and total EQIP score, 19.5 (18.0-25.0) versus 18.5 (16.0-23.0) for ChatGPT-4.0 and ChatGPT-Mini, respectively. HL median differences were small, with 95% CIs crossing zero (accuracy: +0.37, −0.25 to +0.50; completeness: +0.16, −0.06 to +0.36; EQIP: +1.0, −1.0 to +6.0). Cliff's δ values were consistently small and positive across primary outcomes, indicating only modest stochastic dominance of ChatGPT-4.0. Identification clarity tended to be higher with ChatGPT-4.0, whereas response consistency measured by BERTScore F1 was generally higher for ChatGPT-Mini (> 0.92 versus 0.885-0.932). Inter-rater reliability was good to excellent across all measures (ICC > 0.80).

Conclusions: ChatGPT-4.0 demonstrated small, non-significant improvements in accuracy, completeness, and structural quality compared with ChatGPT-Mini. Effect sizes were modest, and all 95% CIs included zero. ChatGPT-Mini produced more consistent responses. These findings underscore the importance of routinely reporting effect sizes with 95% CIs and support the use of standardized evaluation methods and real-time validation frameworks for AI-generated medical education content.
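The two effect-size measures named in the Methods have simple definitions: Cliff's delta is the proportion of pairs where one group's value exceeds the other's minus the reverse proportion, and the Hodges–Lehmann difference is the median of all pairwise differences. A minimal sketch with hypothetical rating data (not the study's data) illustrates both:

```python
from statistics import median

def cliffs_delta(x, y):
    # δ = (#{x_i > y_j} − #{x_i < y_j}) / (len(x) * len(y)); range −1 to +1
    gt = sum(1 for a in x for b in y if a > b)
    lt = sum(1 for a in x for b in y if a < b)
    return (gt - lt) / (len(x) * len(y))

def hodges_lehmann(x, y):
    # HL shift estimate: median of all pairwise differences x_i − y_j
    return median(a - b for a in x for b in y)

# hypothetical 5-point accuracy ratings for each model (illustration only)
gpt4 = [4.1, 3.9, 4.2, 3.7, 4.0]
mini = [3.7, 3.6, 4.0, 3.8, 3.9]
print(cliffs_delta(gpt4, mini))   # positive values favor the first group
print(hodges_lehmann(gpt4, mini))
```

A positive δ with a confidence interval crossing zero, as reported above, indicates a tendency for ChatGPT-4.0's ratings to exceed ChatGPT-Mini's without a statistically conclusive difference.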
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,393 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,259 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,688 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,502 citations
Authors
- Romullo José Costa Ataídes
- Marcos Adriano Garcia Campos
- João Vítor Perez de Souza
- Rafael Cardoso Rocha
- Almir Alamino Lacalle
- Ciro Bezerra Vieira
- Thiago Artioli
- Tiago Cordeiro Medeiros
- Erito Marques de Souza Filho
- Ronaldo Altenburg Gismondi
- Érika Maria Gonçalves Campana
- Francisco José Romeo
- Victor Razuk
- João Ricardo Nickenig Vissoci
- Renato Delascio Lopes
Institutions
- Universidade de São Paulo (BR)
- Universidade Brasil (BR)
- Duke University (US)
- Instituto Dante Pazzanese de Cardiologia (BR)
- Universidade Federal do Maranhão (BR)
- Universidade Federal Fluminense (BR)
- Universidade Federal Rural do Rio de Janeiro (BR)
- Universidade do Estado do Rio de Janeiro (BR)
- University of Miami (US)
- Clinical Research Institute (US)