OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 18.05.2026, 16:14

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

2025·2 Zitationen·Digital HealthOpen Access
Volltext beim Verlag öffnen

2

Zitationen

20

Autoren

2025

Jahr

Abstract

Importance Compares the responses of four AI models to common nephrology-related questions encountered in clinical settings. Objective To evaluate generative AI models in enhancing nephrology patient communication and education. Design Generative AI in Nephrology Setting In a study conducted from December 8–12, 2023, and October 21–23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each with 21 nephrology questions and three renal biopsy reports, repeated for consistency. Intervention(s) (for clinical trials) or Exposure(s) (for observational studies) None. Main Outcome(s) and Measure(s) Fifteen nephrologists and one nephrology researcher assessed responses for Appropriateness, Helpfulness, Consistency, and human-like empathy, with rating scale (1–4). Using Shapiro–Wilk and Mann–Whitney U tests with Holm correction, along with TF-IDF, BertScore, and ROUGE were used. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related questions. Results GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness. For Kidney-Related Queries, GPT-4o excelled in relevance metrics, achieving the highest BertScore (0.57) and ROUGE for one-word metrics (0.54). Gemini 1.0 Ultra led in generating coherent responses for Renal Biopsy Reports with the highest TF-IDF (0.56) and ROUGE for longest similar sentences (0.47). All 101 references provided by GPT-4 were 100% accurate. Conclusions and Relevance GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.

Ähnliche Arbeiten