This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Clinical reasoning with machines: evaluating the interpretive depth of AI in urological case assessments

2026 · 3 citations · 5 authors · BMC Urology · Open Access

Abstract

Large language models (LLMs) are increasingly utilized as decision-support tools in medicine. However, their clinical reliability and applicability remain uncertain. This study compared ChatGPT-3.5, ChatGPT-4o, and Gemini 1.0 Pro in responding to standardized urological clinical scenarios evaluated by blinded experts.

This observational cross-sectional study included 75 urology specialists categorized by experience (< 10 years vs. ≥ 10 years). Participants independently and blindly rated anonymized AI-generated responses for 10 common urological cases using a 5-point Likert scale across four predefined domains: accuracy, reliability, clinical applicability, and interpretive depth. Normality was assessed with the Shapiro–Wilk test, and ANOVA or Kruskal–Wallis tests were used as appropriate, followed by post-hoc pairwise analyses. Inter-rater reliability was calculated using Cronbach's α and Fleiss' κ. Spearman correlation coefficients were computed to examine associations among rating domains.

ChatGPT-4o achieved the highest mean scores across all domains, followed by Gemini 1.0 Pro and ChatGPT-3.5. Performance differences were statistically significant for all parameters (p < 0.05), with the largest gaps observed in accuracy (4.4 ± 0.48 vs. 4.0 ± 0.52 vs. 3.7 ± 0.56) and clinical applicability (4.2 ± 0.49 vs. 3.8 ± 0.51 vs. 3.5 ± 0.55). A moderate positive correlation was observed between accuracy and reliability (r = 0.50), while the previously reported negative correlation between reliability and interpretive depth was corrected to r = −0.18, indicating only a weak inverse relationship. Inter-rater agreement was high (Cronbach's α = 0.84; Fleiss' κ = 0.72).

Newer-generation large language models, particularly ChatGPT-4o, showed higher performance scores for accuracy and clinical applicability in standardized urological decision-support scenarios. However, these findings should be interpreted with caution and require confirmation through repeated-measures or mixed-model analyses, as well as validation in real-world clinical settings. Ongoing benchmarking of evolving AI systems remains important to monitor longitudinal improvements while ensuring safety, reliability, and appropriate clinical use.
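
The methods in the abstract describe a standard inferential pipeline: a normality check that selects between a parametric and a non-parametric group comparison, domain correlations, and two inter-rater reliability coefficients. As a rough illustration only, the Python sketch below wires those steps together on simulated 5-point Likert data; all values, group shapes, and variable names are hypothetical placeholders rather than the study's data, and the tests come from scipy and statsmodels.

import numpy as np
from scipy import stats
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)  # fixed seed; all data below are simulated

# Hypothetical per-rater mean scores for one domain: 75 raters x 10 cases per model.
scores = {
    "ChatGPT-4o":     rng.integers(4, 6, size=(75, 10)).mean(axis=1),
    "Gemini 1.0 Pro": rng.integers(3, 6, size=(75, 10)).mean(axis=1),
    "ChatGPT-3.5":    rng.integers(3, 5, size=(75, 10)).mean(axis=1),
}

# 1) Shapiro-Wilk normality check decides between ANOVA and Kruskal-Wallis.
if all(stats.shapiro(s).pvalue > 0.05 for s in scores.values()):
    stat, p = stats.f_oneway(*scores.values())   # parametric one-way ANOVA
else:
    stat, p = stats.kruskal(*scores.values())    # non-parametric alternative
print(f"group comparison: statistic={stat:.2f}, p={p:.4f}")

# 2) Spearman correlation between two rating domains (e.g. accuracy vs. reliability).
accuracy = rng.integers(1, 6, size=750)                                # hypothetical
reliability = np.clip(accuracy + rng.integers(-1, 2, size=750), 1, 5)  # hypothetical
rho, p_rho = stats.spearmanr(accuracy, reliability)
print(f"Spearman rho={rho:.2f}, p={p_rho:.4f}")

# 3) Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of totals),
#    here with raters treated as "items" across the 10 cases.
def cronbach_alpha(matrix):
    # matrix: subjects (cases) x items (raters)
    k = matrix.shape[1]
    return k / (k - 1) * (1 - matrix.var(axis=0, ddof=1).sum()
                          / matrix.sum(axis=1).var(ddof=1))

case_by_rater = rng.integers(1, 6, size=(10, 75))   # 10 cases rated by 75 raters
print(f"Cronbach's alpha={cronbach_alpha(case_by_rater):.2f}")

# 4) Fleiss' kappa: aggregate_raters converts raw labels (subjects x raters)
#    into the subjects x categories count table that fleiss_kappa expects.
table, _ = aggregate_raters(case_by_rater)
print(f"Fleiss' kappa={fleiss_kappa(table, method='fleiss'):.2f}")

Whether the parametric or non-parametric branch runs depends on the simulated data; the published analysis itself reports using ANOVA or Kruskal–Wallis "as appropriate", which is what the conditional mirrors.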

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)