This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Clinical reasoning with machines: evaluating the interpretive depth of AI in urological case assessments
Citations: 3
Authors: 5
Year: 2026
Abstract
Large language models (LLMs) are increasingly utilized as decision-support tools in medicine. However, their clinical reliability and applicability remain uncertain. This study compared the responses of ChatGPT-3.5, ChatGPT-4o, and Gemini 1.0 Pro to standardized urological clinical scenarios, as evaluated by blinded experts. This observational cross-sectional study included 75 urology specialists categorized by experience (< 10 years vs. ≥ 10 years). Participants independently and blindly rated anonymized AI-generated responses for 10 common urological cases using a 5-point Likert scale across four predefined domains: accuracy, reliability, clinical applicability, and interpretive depth. Normality was assessed with the Shapiro–Wilk test, and ANOVA or Kruskal–Wallis tests were used as appropriate, followed by post-hoc pairwise analyses. Inter-rater reliability was calculated using Cronbach’s α and Fleiss’ κ. Spearman correlation coefficients were computed to examine associations among rating domains. ChatGPT-4o achieved the highest mean scores across all domains, followed by Gemini 1.0 Pro and ChatGPT-3.5. Performance differences were statistically significant for all parameters (p < 0.05), with the largest gaps observed in accuracy (4.4 ± 0.48 vs. 4.0 ± 0.52 vs. 3.7 ± 0.56) and clinical applicability (4.2 ± 0.49 vs. 3.8 ± 0.51 vs. 3.5 ± 0.55). A moderate positive correlation was observed between accuracy and reliability (r = 0.50), while the previously reported negative correlation between reliability and interpretive depth was corrected to r = −0.18, indicating only a weak inverse relationship. Inter-rater agreement was high (Cronbach’s α = 0.84; Fleiss’ κ = 0.72). Newer-generation large language models, particularly ChatGPT-4o, showed higher performance scores for accuracy and clinical applicability in standardized urological decision-support scenarios. However, these findings should be interpreted with caution and require confirmation through repeated-measures or mixed-model analyses as well as validation in real-world clinical settings. Ongoing benchmarking of evolving AI systems remains important to monitor longitudinal improvements while ensuring safety, reliability, and appropriate clinical use.
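A minimal sketch of the statistical workflow the abstract outlines (Shapiro–Wilk normality check, ANOVA or Kruskal–Wallis omnibus test, Spearman correlation, Cronbach’s α, Fleiss’ κ). It assumes the unit of analysis is a raters × cases matrix of Likert scores per model and per domain; all data are synthetic placeholders and every variable name is illustrative, not taken from the study.

```python
# Hedged sketch of the analysis described in the abstract, not the authors'
# code. All data and variable names below are synthetic placeholders.
import numpy as np
from scipy import stats
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Simulated 5-point Likert ratings for one domain (e.g. accuracy):
# 75 raters x 10 cases per model, values drawn arbitrarily for illustration.
scores_gpt4o  = rng.integers(4, 6, size=(75, 10))
scores_gemini = rng.integers(3, 6, size=(75, 10))
scores_gpt35  = rng.integers(3, 5, size=(75, 10))

# Per-rater mean score for each model: the per-group samples to compare.
means = [s.mean(axis=1) for s in (scores_gpt4o, scores_gemini, scores_gpt35)]

# 1) Shapiro-Wilk normality check decides between ANOVA and Kruskal-Wallis,
#    mirroring "ANOVA or Kruskal-Wallis tests were used as appropriate".
if all(stats.shapiro(m).pvalue > 0.05 for m in means):
    stat, p = stats.f_oneway(*means)   # parametric omnibus test
else:
    stat, p = stats.kruskal(*means)    # non-parametric omnibus test
print(f"omnibus: stat={stat:.2f}, p={p:.4f}")

# 2) Spearman correlation between two rating domains; `reliability` here is
#    a fabricated stand-in for the second domain's per-rater means.
accuracy = means[0]
reliability = accuracy + rng.normal(0.0, 0.3, size=accuracy.shape)
rho, p_rho = stats.spearmanr(accuracy, reliability)

# 3) Cronbach's alpha across raters (raters treated as the "items").
def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_items, n_subjects) matrix of scores."""
    k = items.shape[0]
    item_vars = items.var(axis=1, ddof=1).sum()
    total_var = items.sum(axis=0).var(ddof=1)
    return k / (k - 1) * (1.0 - item_vars / total_var)

alpha = cronbach_alpha(scores_gpt4o)

# 4) Fleiss' kappa on the categorical ratings; aggregate_raters expects a
#    (n_subjects, n_raters) matrix of category codes.
table, _ = aggregate_raters(scores_gpt4o.T)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Spearman r={rho:.2f}, alpha={alpha:.2f}, kappa={kappa:.2f}")
```

A significant omnibus result would then be followed by post-hoc pairwise comparisons (e.g., Dunn’s test with a multiplicity correction after Kruskal–Wallis), as the abstract indicates.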
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,312 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,169 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,564 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,466 citations