This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Evaluation of large language models in assigning PI-RADS v2.1 categories for prostate MRI reports

2026 · 1 citation · BMC Urology · Open Access

Abstract

This study aimed to evaluate the performance of large language models (LLMs) in classifying prostate MRI reports according to the Prostate Imaging–Reporting and Data System (PI-RADS) version 2.1, and to validate their use in supporting clinical decisions in prostate cancer treatment.

This retrospective study included 146 patients. Four LLMs (GPT-4o, GPT-o1, Google Gemini 1.5 Pro and Google Gemini 2.0 Experimental Advanced) were tested on standardised, structured prostate MRI reports. Model output was compared against a two-radiologist consensus reference standard. Agreement was measured using weighted Cohen's kappa, and accuracy and F1 scores were calculated for three PI-RADS risk groups: low (1–2), intermediate (3) and high (4–5).

Performance varied by model. GPT-o1 achieved the highest level of agreement with radiologists (κ = 0.867), followed by GPT-4o (κ = 0.743), Gemini 1.5 Pro (κ = 0.728) and Gemini 2.0 Experimental Advanced (κ = 0.664). GPT-o1 achieved the highest F1 scores for the low-risk (0.93) and high-risk (1.00) groups, but only moderate performance for the PI-RADS 3 group (0.75). PI-RADS 3 was the weakest group for every model (F1 range: 0.54–0.75). Importantly, none of the models produced invalid results outside the target PI-RADS 1–5 range.

LLMs show potential for automating PI-RADS classification from MRI reports, with GPT-o1 demonstrating the best overall performance. However, their weak performance on PI-RADS 3 lesions indicates that multicentre validation, larger datasets and multimodality integration are needed before they can be used clinically for prostate cancer diagnosis and urological decision-making.

Clinical trial registration: not applicable. This retrospective study did not involve a clinical trial.
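The agreement and per-group metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not state which kappa weighting was used or whether kappa was computed on the five categories or the three risk groups, so linear weights over categories 1 to 5 are assumed here. The function names and the toy labels are hypothetical and are not study data.

```python
from collections import Counter

CATEGORIES = [1, 2, 3, 4, 5]  # valid PI-RADS categories


def risk_group(pirads):
    """Map a PI-RADS category to the three risk groups used in the abstract."""
    if pirads in (1, 2):
        return "low"
    if pirads == 3:
        return "intermediate"
    if pirads in (4, 5):
        return "high"
    raise ValueError(f"invalid PI-RADS category: {pirads}")


def weighted_kappa(reference, predicted, categories=CATEGORIES):
    """Cohen's kappa with linear weights (an assumption; see lead-in)."""
    n = len(reference)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    pairs = Counter((index[r], index[p]) for r, p in zip(reference, predicted))
    ref_counts = Counter(index[r] for r in reference)
    pred_counts = Counter(index[p] for p in predicted)
    # Disagreement weight grows linearly with the distance between categories.
    observed = sum(abs(i - j) / (k - 1) * c for (i, j), c in pairs.items()) / n
    expected = sum(
        abs(i - j) / (k - 1) * ref_counts[i] * pred_counts[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        return float("nan")  # undefined when both raters use a single category
    return 1 - observed / expected


def f1_per_group(reference, predicted):
    """One-vs-rest F1 score for each risk group."""
    ref_groups = [risk_group(r) for r in reference]
    pred_groups = [risk_group(p) for p in predicted]
    scores = {}
    for group in ("low", "intermediate", "high"):
        tp = sum(r == group and p == group for r, p in zip(ref_groups, pred_groups))
        fp = sum(r != group and p == group for r, p in zip(ref_groups, pred_groups))
        fn = sum(r == group and p != group for r, p in zip(ref_groups, pred_groups))
        denom = 2 * tp + fp + fn
        scores[group] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels for illustration only (not taken from the study).
toy_reference = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
toy_predicted = [1, 2, 2, 3, 4, 4, 4, 5, 5, 4]
print(weighted_kappa(toy_reference, toy_predicted))
print(f1_per_group(toy_reference, toy_predicted))
```

In the toy data, a single PI-RADS 3 report assigned to category 4 lowers the intermediate-group F1 much more than it lowers kappa, since the group contains only two reports. That small-class sensitivity is worth keeping in mind when reading the PI-RADS 3 scores above.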

Topics

Prostate Cancer Diagnosis and Treatment · Artificial Intelligence in Healthcare and Education · AI in cancer detection
