OpenAlex · Updated hourly · Last updated: 2026-03-28, 09:36

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large-Scale Evaluation of Machine Learning Models in Identifying Follow-Up Recommendations in Radiology Reports

2025 · 0 citations · Radiology

0 citations · 16 authors · 2025

Abstract

Background: Radiology reports often contain follow-up recommendations vital for optimal patient care, prevention of complications, and mitigation of legal risk. However, there is a lack of comprehensive methods for comparing approaches to identifying these recommendations across a large volume of reports from various modalities, including open-source large language models.

Purpose: To evaluate the performance of machine learning (ML) models, including Meta's open-source LLAMA3 and OpenAI's Health Insurance Portability and Accountability Act-compliant Generative Pre-trained Transformer, in identifying follow-up recommendations in radiology reports.

Materials and Methods: In this retrospective study, three sets of radiology reports were analyzed across multiple imaging modalities from a large urban academic medical center: an expert-annotated dataset (n = 11 901) from January 1 to January 10, 2015; a dataset (n = 32 959) extracted through regular expressions (ie, sequences of characters that define search patterns in text) from January 11, 2015, to January 1, 2017; and a dataset (n = 4909) annotated during dictation from September 8, 2018, to February 23, 2021. To assess generalization on impressions, two expertly annotated datasets were used: 2000 chest radiography reports from the publicly available MIMIC-CXR database for external testing and 100 institutional CT reports from January 1 to January 15, 2024, for temporal testing. Thirty-two text classification methods were evaluated separately on the findings and impression sections of these reports. Performance metrics included precision, recall, accuracy, and F1 score, with 95% bootstrapped CIs, as well as areas under the precision-recall curve. Statistical comparisons were performed by using the McNemar test.
Results: The study included 49 769 reports from 35 509 patients (mean age, 52.2 years ± 22.0 [SD]; 18 477 female patients) for training (n = 37 140), validation (n = 2584), and internal testing (n = 10 045). For the findings section, a generative-discriminative model initialized with Google's Word2vec embeddings (Hybrid-google) achieved the highest F1 score (0.835; 95% CI: 0.825, 0.845). For the impression section, an attention-based bidirectional long short-term memory (LSTM) with random initialization (AttBiLSTM-random) performed best, with an F1 score of 0.979 (95% CI: 0.976, 0.982). Prefixed prompting with GPT-4 demonstrated superior external and temporal generalization performance on the MIMIC-CXR and institutional CT datasets, achieving F1 scores of 0.969 (95% CI: 0.961, 0.977) and 0.973 (95% CI: 0.937, 1.000), respectively.

Conclusion: ML models showed promise for automating the classification of follow-up recommendations in radiology reports. © RSNA, 2025. Supplemental material is available for this article.
