OpenAlex · Updated hourly · Last updated: 2026-03-28, 09:36

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large-Scale Evaluation of Machine Learning Models in Identifying Follow-Up Recommendations in Radiology Reports

2025 · 0 citations · Radiology

0 citations · 16 authors · 2025

Abstract

Background: Radiology reports often contain follow-up recommendations vital for optimal patient care, prevention of complications, and mitigation of legal risk. However, there is a lack of comprehensive methods for comparing approaches to identifying these recommendations across a large volume of reports from various modalities, including open-source large language models.

Purpose: To evaluate the performance of machine learning (ML) models, including Meta's open-source LLAMA3 and OpenAI's Health Insurance Portability and Accountability Act-compliant Generative Pre-trained Transformer, in identifying follow-up recommendations in radiology reports.

Materials and Methods: In this retrospective study, three sets of radiology reports were analyzed across multiple imaging modalities from a large urban academic medical center: an expert-annotated dataset (n = 11 901) from January 1 to January 10, 2015; a dataset (n = 32 959) extracted through regular expressions (ie, sequences of characters that define search patterns in text) from January 11, 2015, to January 1, 2017; and a dataset (n = 4909) annotated during dictation from September 8, 2018, to February 23, 2021. To assess generalization on impressions, two expertly annotated datasets were used: 2000 chest radiography reports from the publicly available MIMIC-CXR database for external testing and 100 institutional CT reports from January 1 to January 15, 2024, for temporal testing. Thirty-two text classification methods were evaluated separately on the findings and impression sections of these reports. Performance metrics included precision, recall, accuracy, and F1 score, with 95% bootstrapped CIs, as well as areas under the precision-recall curve. Statistical comparisons were performed by using the McNemar test.
Results: The study included 49 769 reports from 35 509 patients (mean age, 52.2 years ± 22.0 [SD]; 18 477 female patients) for training (n = 37 140), validation (n = 2584), and internal testing (n = 10 045). For the findings section, a generative-discriminative model initialized with Google's Word2vec embeddings (Hybrid-google) achieved the highest F1 score (0.835; 95% CI: 0.825, 0.845). For the impression section, an attention-based bidirectional long short-term memory (LSTM) with random initialization (AttBiLSTM-random) performed best, with an F1 score of 0.979 (95% CI: 0.976, 0.982). Prefixed prompting with GPT-4 demonstrated superior external and temporal generalization performance on the MIMIC-CXR and institutional CT datasets, achieving F1 scores of 0.969 (95% CI: 0.961, 0.977) and 0.973 (95% CI: 0.937, 1.000), respectively.

Conclusion: ML models showed promise for automating the classification of follow-up recommendations in radiology reports. © RSNA, 2025. Supplemental material is available for this article.
