Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Comparing large language models and text embedding models for automated classification of textual, semantic, and critical changes in radiology reports
1
Zitationen
12
Autoren
2025
Jahr
Abstract
PURPOSE: Radiology reports can change during workflows, especially when residents draft preliminary versions that attending physicians finalize. We explored how large language models (LLMs) and embedding techniques can categorize these changes into textual, semantic, or clinically actionable types. METHODS: , precision, and recall. RESULTS: Inter-rater reliability among evaluators was excellent (κ = 0.990). Of the reports analyzed, 1.3 % contained critical changes. The tested methods showed significant performance differences (P < 0.001). The Qwen3-235B-A22B model using a zero-shot prompt, most closely aligned with human assessments of changes in clinical reports, achieving a κ of 0.822 (SD 0.031). The best conventional metric, word difference, had a κ of 0.732 (SD 0.048), the difference between the two showed statistical significance in unadjusted post-hoc tests (P = 0.038) but lost significance after adjusting for multiple testing (P = 0.064). Embedding models underperformed compared to LLMs and classical methods, showing statistical significance in most cases. CONCLUSION: Large language models like Qwen3-235B-A22B demonstrated moderate to strong alignment with expert evaluations of the clinical significance of changes in radiology reports. LLMs outperformed embedding methods and traditional string and word approaches, achieving statistical significance in most instances. This demonstrates their potential as tools to support peer review.
Ähnliche Arbeiten
Refinement and reassessment of the SERVQUAL scale.
1991 · 3.967 Zit.
Radiobiology for the Radiologist.
1974 · 3.502 Zit.
ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee
2017 · 2.441 Zit.
Accuracy of Physician Self-assessment Compared With Observed Measures of Competence
2006 · 2.329 Zit.
Technology as an Occasion for Structuring: Evidence from Observations of CT Scanners and the Social Order of Radiology Departments
1986 · 2.254 Zit.