This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Development and Performance of a Large Language Model for the Quality Evaluation of Multi‐Language Medical Imaging Guidelines and Consensus
Citations: 1
Authors: 12
Year: 2025
Abstract
AIM: This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, focusing on enhancing evaluation efficiency, consistency, and reducing manual workload.
METHOD: We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included consistency with expert assessments, hybrid search paragraph matching accuracy, information completeness, comparisons of different paragraph matching approaches, and cost-time efficiency.
RESULTS: The algorithm demonstrated an average accuracy of 77%, excelling in simpler tasks but showing lower accuracy (29%-40%) in complex evaluations, such as explanations and visual aids. The average accuracy rates of the English and Chinese versions of the GACS were 74% and 76%, respectively (p = 0.37). Hybrid search demonstrated superior performance with paragraph matching accuracy (4.42) and information completeness (4.42), significantly outperforming keyword-based search (1.05/1.05) and sparse-dense retrieval (4.26/3.63). The algorithm significantly reduced evaluation time to 8 min and 30 s per guideline and reduced costs to approximately 0.5 USD per guideline, offering a considerable advantage over traditional manual methods.
CONCLUSION: The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.
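The abstract names hybrid search for paragraph matching but does not describe how it is implemented. As a rough illustration of the general idea only, and not the authors' QPC-HASE-GuidelineEval method, the sketch below ranks guideline paragraphs by a weighted mix of a keyword-overlap score and a vector-similarity score. The function names, the `alpha` weight, and the bag-of-words vectors (a stand-in for a real embedding model) are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_score(query, paragraph):
    """Fraction of distinct query tokens that also occur in the paragraph."""
    q_tokens = set(tokenize(query))
    p_tokens = set(tokenize(paragraph))
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(query, paragraphs, alpha=0.5):
    """Rank paragraphs by alpha * keyword score + (1 - alpha) * vector score.

    Assumption: bag-of-words counts stand in for dense embeddings here;
    a real system would replace them with vectors from an embedding model.
    """
    query_vec = Counter(tokenize(query))
    scored = []
    for paragraph in paragraphs:
        vec_score = cosine(query_vec, Counter(tokenize(paragraph)))
        score = alpha * keyword_score(query, paragraph) + (1 - alpha) * vec_score
        scored.append((score, paragraph))
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical guideline paragraphs and an evaluation question.
paragraphs = [
    "The guideline was funded by a national grant.",
    "MRI protocols are described in section 3.",
    "Authors declare no conflicts of interest.",
]
ranking = hybrid_rank("Who funded the guideline?", paragraphs)
best_score, best_paragraph = ranking[0]
```

In a pipeline of this kind, the top-ranked paragraph would then be passed to the LLM together with the evaluation question. The weighting between the two scores is a tuning choice, and the value used here is arbitrary.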
Similar works
Refinement and reassessment of the SERVQUAL scale.
1991 · 3,967 citations
Radiobiology for the Radiologist.
1974 · 3,502 citations
ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee
2017 · 2,432 citations
Accuracy of Physician Self-assessment Compared With Observed Measures of Competence
2006 · 2,326 citations
Technology as an Occasion for Structuring: Evidence from Observations of CT Scanners and the Social Order of Radiology Departments
1986 · 2,251 citations