This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Key aspects of fine-tuning and applying LLM-as-a-judge for clinical data summaries in the radiological workflow
Citations: 0
Authors: 9
Year: 2026
Abstract
Background: This study describes our experience in fine-tuning an LLM-as-a-Judge to evaluate the quality of clinical text summarization in radiology and formalizes the main problems encountered in solving this task.

Methods: Information from the Russian-language electronic medical records of 30 patients who underwent abdominal computed tomography was used. Anonymized information about complaints, disease history, medical history, and laboratory and instrumental findings was extracted from the records and summarized by six large language models. The resulting summaries were then evaluated by experts and by six different LLMs-as-a-Judges. Kendall's coefficient of concordance was employed to measure consistency.

Results: The primary difficulties encountered in developing the LLM-as-a-Judge were the selection of the rating scale, the choice of evaluation criteria, the composition of the expert team, and prompt granularity. No definitive association was identified between scale size and the consistency of ratings between radiologist experts and LLMs-as-a-Judges: across the different evaluation criteria, the highest consistency was achieved with varying scale sizes. The results indicate that criteria effective for human text evaluation are not always suitable for assessment via an LLM-as-a-Judge. For the majority of the criteria, consistency was highest when all LLMs-as-a-Judges operated either with a detailed description of the extreme scale values or without a detailed scale description in the prompt. Effective development of an LLM judge requires the involvement of a diverse team of experts.

Conclusion: Proper configuration of an LLM-as-a-Judge depends on numerous factors, the number of which varies with the specific task. To achieve optimal results, additional experiments should be conducted to fine-tune the prompt and other model hyperparameters, comparing their responses against the desired output.

Clinical trial registration: ClinicalTrials.gov, identifier NCT07057830.
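The abstract reports agreement between radiologist experts and LLM judges via Kendall's coefficient of concordance (W), but does not spell out the computation. A minimal sketch of the standard formula (without tie correction), assuming a ratings matrix of m raters by n summaries:

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for an (m raters x n items) matrix.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of per-item rank sums from their mean. W ranges from
    0 (no agreement) to 1 (perfect agreement among raters).
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    # Rank each rater's scores across items (ties receive average ranks).
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    rank_sums = ranks.sum(axis=0)          # one rank sum per item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m**2 * (n**3 - n))

# Hypothetical example: 3 raters score 4 summaries in identical order.
scores = [[1, 2, 3, 4],
          [1, 2, 3, 4],
          [1, 2, 3, 4]]
print(kendalls_w(scores))  # perfect agreement -> 1.0
```

The ratings and rater counts here are illustrative only; the study's actual scoring data is not reproduced. Note that this basic form omits the correction term for tied ranks, which would be needed for scales where many items share a score.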
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,380 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,243 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,671 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,496 cit.