This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Key aspects of fine-tuning and applying LLM-as-a-judge for clinical data summaries in the radiological workflow
Citations: 0
Authors: 9
Year: 2026
Abstract
Background: This study describes our experience in fine-tuning an LLM-as-a-Judge to evaluate the quality of clinical text summarization in radiology and formalizes the main problems encountered in solving this task.

Methods: Information from the Russian-language electronic medical records of 30 patients who underwent abdominal computed tomography was used. Anonymized information about complaints, disease history, medical history, and laboratory and instrumental findings was extracted from the records and summarized by six large language models. The resulting summaries were then evaluated by experts and by six different LLMs-as-a-Judges. Kendall's coefficient of concordance was employed to measure consistency.

Results: The primary difficulties encountered in developing the LLM-as-a-Judge were the selection of the rating scale, the choice of evaluation criteria, the composition of the expert team, and prompt granularity. No definitive association was identified between scale size and the consistency of ratings between radiologist experts and LLMs-as-a-Judges: across the different evaluation criteria, the highest consistency was achieved with varying scale sizes. The results indicate that criteria effective for human text evaluation are not always suitable for assessment via an LLM-as-a-Judge. For the majority of the criteria, consistency was highest when all LLMs-as-a-Judges operated either with a detailed description of the extreme scale values or without a detailed scale description in the prompt. Effective development of an LLM judge requires the involvement of a diverse team of experts.

Conclusion: Proper configuration of an LLM-as-a-Judge depends on numerous factors, the number of which varies with the specific task. To achieve optimal results, additional experiments should be conducted to fine-tune the prompt and other model hyperparameters, comparing their responses against the desired output.

Clinical trial registration: ClinicalTrials.gov, identifier NCT07057830.
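The abstract reports agreement between radiologist experts and LLM judges via Kendall's coefficient of concordance (W), but does not spell out the computation. A minimal sketch of the standard formula (without tie correction), assuming a ratings matrix of m raters by n summaries:

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for an (m raters x n items) matrix.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of per-item rank sums from their mean. W ranges from
    0 (no agreement) to 1 (perfect agreement among raters).
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    # Rank each rater's scores across items (ties receive average ranks).
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    rank_sums = ranks.sum(axis=0)          # one rank sum per item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m**2 * (n**3 - n))

# Hypothetical example: 3 raters score 4 summaries in identical order.
scores = [[1, 2, 3, 4],
          [1, 2, 3, 4],
          [1, 2, 3, 4]]
print(kendalls_w(scores))  # perfect agreement -> 1.0
```

The ratings and rater counts here are illustrative only; the study's actual scoring data is not reproduced. Note that this basic form omits the correction term for tied ranks, which would be needed for scales where many items share a score.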
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,380 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,243 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,671 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,496 cit.