This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Evaluating Consistency and Accuracy of GPT-4 Omni to Analyze Thyroid Ultrasound Features and ACR TR Categories to Aid Report Generation

2026 · 0 citations · Current Medical Imaging (formerly Current Medical Imaging Reviews) · Open Access

Abstract

INTRODUCTION: Multimodal large language models, including GPT-4 Omni (GPT-4o), have been applied to facilitate healthcare processes, but their capacity to interpret thyroid sonography images to aid report generation, and ways to improve it, remain unclear. METHODS: A total of 120 thyroid nodules were retrospectively included to evaluate how GPT-4o analyzes ultrasound features and ACR TI-RADS (TR) categories (2017 version). In a zero-shot setting, 80 original images of unmarked nodules (zero-shot unmarked group) and images with nodule boundaries manually outlined with red circles by senior radiologists (zero-shot marked group) were each input into GPT-4o for 3 attempts with identical prompts and no examples. In a few-shot setting, another 40 images with manually marked nodule boundaries (few-shot marked group) were input after 3 examples had been provided. The gold standard for marking was established by 2 senior radiologists with over 10 years of experience in thyroid sonography. The consistency of GPT-4o was evaluated using Gwet's agreement coefficient (AC1). The mean accuracy of GPT-4o was compared across settings, and against the mean accuracy of 2 junior radiologists with 1 and 3 years of experience in thyroid sonography, respectively, using the Mann-Whitney test with Bonferroni correction. RESULTS: The AC1 values were 0.466 [0.367, 0.564], 0.778 [0.696, 0.860], and 0.823 [0.711, 0.934] for the zero-shot unmarked, zero-shot marked, and few-shot marked groups, respectively. The mean accuracy of the 3 groups in assigning TR categories was 18.75% [13.78%, 23.72%], 42.50% [36.20%, 48.80%], and 79.17% [71.80%, 86.54%], respectively. The zero-shot marked group outperformed the zero-shot unmarked group, and the few-shot setting performed even better (p<0.001). In particular, segmentation helped GPT-4o identify the composition, shape, and margin of nodules, and the few-shot setting helped it identify echogenicity, margin, and calcification (p<0.001).
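As an illustration of the consistency measure named in the Methods, the following is a minimal sketch of Gwet's AC1 for repeated categorical ratings, treating each of the 3 attempts on an image as one "rater". The function name `gwet_ac1`, the category labels, and the example ratings are hypothetical and are not taken from the study; the abstract does not describe the software the authors used.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for several ratings per subject.

    ratings: one list of labels per image (one label per repeated attempt).
    categories: all labels that could have been assigned (q >= 2).
    """
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(ratings)
    agreement_sum = 0.0
    share = {k: 0.0 for k in categories}  # running sum of per-image label shares
    for labels in ratings:
        r = len(labels)
        if r < 2:
            raise ValueError("each image needs at least two ratings")
        counts = Counter(labels)
        if any(k not in share for k in counts):
            raise ValueError("label outside the declared categories")
        # Fraction of rating pairs on this image that agree.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for k, c in counts.items():
            share[k] += c / r
    p_a = agreement_sum / n                      # observed agreement
    pi = [v / n for v in share.values()]         # mean share of each category
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)  # chance agreement under AC1
    return (p_a - p_e) / (1 - p_e)


# Hypothetical example: 4 images, 3 attempts each (not data from the study).
CATEGORIES = ["TR1", "TR2", "TR3", "TR4", "TR5"]
example = [
    ["TR4", "TR4", "TR4"],
    ["TR3", "TR4", "TR3"],
    ["TR5", "TR5", "TR5"],
    ["TR2", "TR3", "TR4"],
]
print(round(gwet_ac1(example, CATEGORIES), 3))  # prints 0.496
```

Values near 1 indicate that repeated attempts almost always give the same answer; the confidence intervals reported in the abstract would need a variance estimator that this sketch leaves out.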
Compared with junior radiologists, the few-shot marked group achieved similar accuracy in identifying composition, echogenicity, calcification, and TR categories (p>0.05) and higher accuracy in identifying the margin of thyroid nodules (p=0.004). DISCUSSION: GPT-4o's performance in analyzing original images of thyroid nodules was insufficient, possibly owing to incorrect nodule recognition and the lack of a standardized reference. With segmentation and a few-shot setting, its performance improved significantly. CONCLUSION: The consistency and accuracy of GPT-4o in analyzing thyroid sonography images improved stepwise with segmentation and a few-shot setting, reaching the level of junior radiologists in this preliminary study. This could benefit report generation, although multicenter validation is needed.
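The bracketed intervals around the TR-category accuracies in the Results can be sanity-checked with a short calculation. The sketch below rests on assumptions the abstract does not state: each attempt on each image counts as one correct/incorrect judgment (80 × 3 = 240 or 40 × 3 = 120 judgments per group), the correct counts (45, 102, 95) are back-calculated from the reported percentages, and the interval is a two-sided 95% Student's t interval on the mean. Under those assumptions the output agrees with the reported intervals to within rounding, which suggests, but does not confirm, that a similar method was used.

```python
import math

# Two-sided 95% critical values of Student's t, rounded from standard tables.
# Assumption: the abstract does not say which interval method was used.
T_CRIT = {239: 1.9699, 119: 1.9801}


def mean_accuracy_ci(correct, total):
    """Mean of 0/1 outcomes with a t-based 95% interval (sample SD, n - 1)."""
    p = correct / total
    sample_sd = math.sqrt(p * (1 - p) * total / (total - 1))
    half_width = T_CRIT[total - 1] * sample_sd / math.sqrt(total)
    return p, p - half_width, p + half_width


# Counts back-calculated from the reported percentages (an assumption).
groups = {
    "zero-shot unmarked": (45, 240),
    "zero-shot marked": (102, 240),
    "few-shot marked": (95, 120),
}
for name, (correct, total) in groups.items():
    p, lo, hi = mean_accuracy_ci(correct, total)
    print(f"{name}: {100 * p:.2f}% [{100 * lo:.2f}%, {100 * hi:.2f}%]")
```

The hard-coded critical values keep the sketch free of third-party dependencies; with SciPy available, `scipy.stats.t.ppf(0.975, df)` would give the same values to more digits.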

Topics

Radiology practices and education · Artificial Intelligence in Healthcare and Education · Thyroid Cancer Diagnosis and Treatment
