This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating Consistency and Accuracy of GPT-4 Omni to Analyze Thyroid Ultrasound Features and ACR TR Categories to Aid Report Generation
Citations: 0
Authors: 9
Year: 2026
Abstract
INTRODUCTION: Multimodal large language models, including GPT-4 Omni (GPT-4o), have been applied to support healthcare processes, but their capacity to interpret thyroid sonography images to aid report generation, and ways to improve it, remain unclear.
METHODS: A total of 120 thyroid nodules were retrospectively included to evaluate GPT-4o's analysis of ultrasound features and ACR TR categories (2017 version). In a zero-shot setting, 80 original images of unmarked nodules (zero-shot unmarked group) and images with the nodule boundaries marked with red circles by senior radiologists (zero-shot marked group) were each input into GPT-4o 3 times with identical prompts and without examples. In a few-shot setting, another 40 images with marked nodule boundaries (few-shot marked group) were input after 3 examples. The gold standard for marking was established by 2 senior radiologists with over 10 years of experience in thyroid sonography. The consistency of GPT-4o was evaluated using Gwet's agreement coefficient (AC1). The mean accuracy of GPT-4o was compared across settings, and against the mean accuracy of 2 junior radiologists with 1 and 3 years of experience in thyroid sonography, respectively, using the Mann-Whitney test with Bonferroni correction.
RESULTS: The AC1 values were 0.466 [0.367, 0.564], 0.778 [0.696, 0.860], and 0.823 [0.711, 0.934] for the zero-shot unmarked, zero-shot marked, and few-shot marked groups, respectively. The mean accuracy of the 3 groups in judging TR categories was 18.75% [13.78%, 23.72%], 42.50% [36.20%, 48.80%], and 79.17% [71.80%, 86.54%], respectively. The zero-shot marked group outperformed the zero-shot unmarked group, and the few-shot setting performed better still (p<0.001). In particular, segmentation helped GPT-4o detect the composition, shape, and margin of nodules, and the few-shot setting helped it detect echogenicity, margin, and calcification (p<0.001). Compared with the junior radiologists, the few-shot marked group achieved similar accuracy in identifying composition, echogenicity, calcification, and TR categories (p>0.05) and higher accuracy in identifying the margin of thyroid nodules (p=0.004).
DISCUSSION: GPT-4o's performance in analyzing original images of thyroid nodules was insufficient, possibly owing to incorrect nodule recognition and the lack of a standardized reference. After segmentation methods and a few-shot setting were adopted, its performance improved significantly.
CONCLUSION: GPT-4o's consistency and accuracy in analyzing thyroid sonography images can be progressively improved by segmentation methods and a few-shot setting, reaching a junior-radiologist level in this preliminary study. This could benefit report generation, although multicenter validation is needed.
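As a rough illustration of the consistency measure named in the methods, the following minimal sketch computes Gwet's AC1 for repeated categorical ratings using the standard multi-rater formula. The abstract does not say which software or variant the authors used, so the function name `gwet_ac1`, its arguments, and the sample ratings are all hypothetical, and the 3 repeated GPT-4o attempts are treated here as 3 raters.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for categorical ratings (illustrative sketch only).

    ratings: list of items, each a list of the labels given to that item
             (here, hypothetically, the 3 repeated GPT-4o attempts).
    categories: every label that could have been assigned.
    """
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two possible categories")

    allowed = set(categories)
    agreement_terms = []  # per-item observed agreement
    share_sums = {c: 0.0 for c in categories}
    n_items = 0

    for item in ratings:
        r = len(item)
        if r == 0:
            continue  # skip items that were never rated
        counts = Counter(item)
        if set(counts) - allowed:
            raise ValueError("rating outside the declared categories")
        n_items += 1
        for c in categories:
            share_sums[c] += counts[c] / r
        if r >= 2:
            # Fraction of rater pairs that agree on this item.
            pairs = sum(k * (k - 1) for k in counts.values())
            agreement_terms.append(pairs / (r * (r - 1)))

    if not agreement_terms:
        raise ValueError("need at least one item with two or more ratings")

    p_a = sum(agreement_terms) / len(agreement_terms)
    # Chance agreement from the average share of each category.
    p_e = sum(
        (s / n_items) * (1 - s / n_items) for s in share_sums.values()
    ) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Made-up example data: 4 nodules, 3 attempts each (not from the study).
attempts = [
    ["TR4", "TR4", "TR4"],
    ["TR3", "TR4", "TR3"],
    ["TR5", "TR5", "TR5"],
    ["TR2", "TR3", "TR4"],
]
print(gwet_ac1(attempts, ["TR1", "TR2", "TR3", "TR4", "TR5"]))
```

Passing the full category list explicitly matters because the chance-agreement term depends on the number of possible categories, not only on those that happen to appear in the data.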
Similar works
Refinement and reassessment of the SERVQUAL scale.
1991 · 3,967 citations
Radiobiology for the Radiologist.
1974 · 3,502 citations
ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee
2017 · 2,432 citations
Accuracy of Physician Self-assessment Compared With Observed Measures of Competence
2006 · 2,326 citations
Technology as an Occasion for Structuring: Evidence from Observations of CT Scanners and the Social Order of Radiology Departments
1986 · 2,251 citations