This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating Reasoning Effect for LLMs: Prompt Sensitivity and Text-Image Based Performance in Musculoskeletal Radiology
Citations: 0
Authors: 3
Year: 2026
Abstract
Multimodal large language models (LLMs) are increasingly applied in radiology, but the effect of reasoning capabilities across text- and image-based tasks remains unclear. We evaluated four multimodal LLMs—two non-reasoning (ChatGPT-4, Gemini 1.5 Pro) and two reasoning-capable (ChatGPT-5.1, Gemini 3)—using 50 text-based and 50 arrow-localized MSK radiographic anatomy questions, compared with two board-certified radiologists. Accuracy with 95% confidence intervals was calculated, and image-based errors were categorized. Reasoning-capable models outperformed non-reasoning models in text-based tasks, achieving near-ceiling accuracy (96% and 94%; all p≤0.008) with minimal prompt sensitivity. In image-based tasks, reasoning models performed better than non-reasoning models (70–72% vs 46–48%; p<0.001) but remained inferior to radiologists (88–90%). Errors were mainly adjacent-structure substitution and projection-related overlap. While reasoning enhances text-based performance and robustness, multimodal LLMs remain limited in fine-grained visual grounding and are best suited for supportive roles.
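The abstract reports accuracy with 95% confidence intervals but does not say how the intervals were computed. As a minimal sketch, assuming a Wilson score interval (one common choice for proportions near 0 or 1 with small samples, not necessarily the authors' method), the intervals for the image-based task could be approximated as follows. The correct-answer counts are back-calculated from the reported percentages and the 50 questions per task; the function name and the labels are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, not from the paper)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    # Two-sided critical value of the standard normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts derived from the abstract's percentages on 50 image-based questions.
# The labels are generic; the abstract does not map each value to a named model or reader.
image_task_correct = {
    "non-reasoning model (46%)": 23,
    "non-reasoning model (48%)": 24,
    "reasoning model (70%)": 35,
    "reasoning model (72%)": 36,
    "radiologist (88%)": 44,
    "radiologist (90%)": 45,
}

for label, correct in image_task_correct.items():
    low, high = wilson_interval(correct, 50)
    print(f"{label}: {correct}/50, 95% CI {low:.1%} to {high:.1%}")
```

With only 50 questions per task, each interval spans well over ten percentage points, which is worth keeping in mind when comparing models whose accuracies differ by a few points.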
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,758 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,666 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,220 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,896 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations