This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Generalist Large-Language Models for Spine Imaging Diagnostics: An Early Analysis of Detection Performance for Scoliosis and Lumbar Stenosis
0
Citations
15
Authors
2026
Year
Abstract
BACKGROUND: Web-based large language models (LLMs) are increasingly used by patients for medical self-assessment, but their efficacy in spine imaging diagnostics remains underexplored. This study systematically evaluated five leading multimodal LLMs (Grok 2, Grok 3, Grok 4, ChatGPT, and Gemini) for detecting scoliosis on radiographs and lumbar spinal stenosis on MRI.
METHODS: We assessed 171 full-length anterior-posterior radiographs (100 with scoliosis, 71 normal) and 200 axial T2-weighted lumbar spine MRIs (100 with severe stenosis, 100 normal) from public databases. Models were prompted without examples to identify pathology and to report their certainty (0-100%). Accuracy was compared using McNemar's test, and confidence levels were compared using ANOVA.
RESULTS: In scoliosis detection, Grok 4 achieved the highest accuracy (0.942), followed by Gemini (0.912), Grok 2 (0.890), ChatGPT (0.643), and Grok 3 (0.637). For stenosis, Gemini performed best (0.600), then Grok 4 (0.575), ChatGPT (0.545), Grok 2 (0.500), and Grok 3 (0.450). All models reported mean certainty above 70% (SD <5.3%) for both pathologies. ChatGPT and Grok 3 demonstrated reduced confidence in erroneous scoliosis responses (p<0.0001), while only ChatGPT did so for stenosis. Gemini reported elevated confidence in incorrect stenosis responses (p<0.0001).
CONCLUSIONS: LLMs can perform well in scoliosis detection but struggle to identify lumbar stenosis. ChatGPT's superior confidence calibration suggests enhanced reliability. Performance inconsistencies across model iterations (e.g., Grok 3 underperforming Grok 2) underscore the need for specialized medical imaging training. Although promising for patient education in simple spine conditions, substantial advancements in accuracy and confidence metrics are essential prior to clinical adoption or broad patient utilization.
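The abstract names McNemar's test for comparing accuracy but gives no per-image paired results. As a minimal sketch of how such a paired comparison between two models could be computed, the following uses the exact (binomial) form of the test. The function name `mcnemar_exact` and the discordant counts are hypothetical illustrations, not values or code from the study, and the study may have used a different variant of the test.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: images model A classified correctly and model B incorrectly
    c: images model B classified correctly and model A incorrectly
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair favours either model
    # with probability 0.5, so min(b, c) follows Binomial(n, 0.5).
    # Double the lower tail for a two-sided p-value, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for illustration only; not from the study.
p_value = mcnemar_exact(b=15, c=5)
print(f"exact McNemar p-value: {p_value:.4f}")
```

Only the discordant pairs enter the test: images that both models classify correctly, or both incorrectly, carry no information about which model is more accurate.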
Similar works
A survey on deep learning in medical image analysis
2017 · 14,029 citations
nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation
2020 · 8,179 citations
Calculation of average PSNR differences between RD-curves
2001 · 4,093 citations
Magnetic Resonance Classification of Lumbar Intervertebral Disc Degeneration
2001 · 3,943 citations
Vertebral fracture assessment using a semiquantitative technique
1993 · 3,636 citations