This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Performance and reliability of state-of-the-art LLMs in complex hand surgery scenarios: A prospective cross-sectional, double-blinded study
Citations: 0 · Authors: 1 · Year: 2026
Abstract
Background: Integrating large language models (LLMs) into decision-making and education has shown promise across various healthcare disciplines. This study aimed to evaluate the performance of leading LLMs (ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1) in accurately responding to structured multiple-choice and open-ended queries about complex case scenarios in hand surgery.
Methods: A prospective cross-sectional analysis used 50 clinically relevant, guideline-based case scenarios developed for hand surgery. Each scenario consisted of four open-ended and two multiple-choice questions, totaling 300 points per LLM. Responses were independently assessed by blinded expert reviewers using a standardized six-point Likert scale evaluating accuracy, completeness, and adherence to international surgical guidelines.
Results: In multiple-choice queries, Gemini (5.9 ± 0.2) and Grok (5.9 ± 0.1) outperformed ChatGPT (5.7 ± 0.3; p = 0.031 and p = 0.009, respectively) and DeepSeek (5.6 ± 0.4; p = 0.004 and p = 0.001, respectively). In open-ended queries, Gemini (5.6 ± 0.3 accuracy) and Grok (5.5 ± 0.4 accuracy) demonstrated superior results across all measured dimensions (accuracy, completeness, and guideline adherence), markedly surpassing ChatGPT (5.1 ± 0.5 accuracy; p < 0.001) and DeepSeek (4.9 ± 0.6 accuracy; p < 0.001). Notably, Gemini and Grok demonstrated consistently high performance with minimal variability, while ChatGPT and, particularly, DeepSeek exhibited considerable inconsistency in complex clinical judgments.
Conclusion: Gemini 2 and Grok 3 showed reliable and clinically relevant performance, positioning them as promising adjunctive tools for decision-making and education in hand surgery. The limitations of ChatGPT-5 and the significant shortcomings of DeepSeek underscore the need for cautious deployment and continued refinement.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations