OpenAlex · Updated hourly · Last updated: 28.03.2026, 03:11

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance and reliability of state-of-the-art LLMs in complex hand surgery scenarios: A prospective cross-sectional, double-blinded study

2026 · 0 citations · Journal of orthopaedic surgery · Open Access

Citations: 0 · Authors: 1 · Year: 2026

Abstract

Background: Integrating large language models (LLMs) into decision-making and education has shown promise across various healthcare disciplines. This study aimed to evaluate how accurately leading LLMs (ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1) respond to structured multiple-choice and open-ended queries about complex case scenarios in hand surgery.

Methods: A prospective cross-sectional analysis used 50 clinically relevant, guideline-based case scenarios developed for hand surgery. Each scenario consisted of four open-ended and two multiple-choice questions, totaling 300 points per LLM. Responses were independently assessed by blinded expert reviewers using a standardized six-point Likert scale evaluating accuracy, completeness, and adherence to international surgical guidelines.

Results: In multiple-choice queries, Gemini (5.9 ± 0.2) and Grok (5.9 ± 0.1) outperformed ChatGPT (5.7 ± 0.3; p = 0.031 and p = 0.009, respectively) and DeepSeek (5.6 ± 0.4; p = 0.004 and p = 0.001, respectively). In open-ended queries, Gemini (5.6 ± 0.3 accuracy) and Grok (5.5 ± 0.4 accuracy) demonstrated superior results across all measured dimensions (accuracy, completeness, and guideline adherence), markedly surpassing ChatGPT (5.1 ± 0.5 accuracy; p < 0.001) and DeepSeek (4.9 ± 0.6 accuracy; p < 0.001). Notably, Gemini and Grok demonstrated consistently high performance with minimal variability, while ChatGPT and, in particular, DeepSeek exhibited considerable inconsistency in complex clinical judgments.

Conclusion: Gemini 2 and Grok 3 showed reliable and clinically relevant performance, positioning them as promising adjunctive tools for decision-making and education in hand surgery. The limitations of ChatGPT-5 and the significant shortcomings of DeepSeek underscore the necessity for cautious deployment and continued refinement.

Topics

Artificial Intelligence in Healthcare and Education · Diversity and Career in Medicine · Clinical Reasoning and Diagnostic Skills