This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Performance and reliability of state-of-the-art LLMs in complex hand surgery scenarios: A prospective cross-sectional, double-blinded study
Citations: 0 · Authors: 1 · Year: 2026
Abstract
Background: Integrating large language models (LLMs) into decision-making and education has shown promise across various healthcare disciplines. This study aimed to evaluate the performance of leading LLMs (ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1) in accurately responding to structured multiple-choice and open-ended queries about complex case scenarios in hand surgery.
Methods: A prospective cross-sectional analysis used 50 clinically relevant, guideline-based case scenarios developed for hand surgery. Each scenario consisted of four open-ended and two multiple-choice questions, totaling 300 points per LLM. Responses were independently assessed by blinded expert reviewers using a standardized six-point Likert scale evaluating accuracy, completeness, and adherence to international surgical guidelines.
Results: In multiple-choice queries, Gemini (5.9 ± 0.2) and Grok (5.9 ± 0.1) outperformed ChatGPT (5.7 ± 0.3; p = 0.031 and p = 0.009, respectively) and DeepSeek (5.6 ± 0.4; p = 0.004 and p = 0.001, respectively). In open-ended queries, Gemini (5.6 ± 0.3 accuracy) and Grok (5.5 ± 0.4 accuracy) demonstrated superior results across all measured dimensions (accuracy, completeness, and guideline adherence), markedly surpassing ChatGPT (5.1 ± 0.5 accuracy; p < 0.001) and DeepSeek (4.9 ± 0.6 accuracy; p < 0.001). Notably, Gemini and Grok demonstrated consistently high performance with minimal variability, while ChatGPT and, particularly, DeepSeek exhibited considerable inconsistency in complex clinical judgments.
Conclusion: Gemini 2 and Grok 3 showed reliable and clinically relevant performance, positioning them as promising adjunctive tools for decision-making and education in hand surgery. The limitations of ChatGPT-5 and the significant shortcomings of DeepSeek underscore the need for cautious deployment and continued refinement.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations