This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI
Citations: 0
Authors: 12
Year: 2026
Abstract
Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks (including multimodal data integration, human interaction, and physical effects), generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

Results Summary
We present findings from six experiments.
(1) We evaluate zero-shot surgical tool detection performance across 19 open-weight Vision Language Models (VLMs) from 2023 to early 2026 on SDSC-EEA, a large video dataset consisting of endoscopic endonasal approach (EEA) neurosurgical procedures. Despite dramatic increases in model scale and benchmark scores, only one model marginally exceeds the 13.4% majority class baseline on the validation set.
(2) We fine-tune Gemma 3 27B with LoRA adapters to generate structured JSON predictions. The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
(3) We replace off-the-shelf JSON generation with a specialized classification head. This approach achieves 51.08% exact match accuracy.
(4) To assess the potential of increasing computational resources, we gradually increase the number of trainable parameters (by increasing LoRA rank) by nearly three orders of magnitude. While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.
(5) We compare zero-shot and fine-tuned VLM performance against YOLOv12-m, a specialized 26M-parameter object detection model. YOLOv12-m achieves 54.73% exact match accuracy, outperforming all VLM-based methods while using 1,000× fewer parameters.
(6) We demonstrate that these findings generalize to CholecT50, an independent and public dataset of laparoscopic cholecystectomy procedures, with additional comparisons to five proprietary frontier VLMs. The fine-tuned open-weight model and YOLOv12-m outperform all zero-shot VLM methods, including zero-shot methods using proprietary frontier VLMs.
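The two metrics that recur in the results, exact match accuracy and the majority class baseline, can be illustrated with a minimal sketch. It assumes each video frame is labeled with the set of tools visible in it, that a prediction counts as correct only when the predicted set equals the labeled set exactly, and that the baseline always predicts the single most frequent label set. The abstract does not spell out either definition, so both are assumptions, and the function names and tool names below are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Sequence

# One frame's label: the set of tool names visible in that frame (assumed format).
ToolSet = FrozenSet[str]


def exact_match_accuracy(predicted: Sequence[ToolSet], actual: Sequence[ToolSet]) -> float:
    """Fraction of frames whose predicted tool set equals the labeled set exactly."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one frame is required")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def majority_class_baseline(actual: Sequence[ToolSet]) -> float:
    """Exact match accuracy of always predicting the most frequent label set.

    Assumption: the majority class is taken from the same split it is scored on.
    """
    if not actual:
        raise ValueError("at least one frame is required")
    most_common_set, _ = Counter(actual).most_common(1)[0]
    return exact_match_accuracy([most_common_set] * len(actual), actual)


if __name__ == "__main__":
    # Hypothetical labels for four frames; not taken from SDSC-EEA or CholecT50.
    labels = [
        frozenset({"suction"}),
        frozenset({"suction"}),
        frozenset({"suction", "drill"}),
        frozenset(),
    ]
    predictions = [
        frozenset({"suction"}),
        frozenset({"drill"}),
        frozenset({"drill", "suction"}),
        frozenset(),
    ]
    print(exact_match_accuracy(predictions, labels))
    print(majority_class_baseline(labels))
```

Under this reading, exact match is a strict metric: a prediction that names one tool too many or too few scores zero for that frame, which is one reason a low majority class baseline such as 13.4% can still be hard to beat.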