This is an overview page with metadata for this scientific work. The full article is available from the publisher.
A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations
Citations: 0 · Authors: 21 · Year: 2026
Abstract
BACKGROUND AND OBJECTIVES: Traditional medical board examinations present clinical information in static vignettes with multiple-choice (MC) questions, fundamentally different from how physicians gather and integrate data in practice. Recent advances in large language models (LLMs) offer promising approaches to creating more realistic, interactive clinical conversations. However, these approaches are limited in neurosurgery, where patient communication capacity varies significantly and diagnosis relies heavily on objective data such as imaging and neurological examinations. We aimed to develop and evaluate a multi–artificial intelligence (AI) agent conversation framework for neurosurgical case assessment that enables realistic clinical interactions through simulated patients and structured access to objective clinical data. METHODS: We developed a framework to convert 608 Self-Assessment in Neurological Surgery first-order diagnosis questions into conversation sessions using 3 specialized AI agents: a patient AI for subjective information, a system AI for objective data, and a clinical AI for diagnostic reasoning. We evaluated generative pretrained transformer 4o's (GPT-4o's) diagnostic accuracy across traditional vignettes, patient-only conversations, and patient + system AI interactions, with human benchmark testing from 10 neurosurgery residents. RESULTS: GPT-4o showed significant performance drops from traditional vignettes to conversational formats in both MC (89.0%-60.9%, P < .0001) and free-response scenarios (78.4%-30.3%, P < .0001). Adding access to objective data through the system AI improved performance (to 67.4%, P = .0015; and 61.8%, P < .0001, respectively). Questions requiring image interpretation showed similar patterns but lower accuracy. Residents outperformed GPT-4o in free-response conversations (70.0% vs 28.3%, P = .0030) using fewer interactions and reported high educational value for the interactive format.
CONCLUSION: This multi-AI agent framework provides both a more challenging evaluation method for LLMs and an engaging educational tool for neurosurgical training. The significant performance drops in conversational formats suggest that traditional MC testing may overestimate LLMs' clinical reasoning capabilities, while the framework's interactive nature offers promising applications for enhancing medical education.
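The abstract describes a routing pattern in which a clinical reasoner queries two information sources: a patient AI holding subjective history and a system AI holding objective data such as imaging. The following is a minimal sketch of that conversation loop; all names (`Agent`, `run_session`) and the canned responses are hypothetical illustrations, not the authors' implementation, which would back each agent with an LLM call rather than a lookup table.

```python
# Hypothetical sketch of the three-agent conversation loop: a clinical
# reasoner routes each query either to the patient AI (subjective
# information) or the system AI (objective data). The dict lookup below
# stands in for an LLM call.
from dataclasses import dataclass, field


@dataclass
class Agent:
    """An agent that answers queries from a fixed knowledge store."""
    role: str                               # e.g. "patient" or "system"
    knowledge: dict = field(default_factory=dict)

    def respond(self, query: str) -> str:
        # Stand-in for a model call: look up the query topic directly.
        return self.knowledge.get(query, f"[{self.role}] no data on '{query}'")


def run_session(patient: Agent, system: Agent,
                questions: list[tuple[str, str]]) -> list[str]:
    """Route each (target, query) pair to the matching agent, as the
    clinical AI would when gathering data before offering a diagnosis."""
    transcript = []
    for target, query in questions:
        agent = patient if target == "patient" else system
        transcript.append(agent.respond(query))
    return transcript


# Toy case: history lives with the patient AI; imaging with the system AI.
patient_ai = Agent("patient", {"headache onset": "Sudden, worst of my life, 2 hours ago."})
system_ai = Agent("system", {"head CT": "Hyperdensity in the basal cisterns."})

transcript = run_session(patient_ai, system_ai,
                         [("patient", "headache onset"), ("system", "head CT")])
print(transcript[0])  # subjective information from the patient AI
print(transcript[1])  # objective data from the system AI
```

Separating the two stores mirrors the paper's rationale: the model must decide what to ask and where to ask it, instead of receiving a complete vignette up front.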
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations
Authors
- Karl L. Sangwon
- Jeff Zhang
- Robert Steele
- Jaden Stryker
- Jaehyuk Choi
- J. Jack Lee
- Daniel Alexander Alber
- Aly Valliani
- Nivedha Kannapadi
- James S. Ryoo
- Austin Y. Feng
- Hammad A. Khan
- Sean N. Neifert
- Cordelia Orillac
- Hannah Weiss
- Nora C. Kim
- David B. Kurland
- Howard A. Riina
- Douglas Kondziolka
- Michal Mankowski
- Eric K. Oermann