This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Quantifying the speed-accuracy trade-off of large language models on oral and maxillofacial surgery multiple-choice questions
Citations: 2
Authors: 6
Year: 2025
Abstract
Large language models (LLMs) such as GPT-4o, Copilot and Gemini are entering dental curricula, yet their suitability for real-time decision support remains unclear because most evaluations report accuracy alone. This prospective in silico diagnostic-accuracy study benchmarked six engines (GPT-4o, OpenAI o3, Copilot-Quick, Copilot-Deep, Gemini-Flash and Gemini-Pro) against 1766 single-best-answer multiple-choice questions from a contemporary oral and maxillofacial surgery (OMFS) board-review text. Textbook answer keys served as the reference standard. Overall and domain-level accuracy, intra-model answer consistency and per-batch response latency were recorded; χ² tests compared accuracies, and Kruskal-Wallis tests with multiplicity-adjusted Mann-Whitney U tests compared response times. Accuracy differed significantly across engines (χ² = 97.31, p < 0.001), ranging from 77.9% for Copilot-Quick to 88.3% for Gemini-Pro. Reasoning-optimised variants (o3, Copilot-Deep, Gemini-Pro) exceeded their speed-tuned counterparts by 3.8 to 6.2 percentage points, with the largest gains in the trauma, craniofacial deformity and orthognathic surgery domains. These improvements incurred a marked latency penalty: median response times of 2.1-3.1 s versus 0.1-0.2 s for the faster engines. Each additional 3-6 correct answers per 100 items therefore required roughly 2-3 s of extra processing. Items missed by all models clustered around rare numeric facts and negatively worded stems. Reasoning-optimised LLMs deliver clinically meaningful accuracy gains on OMFS board questions, but educators and clinicians must weigh this benefit against slower output and maintain expert oversight to mitigate residual knowledge gaps.
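The accuracy comparison reported above (a χ² test across six engines answering the same 1766 items) can be sketched as follows. The per-engine accuracies below are hypothetical values back-filled from the reported 77.9%-88.3% range, not the paper's actual data, and `chi_square_2xk` is an illustrative helper rather than the authors' analysis code.

```python
# Hypothetical sketch of a chi-square comparison of per-engine accuracies.
# All counts are illustrative, derived from the reported accuracy range.
N = 1766  # items answered by every engine

# Hypothetical per-engine accuracies spanning the reported 77.9%-88.3% range.
accuracies = {
    "Copilot-Quick": 0.779,
    "GPT-4o": 0.820,
    "Gemini-Flash": 0.830,
    "Copilot-Deep": 0.850,
    "OpenAI o3": 0.870,
    "Gemini-Pro": 0.883,
}

# Correct-answer counts implied by the hypothetical accuracies.
correct = {name: round(acc * N) for name, acc in accuracies.items()}

def chi_square_2xk(correct_counts, total_per_engine):
    """Pearson chi-square statistic for a 2 x k table of correct/incorrect
    counts, where every engine answered the same number of items."""
    k = len(correct_counts)
    pooled_accuracy = sum(correct_counts.values()) / (k * total_per_engine)
    stat = 0.0
    for c in correct_counts.values():
        expected_correct = pooled_accuracy * total_per_engine
        expected_wrong = (1 - pooled_accuracy) * total_per_engine
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((total_per_engine - c) - expected_wrong) ** 2 / expected_wrong
    return stat  # compare against a chi-square distribution with k-1 df

stat = chi_square_2xk(correct, N)
print(f"chi-square = {stat:.1f} on {len(correct) - 1} df")
```

With six engines, the statistic is referred to a χ² distribution with 5 degrees of freedom; a value of the magnitude reported in the abstract (97.31) is far beyond the p < 0.001 critical value of about 20.5.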
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations