Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Diagnostic Performance of Large Language Models in Multimodal Analysis of Radiolucent Jaw Lesions
8
Zitationen
2
Autoren
2025
Jahr
Abstract
INTRODUCTION AND AIMS: Large language models (LLMs), such as ChatGPT and Gemini, are increasingly being used in medical domains, including dental diagnostics. Despite advancements in image-based deep learning systems, LLM diagnostic capabilities in oral and maxillofacial surgery (OMFS) for processing multi-modal imaging inputs remain underexplored. Radiolucent jaw lesions represent a particularly challenging diagnostic category due to their varied presentations and overlapping radiographic features. This study evaluated diagnostic performance of ChatGPT 4o and Gemini 2.5 Pro using real-world OMFS radiolucent jaw lesion cases, presented in multiple-choice (MCQ) and short-answer (SAQ) formats across 3 imaging conditions: panoramic radiography only, panoramic + CT, and panoramic + CT + pathology. METHODS: Data from 100 anonymized patients at Wonkwang University Daejeon Dental Hospital were analyzed, including demographics, panoramic radiographs, CBCT images, histopathology slides, and confirmed diagnoses. Sample size was determined based on institutional case availability and statistical power requirements for comparative analysis. ChatGPT and Gemini diagnosed each case under 6 conditions using 3 imaging modalities (P, P+C, P+C+B) in MCQ and SAQ formats. Model accuracy was scored against expert-confirmed diagnoses by 2 independent evaluators. McNemar's and Cochran's Q tests evaluated statistical differences across models and imaging modalities. RESULTS: For MCQ tasks, ChatGPT achieved 66%, 73%, and 82% accuracies across the P, P+C, and P+C+B conditions, respectively, while Gemini achieved 57%, 62%, and 63%, respectively. In SAQ tasks, ChatGPT achieved 34%, 45%, and 48%; Gemini achieved 15%, 24%, and 28%, respectively. Accuracy improved significantly with additional imaging data for ChatGPT; ChatGPT consistently outperformed Gemini across all conditions (P < .001 for MCQ; P = .008 to < .001 for SAQ). MCQ format, which incorporates a human-in-the-loop (HITL) structure, showed higher overall performance than SAQ. CONCLUSION: ChatGPT demonstrated superior diagnostic performance compared to Gemini in OMFS diagnostic tasks when provided with richer multimodal inputs. Diagnostic accuracy increased with additional imaging data, especially in MCQ formats, suggesting LLMs can effectively synthesize radiographic and pathological data. CLINICAL RELEVANCE: LLMs have potential as diagnostic support tools for OMFS, especially in settings with limited specialist access. Presenting clinical cases in structured formats using curated imaging data enhances LLM accuracy and underscores HITL integration. Although current LLMs show promising results, further validation using larger datasets and hybrid AI systems are necessary for broader contextualised, clinical adoption.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.707 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.613 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8.159 Zit.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6.875 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.781 Zit.