This is an overview page with metadata for this scientific work. The full article is available from the publisher.
S3177 Evaluating Large Language Models for the Interpretation of ACG Guidelines on Premalignant Gastric Conditions: A Comparative Analysis of ChatGPT and DeepSeek
Citations: 0
Authors: 14
Year: 2025
Abstract
Introduction: Premalignant gastric conditions, such as intestinal metaplasia and atrophic gastritis, present a high risk of progression to gastric cancer if not managed according to evidence-based guidelines. The American College of Gastroenterology (ACG) guidelines published in 2025 provide recommendations on the diagnosis, surveillance, and management of premalignant gastric conditions. With the growing use of large language models (LLMs) like ChatGPT and DeepSeek by patients for medical advice, it is important to assess their accuracy in interpreting these guidelines. This study compares how well ChatGPT and DeepSeek responses align with the ACG recommendations on premalignant gastric conditions. Methods: We developed 40 questions based on the ACG guidelines addressing diagnosis, surveillance, and management. These were input into ChatGPT and DeepSeek. Two board-certified oncologists independently rated each response on accuracy, clarity, coherence, relevance, and completeness using a 5-point Likert scale. Scores were analyzed for inter-rater reliability and compared between models. Results: Inter-rater reliability was moderate but statistically significant (Pearson R = 0.443; P < 0.01). Overall, DeepSeek outperformed ChatGPT across all domains. Oncologist 1 scored DeepSeek 4.81 versus ChatGPT 4.62; Oncologist 2 scored DeepSeek 4.92 versus ChatGPT 4.61. For factual accuracy, DeepSeek scored 4.8 and 4.85, compared to ChatGPT’s 4.5 and 4.25. DeepSeek also performed better in clarity, coherence, relevance, and completeness (up to 4.95 vs ChatGPT’s 4.35–4.88). The most marked difference in performance was observed in the surveillance domain, where DeepSeek scored 4.82 and 4.92, compared to ChatGPT’s 4.17 and 4.14. Both models maintained high coherence, though DeepSeek demonstrated greater clinical precision.
Conclusion: DeepSeek’s superior performance likely stems from domain-specific training, suggesting its potential as a clinical decision support tool in gastroenterology. Limitations include use of static prompts, lack of real-world clinical validation, and evolving model behavior. While both models show utility, these findings support cautious adoption of domain-trained LLMs with continued validation, oversight, and patient safety at the forefront.
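The inter-rater reliability analysis described in the abstract (a Pearson correlation between the two oncologists' Likert ratings) can be sketched as follows. The rating values below are illustrative placeholders, not the study data, and the function is a minimal self-contained implementation rather than the authors' actual analysis code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 5-point Likert scores from two raters on the same responses
rater1 = [5, 4, 5, 3, 4, 5, 4, 3]
rater2 = [4, 4, 5, 4, 4, 5, 3, 4]
r = pearson_r(rater1, rater2)  # a value in [-1, 1]; the study reports R = 0.443
```

In practice one would also report a P value (e.g. via `scipy.stats.pearsonr`), as the abstract does with P < 0.01.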
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,391 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,257 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,685 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,501 citations