This is an overview page with metadata for this scientific work. The full article is available from the publisher.
S3177 Evaluating Large Language Models for the Interpretation of ACG Guidelines on Premalignant Gastric Conditions: A Comparative Analysis of ChatGPT and DeepSeek
Citations: 0
Authors: 14
Year: 2025
Abstract
Introduction: Premalignant gastric conditions, such as intestinal metaplasia and atrophic gastritis, present a high risk of progression to gastric cancer if not managed according to evidence-based guidelines. The American College of Gastroenterology (ACG) guidelines published in 2025 provide recommendations on the diagnosis, surveillance, and management of premalignant gastric conditions. With the growing use of large language models (LLMs) like ChatGPT and DeepSeek by patients for medical advice, it is important to assess their accuracy in interpreting these guidelines. This study compares how well ChatGPT and DeepSeek responses align with the ACG recommendations on premalignant gastric conditions. Methods: We developed 40 questions based on the ACG guidelines addressing diagnosis, surveillance, and management. These were input into ChatGPT and DeepSeek. Two board-certified oncologists independently rated each response on accuracy, clarity, coherence, relevance, and completeness using a 5-point Likert scale. Scores were analyzed for inter-rater reliability and compared between models. Results: Inter-rater reliability was moderate but statistically significant (Pearson R = 0.443; P < 0.01). Overall, DeepSeek outperformed ChatGPT across all domains. Oncologist 1 scored DeepSeek 4.81 versus ChatGPT 4.62; Oncologist 2 scored DeepSeek 4.92 versus ChatGPT 4.61. For factual accuracy, DeepSeek scored 4.8 and 4.85, compared to ChatGPT’s 4.5 and 4.25. DeepSeek also performed better in clarity, coherence, relevance, and completeness (up to 4.95 vs ChatGPT’s 4.35–4.88). The most marked difference in performance was observed in the surveillance domain, where DeepSeek scored 4.82 and 4.92, compared to ChatGPT’s 4.17 and 4.14. Both models maintained high coherence, though DeepSeek demonstrated greater clinical precision.
Conclusion: DeepSeek’s superior performance likely stems from domain-specific training, suggesting its potential as a clinical decision support tool in gastroenterology. Limitations include use of static prompts, lack of real-world clinical validation, and evolving model behavior. While both models show utility, these findings support cautious adoption of domain-trained LLMs with continued validation, oversight, and patient safety at the forefront.
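The inter-rater reliability analysis described in the abstract (a Pearson correlation between the two oncologists' Likert ratings) can be sketched as follows. The rating values below are illustrative placeholders, not the study data, and the function is a minimal self-contained implementation rather than the authors' actual analysis code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 5-point Likert scores from two raters on the same responses
rater1 = [5, 4, 5, 3, 4, 5, 4, 3]
rater2 = [4, 4, 5, 4, 4, 5, 3, 4]
r = pearson_r(rater1, rater2)  # a value in [-1, 1]; the study reports R = 0.443
```

In practice one would also report a P value (e.g. via `scipy.stats.pearsonr`), as the abstract does with P < 0.01.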
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,391 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,257 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,685 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,501 citations