This is an overview page with metadata for this scientific article. The full article is available from the publisher.
A systematic comparison of ChatGPT and DeepSeek for guideline-based question answering in obstetric anesthesia
Citations: 0
Authors: 5
Year: 2026
Abstract
Large language models (LLMs) are increasingly applied in clinical medicine, but their performance in protocol-driven fields such as anesthesiology remains underexplored. Obstetric anesthesia demands timely, accurate decision-making in which adherence to established guidelines is essential. This study investigates how prompting strategies and model architectures influence LLM performance in a high-stakes clinical domain, evaluating four models (ChatGPT-4o, ChatGPT-4o-mini, DeepSeek-V3, and DeepSeek-R1) on guideline-based questions in obstetric anesthesia under three distinct prompting strategies.

Eleven clinical questions derived from the 2019 ACOG Practice Bulletin No. 209 were posed to each model using three prompting strategies: Isolated Prompting (IP), Batch Prompting (BP), and Contextual Isolated Prompting (CIP). Responses were rated on four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Completeness) using a 5-point Likert scale and assessed for readability using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analyses included ANOVA and post hoc comparisons.

No significant differences were found across models for Accuracy, Overconclusiveness, or Completeness (p > 0.05). Supplementary Value, however, differed significantly (p = 0.008), with ChatGPT-4o under CIP outperforming DeepSeek-V3 under IP (p = 0.021). ChatGPT-4o demonstrated the highest overall readability (lowest FKGL), while ChatGPT-4o-mini's readability improved significantly under CIP. DeepSeek-V3 under BP outperformed DeepSeek-R1 under IP in FKGL scores (p = 0.0294).

LLMs demonstrate comparable core clinical accuracy on obstetric anesthesia tasks, with ChatGPT-4o offering the most readable and context-rich responses. Prompting strategy, especially CIP, enhances response quality. These findings support the potential of LLMs as clinical aids, contingent on thoughtful prompt design and domain validation.
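The readability metrics named in the abstract, FRE and FKGL, are computed with standard published formulas over words-per-sentence and syllables-per-word ratios. The sketch below illustrates those formulas; it is not the authors' evaluation code, and the vowel-group syllable counter is a rough heuristic of my own, so its syllable counts only approximate those of dictionary-based tools.

```python
import re


def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per run of consecutive vowels.
    # Real tools use dictionaries or more careful rules.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for `text` using the standard formulas."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences      # average words per sentence
    spw = n_syllables / n_words    # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

Higher FRE means easier text, while lower FKGL means a lower (easier) reading grade level, which is why the abstract reports ChatGPT-4o's high readability as the "lowest FKGL".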
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,687 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,591 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,114 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,867 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations