OpenAlex · Updated hourly · Last updated: 16 May 2026, 07:23

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

A systematic comparison of ChatGPT and DeepSeek for guideline-based question answering in obstetric anesthesia

2026 · 0 citations · Scientific Reports · Open Access
Open full text at the publisher

Citations: 0 · Authors: 5 · Year: 2026

Abstract

Large language models (LLMs) are increasingly applied in clinical medicine, but their performance in protocol-driven fields such as anesthesiology remains underexplored. Obstetric anesthesia demands timely and accurate decision-making, where adherence to established guidelines is essential. This study investigates how prompting strategies and model architectures influence LLM performance in a high-stakes clinical domain. The objective was to evaluate four LLMs—ChatGPT-4o, ChatGPT-4o-mini, DeepSeek-V3, and DeepSeek-R1—on guideline-based questions in obstetric anesthesia using three distinct prompting strategies. Eleven clinical questions derived from the 2019 ACOG Practice Bulletin No. 209 were posed to each model using three prompting strategies: Isolated Prompting (IP), Batch Prompting (BP), and Contextual Isolated Prompting (CIP). Responses were rated across four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Completeness) on a 5-point Likert scale, and assessed for readability using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analyses included ANOVA and post hoc comparisons. No significant differences were found across models for Accuracy, Overconclusiveness, or Completeness (p > 0.05). However, Supplementary Value differed significantly (p = 0.008), with ChatGPT-4o under CIP outperforming DeepSeek-V3 under IP (p = 0.021). ChatGPT-4o demonstrated the highest overall readability (lowest FKGL), while ChatGPT-4o-mini's readability improved significantly under CIP. DeepSeek-V3 under BP outperformed DeepSeek-R1 under IP in FKGL scores (p = 0.0294). LLMs demonstrate comparable core clinical accuracy in obstetric anesthesia tasks, with ChatGPT-4o offering the most readable and context-rich responses. Prompting strategy, especially CIP, enhances response quality. These findings support the potential of LLMs as clinical aids, contingent on thoughtful prompt design and domain validation.
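The two readability metrics named in the abstract are standard closed-form formulas over words-per-sentence and syllables-per-word. A minimal sketch of both, using a naive vowel-group heuristic for syllable counting (an assumption; the study does not specify which syllable counter it used):

```python
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of consecutive vowels; at least 1 per word.
    # This is a rough heuristic, not the study's actual counter.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    wps = len(words) / len(sentences)                      # words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

Higher FRE means easier text, while lower FKGL means a lower (easier) U.S. school grade level, which is why the abstract reports ChatGPT-4o's "lowest FKGL" as the best readability result.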

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Simulation-Based Education in Healthcare