This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance of next-generation AI chatbots in gynecological knowledge assessment: a comparative pilot study of ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus
Citations: 0
Authors: 2
Year: 2026
Abstract
PURPOSE: As artificial intelligence (AI) models evolve into their next generations, their application in specialized medical fields requires rigorous validation. While large language models (LLMs) have shown promise in general medicine, their reliability in complex gynecological clinical reasoning remains underexplored. This pilot study aimed to comparatively assess the knowledge retention, safety, and reasoning limitations of advanced AI chatbots in gynecology using a constrained zero-shot multiple-choice question (MCQ) format.

METHODS: A total of 70 text-based MCQs covering seven core gynecological modules were adapted from "USMLE Step 1 Sample Test Questions". The questions were administered to four advanced AI models: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. To simulate a rapid-retrieval clinical scenario, models were tested under "zero-shot" conditions with a constrained prompt prohibiting reasoning steps. We performed both quantitative statistical analysis (Kruskal-Wallis, Cochran's Q) and qualitative error analysis to identify specific failure modes.

RESULTS: Contrary to expectations for advanced models, overall accuracy was unsatisfactory: Gemini-3 (32.86%), DeepSeek-V3.2 (30.00%), ChatGPT-5 (25.71%), and Claude-4.5-Opus (21.43%). Significant performance disparities were observed across modules. Notably, ChatGPT-5 scored 0.00% in Infertility, while DeepSeek-V3.2 reached 70.00% in Common Benign Conditions. Qualitative analysis revealed three critical failure patterns: (1) semantic association bias (confusing high-probability diseases with symptom-specific diagnoses), (2) spatial anatomy confusion, and (3) genetic logic reversal. No significant correlation was found between item difficulty and accuracy (p > 0.05).

CONCLUSION: Under constrained non-reasoning prompts, even next-generation AI chatbots demonstrate unsatisfactory performance in gynecology. The qualitative analysis suggests that models often rely on probabilistic keyword matching rather than physiological simulation, leading to theoretically dangerous clinical errors (e.g., misdiagnosing adrenal enzymes). While potential exists, current reliability is insufficient for unsupervised use in gynecological education. These findings highlight the critical need for "chain-of-thought" prompting and human expert oversight.
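The Cochran's Q test mentioned in the methods compares several paired binary outcomes (here: correct/incorrect answers from multiple models on the same question set). A minimal sketch of that computation, using a purely hypothetical correctness grid rather than the study's actual per-question data:

```python
# Sketch of a Cochran's Q test over a binary questions x models matrix.
# The answer grid below is HYPOTHETICAL illustration data, not results
# from the study described above.

def cochrans_q(matrix):
    """Cochran's Q statistic for paired binary outcomes.

    Rows are questions, columns are models; entries are 1 (correct)
    or 0 (incorrect). Under H0 (all models equally accurate), Q is
    approximately chi-square distributed with k - 1 degrees of freedom.
    """
    k = len(matrix[0])                              # number of models
    col_totals = [sum(col) for col in zip(*matrix)]  # successes per model
    row_totals = [sum(row) for row in matrix]        # successes per question
    total = sum(col_totals)                          # grand total of successes
    num = (k - 1) * (k * sum(g * g for g in col_totals) - total * total)
    den = k * total - sum(r * r for r in row_totals)
    return num / den

# Hypothetical 1 = correct / 0 = incorrect grid for 4 chatbots on 5 items.
answers = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

q = cochrans_q(answers)
# Chi-square critical value for df = k - 1 = 3 at alpha = 0.05 is 7.815.
print(f"Q = {q:.2f}, significant at 0.05: {q > 7.815}")
```

In practice one would use a library routine (e.g. `statsmodels`) on the full 70-question matrix and report the exact p-value rather than a critical-value cutoff; the hand-rolled version above just makes the statistic's structure explicit.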
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,553 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,444 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,943 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations