This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance of next-generation AI chatbots in gynecological knowledge assessment: a comparative pilot study of ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus
Citations: 0
Authors: 2
Year: 2026
Abstract
PURPOSE: As artificial intelligence (AI) models evolve into their next generations, their application in specialized medical fields requires rigorous validation. While large language models (LLMs) have shown promise in general medicine, their reliability in complex gynecological clinical reasoning remains underexplored. This pilot study aimed to comparatively assess the knowledge retention, safety, and reasoning limitations of advanced AI chatbots in gynecology using a constrained zero-shot multiple-choice question (MCQ) format.

METHODS: A total of 70 text-based MCQs covering seven core gynecological modules were adapted from "USMLE Step 1 Sample Test Questions". The questions were administered to four advanced AI models: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. To simulate a rapid-retrieval clinical scenario, models were tested under "zero-shot" conditions with a constrained prompt prohibiting reasoning steps. We performed both quantitative statistical analysis (Kruskal-Wallis, Cochran's Q) and qualitative error analysis to identify specific failure modes.

RESULTS: Contrary to expectations for advanced models, overall accuracy was unsatisfactory: Gemini-3 (32.86%), DeepSeek-V3.2 (30.00%), ChatGPT-5 (25.71%), and Claude-4.5-Opus (21.43%). Significant performance disparities were observed across modules. Notably, ChatGPT-5 scored 0.00% in Infertility, while DeepSeek-V3.2 reached 70.00% in Common Benign Conditions. Qualitative analysis revealed three critical failure patterns: (1) semantic association bias (confusing high-probability diseases with symptom-specific diagnoses), (2) spatial anatomy confusion, and (3) genetic logic reversal. No significant correlation was found between item difficulty and accuracy (p > 0.05).

CONCLUSION: Under constrained non-reasoning prompts, even next-generation AI chatbots demonstrate unsatisfactory performance in gynecology. The qualitative analysis suggests that models often rely on probabilistic keyword matching rather than physiological simulation, leading to theoretically dangerous clinical errors (e.g., misdiagnosing adrenal enzymes). While potential exists, current reliability is insufficient for unsupervised use in gynecological education. These findings highlight the critical need for "chain-of-thought" prompting and human expert oversight.
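The Cochran's Q test mentioned in the methods compares several paired binary outcomes (here: correct/incorrect answers from multiple models on the same question set). A minimal sketch of that computation, using a purely hypothetical correctness grid rather than the study's actual per-question data:

```python
# Sketch of a Cochran's Q test over a binary questions x models matrix.
# The answer grid below is HYPOTHETICAL illustration data, not results
# from the study described above.

def cochrans_q(matrix):
    """Cochran's Q statistic for paired binary outcomes.

    Rows are questions, columns are models; entries are 1 (correct)
    or 0 (incorrect). Under H0 (all models equally accurate), Q is
    approximately chi-square distributed with k - 1 degrees of freedom.
    """
    k = len(matrix[0])                              # number of models
    col_totals = [sum(col) for col in zip(*matrix)]  # successes per model
    row_totals = [sum(row) for row in matrix]        # successes per question
    total = sum(col_totals)                          # grand total of successes
    num = (k - 1) * (k * sum(g * g for g in col_totals) - total * total)
    den = k * total - sum(r * r for r in row_totals)
    return num / den

# Hypothetical 1 = correct / 0 = incorrect grid for 4 chatbots on 5 items.
answers = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

q = cochrans_q(answers)
# Chi-square critical value for df = k - 1 = 3 at alpha = 0.05 is 7.815.
print(f"Q = {q:.2f}, significant at 0.05: {q > 7.815}")
```

In practice one would use a library routine (e.g. `statsmodels`) on the full 70-question matrix and report the exact p-value rather than a critical-value cutoff; the hand-rolled version above just makes the statistic's structure explicit.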
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,553 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,444 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,943 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations