This is an overview page with metadata for this scholarly article. The full text is available from the publisher.
Accuracy of Generative AI Chatbots in Answering Plastic Surgery Examination Questions: A Comparative Evaluation of ChatGPT‐4o, Gemini Advanced, and DeepSeek‐R1
Citations: 0 · Authors: 7 · Year: 2026
Abstract
With the ongoing advancement of generative artificial intelligence (AI), AI chatbots have been widely adopted across industries, including healthcare [1, 2] and medical education [3-5]. These chatbots demonstrate high accuracy in medical licensing examinations from multiple countries [6, 7] and provide reliable responses to medical queries across various specialties [8, 9]. Pilot studies suggest that their integration into medical curricula enhances student engagement, reduces faculty workload, and is broadly accepted by students [10, 11]. Some medical students have already shifted from traditional textbooks and search engines to chatbots as learning tools [12]. However, most AI chatbots are not specifically tailored to healthcare professionals or medical students: they may generate misleading information [3, 4], fabricate references [5], and lack standardized medical terminology. A comprehensive evaluation of their accuracy in plastic surgery is lacking, and their deviations from authoritative textbooks remain unexamined. Their reliability as learning tools for plastic surgery students therefore requires further assessment.

To address these concerns, we conducted a comparative accuracy evaluation of three leading AI chatbots, assessing their performance in answering plastic surgery examination questions. We first collected 221 subjective questions in plastic surgery from the postgraduate entrance examinations of multiple universities in China, spanning 2014 to 2024, and categorized them into 10 subspecialties. All questions were derived from publicly available materials and were used in accordance with applicable copyright and academic usage policies. Ten questions were randomly selected from each subspecialty (Table S1), and standard answers for the resulting 100 questions were derived from Grabb and Smith's Plastic Surgery (eighth edition) and Plastic Surgery (fourth edition). Three advanced AI chatbots (ChatGPT-4o, Gemini Advanced, and DeepSeek-R1) were instructed to generate answers to these 100 questions. The reference textbooks were English-language editions, and both the prompts submitted to the chatbots and the responses they generated were in English. To minimize bias, all queries were submitted from the same device between January 27 and 29, 2025, with each question asked only once.

Responses from each chatbot were blinded and independently rated for accuracy by two clinical professors against the standard reference answers. Disagreements were resolved through consensus discussion, and the final score for each response was the mean of the two evaluators' scores. Accuracy was rated on a 5-point Likert scale [13, 14], with 1 to 5 representing "very inaccurate," "inaccurate," "moderate," "accurate," and "very accurate," respectively. Statistical analyses were performed with IBM SPSS Statistics, version 24.0 (IBM Corp, Armonk, New York). The Shapiro–Wilk test was used to assess the normality of the scores, and all scores followed a normal distribution. One-way ANOVA (α = 0.05) was used to test score differences among the chatbots; if the overall comparison was statistically significant, pairwise comparisons were made using Tukey's HSD.
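The authors performed these analyses in IBM SPSS 24.0. Purely as an illustration, the same workflow (per-question scores averaged across the two evaluators, a Shapiro-Wilk normality check, one-way ANOVA at α = 0.05, and Tukey's HSD for pairwise follow-up) could be sketched in Python; the score data, library choices (scipy, statsmodels), and variable names below are illustrative assumptions, not the study's actual data or tooling.

```python
# Minimal sketch of the described analysis pipeline with made-up data;
# the real study used blinded expert ratings and IBM SPSS 24.0.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical per-question accuracy scores (1-5 Likert), one value per
# question, already averaged over the two evaluators, for each chatbot.
scores = {
    "ChatGPT-4o":      np.clip(rng.normal(4.0, 0.6, 100), 1, 5),
    "Gemini Advanced": np.clip(rng.normal(3.9, 0.6, 100), 1, 5),
    "DeepSeek-R1":     np.clip(rng.normal(4.2, 0.5, 100), 1, 5),
}

# 1) Shapiro-Wilk normality check for each chatbot's score distribution.
for name, s in scores.items():
    w, p = stats.shapiro(s)
    print(f"{name}: Shapiro-Wilk W={w:.3f}, p={p:.3f}")

# 2) One-way ANOVA across the three chatbots (alpha = 0.05).
f_stat, p_anova = stats.f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_anova:.3f}")

# 3) Tukey's HSD pairwise comparisons, only if the ANOVA is significant.
if p_anova < 0.05:
    all_scores = np.concatenate(list(scores.values()))
    groups = np.repeat(list(scores.keys()), [len(s) for s in scores.values()])
    print(pairwise_tukeyhsd(all_scores, groups, alpha=0.05))
```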
Across the 10 subspecialties, the mean scores of all chatbots exceeded 3.5 on the 5-point scale. Notably, DeepSeek-R1 achieved mean scores above 4.0 in all 10 subspecialties. However, no statistically significant differences in accuracy scores were observed among the three chatbots across the 10 subspecialties. All chatbots performed exceptionally well in three subspecialties: craniofacial surgery; eye plastic surgery; and fat transplantation, injection aesthetics, and hair restoration, with mean scores consistently exceeding 4.0 (Figure 1).

This study evaluated the accuracy of chatbot-generated responses to plastic surgery examination questions against authoritative textbook content. The examination questions covered all subspecialties within plastic surgery. All three chatbots achieved satisfactory accuracy across subspecialties, with DeepSeek-R1 reaching the "accurate" level in every subspecialty. This suggests that medical students can generally obtain reasonably accurate answers to plastic surgery questions from chatbots. However, responses concerning controversial or emerging academic perspectives remain unreliable and require verification through literature review and multi-source validation.

This study has several limitations. First, it exclusively evaluated the accuracy of AI chatbots on plastic surgery examination questions from Chinese universities; although these questions cover multiple subspecialties, their regional focus and limited number may leave peripheral knowledge unassessed. Second, while AI chatbots excel at interactive question answering and real-time feedback, this study used single-round questioning and therefore did not exercise their dynamic conversational capabilities. Third, although this study supports the reliability of AI chatbots as learning tools for plastic surgery students, it did not investigate tangible educational benefits, such as improvements in academic performance or clinical skills. Future research should include long-term and geographically broader studies of the benefits and drawbacks of AI chatbots as learning tools for plastic surgery students, and should examine the boundaries and ethical guidelines for integrating generative AI into medical education [15]. The development and optimization of specialized AI tools tailored to plastic surgery also remain urgent priorities.

Overall, the three AI chatbots demonstrated high accuracy in answering plastic surgery examination questions. AI chatbots can assist plastic surgery students in retrieving professional knowledge and are currently reliable learning tools for these students.

No part of the conception, design, execution, writing, or editing of this study was assisted by AI chatbots, except for the described use of AI chatbot responses as study material. This study was funded by the National Key R&D Program of China (No. 2024YFF1206400), the National Natural Science Foundation of China (No. 82372545), and the Science and Technology Projects in Guangzhou (No. 2023A03J1031). The authors declared no potential conflicts of interest with respect to the research, authorship, and publication of this article.

Please note: the publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,357 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,221 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,640 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,482 citations