OpenAlex · Updated hourly · Last updated: April 3, 2026, 01:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluation of three artificial intelligence chatbots for generating clinical hematology multiple choice questions for medical students

2026 · 0 citations · 9 authors · Scientific Reports · Open Access

Abstract

The integration of artificial intelligence (AI) into medical education has shown promise in streamlining content creation, yet the reliability and validity of AI-generated assessments remain critical concerns. This study quantitatively evaluates and compares three AI models (ChatGPT, Perplexity, and DeepSeek) in generating hematology multiple-choice questions (MCQs), focusing on content validity, alignment with clinical guidelines, cognitive level, and expert acceptance, to determine their practical utility in medical education.

Each model was prompted to generate 50 MCQs across five key hematology topics, following standardized instructions emphasizing guideline alignment and cognitive diversity. Three hematology experts, blinded to question source, independently rated all 150 MCQs on accuracy, clinical relevance, clarity, distractor plausibility, and overall quality using a structured rubric. Scores were averaged per model, questions were categorized by Bloom's taxonomy level, and acceptance was defined as a total score ≥ 15 out of 25.

DeepSeek achieved the highest scores for accuracy (4.7 ± 0.4), clinical relevance (4.8 ± 0.3), and distractor plausibility (4.7 ± 0.4), with a 100% acceptance rate and no need for revision. Perplexity and ChatGPT also produced clinically relevant questions but required minor revisions (acceptance rates: 96% and 90%, respectively). All models favored higher-order cognitive questions; knowledge- and comprehension-level questions were limited across all models.

AI models, particularly DeepSeek, can efficiently generate high-quality, clinically relevant hematology MCQs suitable for medical education and assessment. While DeepSeek demonstrated superior reliability and required minimal expert revision, all models underrepresented foundational knowledge questions and lacked autonomous image-based item generation. Hybrid human-AI workflows and targeted prompt engineering are recommended to optimize cognitive coverage and ensure educational rigor.
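To make the rubric arithmetic concrete, here is a minimal sketch of the acceptance computation described above. It assumes each of the five criteria is scored 1-5 by each of the three blinded experts and that expert ratings are averaged per criterion before summing to a total out of 25; the criterion keys, data layout, and sample ratings are illustrative, not taken from the paper.

```python
# Sketch of the scoring rubric from the abstract (assumptions noted above):
# five criteria, each averaged across three expert ratings, summed to a
# total out of 25; a question is "accepted" if its total is >= 15.

from statistics import mean

CRITERIA = ["accuracy", "clinical_relevance", "clarity",
            "distractor_plausibility", "overall_quality"]

def question_total(expert_scores):
    """expert_scores: dict mapping each criterion to three expert ratings."""
    return sum(mean(expert_scores[c]) for c in CRITERIA)

def acceptance_rate(questions):
    """questions: list of expert_scores dicts for one model's 50 MCQs."""
    accepted = sum(1 for q in questions if question_total(q) >= 15)
    return accepted / len(questions)

# Hypothetical example: one question rated by three experts.
q = {
    "accuracy": [5, 4, 5],
    "clinical_relevance": [5, 5, 4],
    "clarity": [4, 4, 5],
    "distractor_plausibility": [5, 4, 4],
    "overall_quality": [5, 5, 5],
}
print(f"total = {question_total(q):.2f} / 25")  # 23.00 -> accepted
```

Under this reading, a model's acceptance rate is simply the fraction of its 50 questions clearing the 15/25 threshold, which matches the 100%, 96%, and 90% figures reported for DeepSeek, Perplexity, and ChatGPT.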

Similar works