OpenAlex · Updated hourly · Last updated: 18.05.2026, 14:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy

2026 · 2 citations · PLoS ONE · Open Access

Citations: 2 · Authors: 5 · Year: 2026

Abstract

INTRODUCTION: While large language models (LLMs) are used to generate medical and dental multiple-choice questions (MCQs), their alignment with Bloom's Taxonomy remains unexplored.

MATERIALS AND METHODS: Five widely used LLMs were evaluated: ChatGPT-4o (OpenAI), Copilot Pro (Microsoft), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), and DeepSeek R1 (DeepSeek). Each model generated 60 MCQs (300 in total) based on content from an oral and maxillofacial anatomy textbook across five cognitive levels of Bloom's Taxonomy. Two independent investigators assessed each item on a 5-point Likert scale for remembering, understanding, applying, analyzing, and evaluating/creating. Inter-rater reliability was measured using weighted Cohen's kappa. Model performance and inter-model differences were analyzed using the Kruskal-Wallis test.

RESULTS: Inter-rater reliability was moderate to strong (kappa = 0.74-0.86). Median scores for remembering, understanding, applying, and evaluating/creating were above 4 across all LLMs, while the analyzing level scored a median of 3.5 for ChatGPT-4o and DeepSeek R1. No significant difference was found between models at the remembering and understanding levels (p > 0.05). Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively). Within-model analysis showed that only Copilot Pro and Claude Sonnet 4 consistently aligned with Bloom's cognitive levels across all categories. In contrast, ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels (p = 0.00, 0.00, and 0.001, respectively).

CONCLUSIONS: All LLMs performed well at lower cognitive levels, while Claude Sonnet 4 achieved the highest alignment at higher-order levels.
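The two statistics named in the methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' analysis code. The abstract does not say whether linear or quadratic kappa weights were used, so the weighting is left as a parameter that defaults to linear, which is an assumption. The function names and the ratings in the demo are hypothetical and are not data from the study.

```python
from collections import Counter


def weighted_kappa(r1, r2, categories=(1, 2, 3, 4, 5), power=1):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    power=1 gives linear weights, power=2 quadratic weights.
    (Assumption: the abstract does not state which scheme was used.)
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    observed = Counter((idx[a], idx[b]) for a, b in zip(r1, r2))
    m1 = Counter(idx[a] for a in r1)  # marginal counts, rater 1
    m2 = Counter(idx[b] for b in r2)  # marginal counts, rater 2
    num = 0.0  # observed weighted disagreement
    den = 0.0  # chance-expected weighted disagreement
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[(i, j)] / n
            den += w * (m1[i] / n) * (m2[j] / n)
    if den == 0:
        raise ValueError("kappa is undefined: expected disagreement is zero")
    return 1.0 - num / den


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction (no p-value)."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    ranks = {}    # value -> mean rank shared by all tied observations
    tie_term = 0  # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    rank_sum_term = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("H is undefined when all values are identical")
    return h / correction


if __name__ == "__main__":
    # Hypothetical Likert ratings from two raters for eight items.
    rater_a = [5, 4, 4, 3, 5, 4, 2, 5]
    rater_b = [5, 4, 3, 3, 5, 5, 2, 4]
    print("weighted kappa:", round(weighted_kappa(rater_a, rater_b), 3))

    # Hypothetical scores for three models at one cognitive level.
    print("H:", round(kruskal_wallis_h([5, 4, 5, 4], [3, 4, 3, 4], [5, 5, 4, 5]), 3))
```

To turn H into a p-value, it is compared against a chi-square distribution with (number of groups − 1) degrees of freedom; a statistics package would normally handle that step.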

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Anatomy and Medical Technology