OpenAlex · Updated hourly · Last updated: 21 May 2026, 10:16

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Benchmarking AI on Standard Chemistry Exams: LLMs Still Underperform Compared to High School Students

2026 · 0 citations · Journal of Science Education and Technology · Open Access

Citations: 0 · Authors: 6 · Year: 2026

Abstract

As Large Language Models (LLMs) become increasingly prevalent in science education, it is important to understand their capabilities compared to human learners with respect to authentic learning tasks. Such understanding is crucial for designing AI-resilient assessments and developing AI tutors that can guide students in problem solving. Using standardized assessments as benchmarks allows these comparisons to be based on widely accepted educational criteria. To date, most educational benchmarks have been developed and evaluated in English, with other languages receiving far less attention. The present study addresses this gap by introducing the first Hebrew science education benchmark, based on the national high-school matriculation exam in chemistry. We evaluated three LLMs – ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro – on 120 multiple-choice questions and compared their performance to that of over 139,000 high-school students. We found that all three LLMs significantly underperformed relative to human learners. To investigate characteristics that render questions more challenging for LLMs, we conducted a regression analysis and found that visual elements and multi-step reasoning tasks negatively impacted their performance. Finally, chemistry education experts analyzed the items that were most difficult for LLMs and characterized their domain-specific failures. This study makes three contributions: (1) it extends LLM evaluation to an underrepresented linguistic context; (2) it advances the methodological landscape of LLM benchmarking by directly comparing multiple models with human students on authentic, curriculum-aligned national examinations; and (3) it provides a mixed-methods analysis of LLM performance, offering a more educationally grounded characterization of current model capabilities.

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI)