This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Citations: 0
Authors: 2
Year: 2025
Abstract
Background/purpose. Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared to human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.

Materials/methods. Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system, using a common rubric.

Results. Human inter-rater reliability was excellent (ICC(2,1) = .884; ICC(2,k), k = 2, = .938). In contrast, AI–human agreement was substantially weaker (AI vs Human-Z: ICC(2,1) = .406; ICC(2,k) = .578; AI vs Human-S: ICC(2,1) = .279; ICC(2,k) = .436). The AI consistently inflated scores by 2.71–3.32 points and compressed distributions, limiting its ability to discriminate across performance levels. Bland–Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results revealed significant inconsistency in AI performance: while the model failed to align with Human-S (κ = .017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = .367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that AI captured a narrower construct of quality than human raters.

Conclusion. These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value in formative feedback or administrative support if used transparently and under human oversight.
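The reliability coefficients reported above are two-way random-effects intraclass correlations. As a rough illustration of how a systematic score offset (like the AI's inflation) depresses single-rater agreement, here is a minimal pure-Python sketch of ICC(2,1) and ICC(2,k); the function name `icc2` and the toy data are illustrative assumptions, not taken from the study:

```python
def icc2(ratings):
    """Two-way random-effects ICC from a subjects x raters table.

    Returns (ICC(2,1), ICC(2,k)): single-rater and average-of-k
    agreement, computed from the standard ANOVA mean squares.
    """
    n = len(ratings)       # subjects
    k = len(ratings[0])    # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    # Mean squares: subjects (MSR), raters (MSC), residual (MSE).
    msr = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)
    msc = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)
    sse = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc2k = (msr - mse) / (msr + (msc - mse) / n)
    return icc21, icc2k

# Toy data: rater B scores every subject exactly 2 points higher than
# rater A. Rank order is preserved, yet the constant inflation lowers
# ICC(2,1) far more than ICC(2,k), which averages the bias away.
print(icc2([[1, 3], [2, 4], [3, 5], [4, 6]]))
```

In the same way, the AI's 2.71–3.32-point inflation penalizes its single-rater agreement (ICC(2,1)) even where relative ordering is partly preserved.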