This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Citations: 0
Authors: 2
Year: 2025
Abstract
Background/purpose. Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared to human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.

Materials/methods. Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system, using a common rubric.

Results. Human inter-rater reliability was excellent (ICC(2,1) = .884; ICC(2,k), k = 2, = .938). In contrast, AI–human agreement was substantially weaker (AI vs Human-Z: ICC(2,1) = .406; ICC(2,k) = .578; AI vs Human-S: ICC(2,1) = .279; ICC(2,k) = .436). The AI consistently inflated scores by 2.71–3.32 points and compressed distributions, limiting its ability to discriminate across performance levels. Bland–Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results revealed significant inconsistency in AI performance: while the model failed to align with Human-S (κ = .017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = .367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that AI captured a narrower construct of quality than human raters.

Conclusion. These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value in formative feedback or administrative support if used transparently and under human oversight.
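The reliability coefficients reported above are two-way random-effects intraclass correlations. As a rough illustration of how a systematic score offset (like the AI's inflation) depresses single-rater agreement, here is a minimal pure-Python sketch of ICC(2,1) and ICC(2,k); the function name `icc2` and the toy data are illustrative assumptions, not taken from the study:

```python
def icc2(ratings):
    """Two-way random-effects ICC from a subjects x raters table.

    Returns (ICC(2,1), ICC(2,k)): single-rater and average-of-k
    agreement, computed from the standard ANOVA mean squares.
    """
    n = len(ratings)       # subjects
    k = len(ratings[0])    # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    # Mean squares: subjects (MSR), raters (MSC), residual (MSE).
    msr = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)
    msc = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)
    sse = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc2k = (msr - mse) / (msr + (msc - mse) / n)
    return icc21, icc2k

# Toy data: rater B scores every subject exactly 2 points higher than
# rater A. Rank order is preserved, yet the constant inflation lowers
# ICC(2,1) far more than ICC(2,k), which averages the bias away.
print(icc2([[1, 3], [2, 4], [3, 5], [4, 6]]))
```

In the same way, the AI's 2.71–3.32-point inflation penalizes its single-rater agreement (ICC(2,1)) even where relative ordering is partly preserved.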