OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 28.03.2026, 11:26

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

SteadyEval: Robust LLM Exam Graders via Adversarial Training and Distillation

2026·0 Zitationen·ComputersOpen Access
Volltext beim Verlag öffnen

0

Zitationen

5

Autoren

2026

Jahr

Abstract

Large language models (LLMs) are increasingly used as rubric-guided graders for short-answer exams, but their decisions can be unstable across prompts and vulnerable to answer-side prompt injection. In this paper, we study SteadyEval, a guardrailed exam-grading pipeline in which an adversarially trained LoRA filter (SteadyEval-7B-deep) preprocesses student answers to remove answer-side prompt injection, after which the original Mistral-7B-Instruct rubric-guided grader assigns the final score. We build two exam-grading pipelines on top of Mistral-7B-Instruct: a baseline pipeline that scores student answers directly, and a guardrailed pipeline in which a LoRA-based filter (SteadyEval-7B-deep) first removes injection content from the answer and a downstream grader then assigns the final score. Using two rubric-guided short-answer datasets in machine learning and computer networking, we generate grouped families of clean answers and four classes of answer-side attacks, and we evaluate the impact of these attacks on score shifts, attack success rates, stability across prompt variants, and alignment with human graders. On the pooled dataset, answer-side attacks inflate grades in the unguarded baseline by an average of about +1.2 points on a 1–10 scale, and substantially increase score dispersion across prompt variants. The guardrailed pipeline largely removes this systematic grade inflation and reduces instability for many items, especially in the machine-learning exam, while keeping mean absolute error with respect to human reference scores in a similar range to the unguarded baseline on clean answers, with a conservative shift in networking that motivates per-course calibration. Chief-panel comparisons further show that the guardrailed pipeline tracks human grading more closely on machine-learning items, but tends to under-score networking answers. These findings are best interpreted as a proof-of-concept guardrail and require per-course validation and calibration before operational use.

Ähnliche Arbeiten

Autoren

Institutionen

Themen

Adversarial Robustness in Machine LearningTopic ModelingArtificial Intelligence in Healthcare and Education
Volltext beim Verlag öffnen