OpenAlex · Updated hourly · Last updated: 28.03.2026, 09:36

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Autonomous Evaluation Framework for LLM Output Robustness

2025 · 0 citations · 1 author

Abstract

In recent years, large language models (LLMs) have achieved state-of-the-art performance across diverse natural language processing (NLP) tasks, yet concerns persist about their robustness, reliability, and ability to generalize under adversarial or domain-shifted conditions. Conventional evaluation methods often depend on static benchmarks and costly human annotation, which limits scalability and reproducibility. This work introduces an autonomous evaluation framework that unifies benchmark-driven testing with automated adversarial probing for systematic, end-to-end assessment of LLM robustness. The framework is instantiated using the DeepSeek-7B-Chat and Mistral-7B-Instruct models and evaluated under both clean and perturbed conditions on three QA benchmarks: SciQ (scientific), MedQA (medical), and FinQA (financial reasoning). Results reveal high robustness in the medical domain (ΔF1 ≈ 0) and minor but consistent degradation in financial reasoning (ΔF1 ≈ −0.0033), highlighting domain-dependent sensitivity to linguistic and numerical perturbations. Comparative analysis between DeepSeek and Mistral indicates consistent robustness trends across architectures, underscoring the generality of the proposed evaluation protocol. Overall, the framework captures semantic consistency, factual accuracy, and error resilience while reducing evaluation time from weeks to hours and improving coverage by over 30% relative to human-only baselines. Beyond empirical gains, it establishes a reproducible, domain- and model-agnostic methodology for robustness assessment, advancing standardized and scalable evaluation practices for reliable LLM deployment in high-stakes domains such as medicine and finance.
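The ΔF1 values quoted in the abstract compare scores under clean and perturbed conditions. The abstract does not define the metric, so the following is only a minimal sketch under stated assumptions: F1 is taken to be a token-overlap F1 between a predicted and a reference answer, and ΔF1 is taken to be the mean perturbed F1 minus the mean clean F1, so that a negative value means degradation. The function names `token_f1`, `mean_f1`, and `delta_f1` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (assumed metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    """Average token-level F1 over paired predictions and references."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


def delta_f1(clean_preds, perturbed_preds, references):
    """Assumed sign convention: perturbed minus clean (negative = degradation)."""
    return mean_f1(perturbed_preds, references) - mean_f1(clean_preds, references)
```

Under this convention, a model whose answers are unchanged by perturbation yields ΔF1 = 0, and answers that drift from the reference after perturbation yield a negative ΔF1.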

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Adversarial Robustness in Machine Learning