This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Autonomous Evaluation Framework for LLM Output Robustness
Citations: 0 · Authors: 1 · Year: 2025
Abstract
In recent years, large language models (LLMs) have achieved state-of-the-art performance across diverse natural language processing (NLP) tasks, yet persistent concerns remain about their robustness, reliability, and ability to generalize under adversarial or domain-shifted conditions. Conventional evaluation methods often depend on static benchmarks and costly human annotation, limiting scalability and reproducibility. This work introduces an autonomous evaluation framework that unifies benchmark-driven testing with automated adversarial probing for systematic, end-to-end assessment of LLM robustness. The framework is instantiated using the DeepSeek-7B-Chat and Mistral-7B-Instruct models and evaluated across three distinct QA benchmarks—SciQ (scientific), MedQA (medical), and FinQA (financial reasoning)—under both clean and perturbed conditions. Results reveal high robustness in the medical domain (ΔF1 ≈ 0) and minor but consistent degradation in financial reasoning (ΔF1 ≈ –0.0033), highlighting domain-dependent sensitivity to linguistic and numerical perturbations. Comparative analysis between DeepSeek and Mistral indicates consistent robustness trends across architectures, underscoring the generality of the proposed evaluation protocol. Overall, the framework captures semantic consistency, factual accuracy, and error resilience while reducing evaluation time from weeks to hours and improving coverage by over 30% relative to human-only baselines. Beyond empirical gains, it establishes a reproducible, domain- and model-agnostic methodology for robustness assessment, advancing standardized and scalable evaluation practices for reliable LLM deployment in high-stakes domains such as medicine and finance.
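The reported ΔF1 values compare answer-level F1 on clean versus perturbed inputs. As a rough illustration only (not the authors' implementation), a minimal sketch of such a robustness delta is given below, assuming standard token-overlap F1 for QA answers and the sign convention ΔF1 = F1(perturbed) − F1(clean), so that negative values indicate degradation.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Standard token-overlap F1 between a predicted and a reference answer.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, references):
    # Corpus-level F1 as the mean of per-example F1 scores.
    return sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(references)

def robustness_delta(clean_preds, perturbed_preds, references):
    # Assumed convention: delta = F1 on perturbed inputs minus F1 on clean inputs,
    # so a negative delta signals degradation under perturbation.
    return mean_f1(perturbed_preds, references) - mean_f1(clean_preds, references)

# Toy example: a small drop on perturbed inputs yields a negative delta.
refs = ["net income rose 5 percent"]
clean_preds = ["net income rose 5 percent"]
perturbed_preds = ["net income rose 4 percent"]
print(robustness_delta(clean_preds, perturbed_preds, refs))

The benchmark names, perturbation types, and aggregation details used in the paper are not reproduced here; this sketch only illustrates how a clean-versus-perturbed F1 comparison can be computed.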
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations