This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A Modular Framework for Business-Specific Benchmarking of Large Language Models
Citations: 0
Authors: 1
Year: 2025
Abstract
Large Language Models (LLMs) are being rapidly integrated into core business functions, creating a need for evaluation methods that measure performance against specific operational goals rather than the generic metrics typically used to evaluate them. The standard approaches are to develop fully custom benchmarks or to adapt standardized frameworks such as HELM. We propose a structured, hybrid methodology that integrates a robust LLM-as-a-Judge (LJ) pipeline into a business-aligned testing strategy. The framework consists of (1) a comprehensive six-stage methodology for designing, creating, and maintaining business-relevant benchmarks; (2) a modular reference architecture that operationalizes the methodology, automating evaluation to minimize human involvement; and (3) a practical validation of the framework through a case study comparing multiple LLMs on a financial advisory task. The results demonstrate that this approach not only provides a more accurate assessment of a model's fitness for a specific business purpose but also integrates evaluation into the Machine Learning Operations (MLOps) lifecycle, enabling continuous improvement and management.
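The paper's reference architecture is not reproduced on this overview page. As a rough illustration of the LLM-as-a-Judge scoring step described in the abstract, the following Python sketch shows how a judge model might grade candidate answers against expert references on a rubric; the `call_llm` callable, the `JudgedCase` fields, and the 1-5 rubric prompt are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List
import json
import statistics

# Hypothetical sketch of an LLM-as-a-Judge scoring step.
# `call_llm` stands in for any chat-completion client; it is assumed to
# take a prompt string and return the judge model's text response.

@dataclass
class JudgedCase:
    prompt: str            # business-specific test prompt
    reference: str         # reference answer from domain experts
    candidate: str         # answer produced by the model under test
    score: float = 0.0     # judge score on an assumed 1-5 rubric
    rationale: str = ""    # judge's short justification

JUDGE_PROMPT = """You are grading an answer for a financial advisory task.
Reference answer:
{reference}

Candidate answer:
{candidate}

Return JSON with keys "score" (1-5, where 5 is fully correct and compliant)
and "rationale" (one sentence)."""

def judge_case(case: JudgedCase, call_llm: Callable[[str], str]) -> JudgedCase:
    """Ask the judge model to grade one candidate answer against the reference."""
    raw = call_llm(JUDGE_PROMPT.format(reference=case.reference,
                                       candidate=case.candidate))
    verdict = json.loads(raw)
    case.score = float(verdict["score"])
    case.rationale = verdict["rationale"]
    return case

def benchmark(cases: List[JudgedCase], call_llm: Callable[[str], str]) -> float:
    """Score every case and report the mean judge score for the model under test."""
    scored = [judge_case(c, call_llm) for c in cases]
    return statistics.mean(c.score for c in scored)
```

In an automated MLOps setting, a function like `benchmark` would typically be invoked from a CI pipeline after each model or prompt change, with the aggregated score tracked over time; this is an assumed usage pattern, not detail taken from the paper.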
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,611 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,504 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,025 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,835 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations