Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

ENTERPRISE LARGE LANGUAGE MODEL EVALUATION BENCHMARK

2025·1 ZitationenOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom’s Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-aJudge, and corrective retrieval-augmented generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model optimization. This work provides enterprises a blueprint for tailored evaluations and advances practical LLM deployment.

Autoren

Institutionen

Atlassian (United Kingdom)(GB)

Themen

Artificial Intelligence in Healthcare and EducationTopic ModelingExplainable Artificial Intelligence (XAI)

Volltext beim Verlag öffnen

ENTERPRISE LARGE LANGUAGE MODEL EVALUATION BENCHMARK

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen