This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
CamelEval: Advancing Benchmarks for Arabic Language Models in Generative Tasks
Citations: 0
Authors: 6
Year: 2025
Abstract
Large Language Models (LLMs) serve as the foundation of contemporary artificial intelligence systems. Recently, a diverse range of Arabic-centric LLMs has emerged, and with them a variety of evaluation suites designed to assess the alignment of LLMs with the values and preferences of Arabic speakers and to assess their capabilities in instruction following, open-ended question answering, and information delivery. However, the majority of these suites rely exclusively on multiple-choice questions and thereby fail to adequately assess the text generation capabilities of LLMs. To address this shortcoming, we propose a new automated evaluation benchmark, CamelEval. CamelEval comprises three test suites that evaluate general instruction following, factuality, and cultural alignment. Each test suite contains 805 carefully curated, challenging test cases that reflect the nuances of the Arabic language and culture. We envision CamelEval as a tool to guide the development of future Arabic LLMs, serving over 400 million Arabic speakers by providing LLMs that not only communicate in their language but also understand their culture.