This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Assessing Probability Rating Performance for Large Language Models in Systematic Literature Review Automation
Citations: 0
Authors: 3
Year: 2026
Abstract
Classical machine learning methods for systematic literature review (SLR) automation use thresholded probability scores to decide whether to include or exclude articles in the title/abstract selection process. Large language models (LLMs) have gained popularity in SLR automation and are sometimes prompted to output probability scores similar to those of classical models. LLMs do not always perform well in mathematical reasoning tasks, which raises the question of whether their probability outputs are meaningful. We evaluated the accuracy of four LLMs in title/abstract screening, using a qualitative rating (Yes/Unsure/No) and a continuous probability rating (0-1). We also assessed the consistency between probability and qualitative ratings within each model using violin plots. The results show that LLMs performed well in excluding irrelevant articles, but their performance dropped notably for including relevant articles. Claude Sonnet 4 and Phi 4 produced consistent probability and qualitative ratings. GPT 4o-mini did not represent probabilities correctly and Qwen 2.5 underperformed in the qualitative category rating. Due to the varying performance of LLMs, we recommend using multiple LLMs to assist human reviewers in the title/abstract screening process of SLRs. Furthermore, we advise testing whether the output of LLMs is meaningful before using them to evaluate articles.
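The two checks named in the abstract, thresholded screening decisions and consistency between probability and qualitative ratings, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records are invented (they are not data from the paper), the 0.5 threshold is arbitrary, and the function names are hypothetical. The paper compares the two rating types with violin plots; the sketch substitutes a simpler check, testing whether the mean probability rises from "No" through "Unsure" to "Yes".

```python
from statistics import mean

# Hypothetical screening records: (true_label, llm_probability, llm_rating).
# true_label is 1 for a relevant article and 0 for an irrelevant one.
# These values are invented for illustration, not taken from the paper.
records = [
    (1, 0.92, "Yes"),
    (1, 0.55, "Unsure"),
    (1, 0.30, "No"),
    (1, 0.85, "Yes"),
    (0, 0.10, "No"),
    (0, 0.05, "No"),
    (0, 0.60, "Unsure"),
    (0, 0.20, "No"),
]


def sensitivity_specificity(rows, threshold=0.5):
    """Include an article when its probability is >= threshold, then
    return (sensitivity, specificity) against the true labels."""
    tp = fn = tn = fp = 0
    for label, prob, _rating in rows:
        included = prob >= threshold
        if label == 1:
            if included:
                tp += 1
            else:
                fn += 1
        else:
            if included:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def mean_probability_by_rating(rows):
    """Group probabilities by qualitative rating and average each group."""
    groups = {}
    for _label, prob, rating in rows:
        groups.setdefault(rating, []).append(prob)
    return {rating: mean(probs) for rating, probs in groups.items()}


def ratings_are_consistent(means):
    """True when the mean probability strictly increases No < Unsure < Yes."""
    order = ["No", "Unsure", "Yes"]
    values = [means[r] for r in order if r in means]
    return all(a < b for a, b in zip(values, values[1:]))


sens, spec = sensitivity_specificity(records, threshold=0.5)
means = mean_probability_by_rating(records)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
print("mean probability by rating:", means)
print("consistent ordering:", ratings_are_consistent(means))
```

A model whose mean probabilities do not rise across the three categories would fail this check, which is one simple way to test whether a model's probability output is meaningful before relying on it for screening.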