OpenAlex · Updated hourly · Last updated: 7 April 2026, 03:32

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing Probability Rating Performance for Large Language Models in Systematic Literature Review Automation

2026 · 0 citations · 3 authors

Abstract

Classical machine learning methods for systematic literature review (SLR) automation use thresholded probability scores to decide whether to include or exclude articles in the title/abstract selection process. Large language models (LLMs) have gained popularity in SLR automation and are sometimes prompted to produce probability scores similar to those of classical models. LLMs do not always perform well in mathematical reasoning tasks, which raises the question of whether their probability outputs are meaningful. We evaluated the accuracy of four LLMs in title/abstract screening, using a qualitative rating (Yes/Unsure/No) and a continuous probability rating (0-1). We also assessed the consistency between probability and qualitative ratings within each model using violin plots. The results show that LLMs performed well in excluding irrelevant articles, but their performance dropped notably for including relevant articles. Claude Sonnet 4 and Phi 4 produced consistent probability and qualitative ratings. GPT 4o-mini did not represent probabilities correctly, and Qwen 2.5 underperformed in the qualitative category rating. Due to the varying performance of LLMs, we recommend using multiple LLMs to assist human reviewers in the title/abstract screening process of SLRs. Furthermore, we advise testing whether the output of LLMs is meaningful before using them to evaluate articles.
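
The two ideas in the abstract, thresholding a probability score and checking whether probability ratings agree with qualitative ratings, can be illustrated with a minimal sketch. This is not the authors' method (they used violin plots); it is a simpler numeric check under stated assumptions. The ratings, the 0.5 threshold, and the function names `decide`, `mean_by_label`, and `is_ordered` are all hypothetical.

```python
from statistics import mean

# Hypothetical ratings from one model: (qualitative label, probability of inclusion).
# These values are invented for illustration and are not from the paper.
ratings = [
    ("Yes", 0.92), ("Yes", 0.81), ("Unsure", 0.55),
    ("Unsure", 0.40), ("No", 0.10), ("No", 0.05),
]


def decide(probability, threshold=0.5):
    """Include an article when its score reaches the (assumed) threshold."""
    return probability >= threshold


def mean_by_label(pairs):
    """Average probability for each qualitative label."""
    groups = {}
    for label, prob in pairs:
        groups.setdefault(label, []).append(prob)
    return {label: mean(probs) for label, probs in groups.items()}


def is_ordered(means):
    """True when mean probabilities rise from No to Unsure to Yes."""
    return means["No"] < means["Unsure"] < means["Yes"]


means = mean_by_label(ratings)
decisions = [decide(p) for _, p in ratings]
print(means)
print("consistent ordering:", is_ordered(means))
print("include decisions:", decisions)
```

A model whose mean probabilities do not rise from No through Unsure to Yes would fail this check, which is one simple way to test whether its probability output is meaningful before relying on it.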

Topics

Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Computational and Text Analysis Methods