OpenAlex · Updated hourly · Last update: 17.05.2026, 01:43

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large language models for risk-of-bias assessment in randomised clinical trials—a comparative validation study

2026 · 0 citations · EBioMedicine · Open Access
Open full text at the publisher

0 Citations · 8 Authors · Year 2026

Abstract

BACKGROUND: Large language models (LLMs) are emerging tools for evidence synthesis. Risk of bias (RoB) assessment of trials remains an essential but time-consuming step that is inconsistent even amongst experts. Early LLM studies showed mixed reliability. Advances in reasoning-enabled models warrant evaluation of their accuracy and consistency for RoB screening across randomised trials to reduce reviewer workload. METHODS: -score). FINDINGS: For RoB 1, interobserver agreement ranged from κ 0.27 (95% CI 0.07-0.46) with Gemini Flash 2.0 to κ 0.39 (0.20-0.59) with DeepSeek v3. For RoB 2, agreement was lower, from κ 0.06 (-0.07 to 0.18) with ChatGPT o3 to κ 0.13 (-0.04 to 0.31) with Gemini. Diagnostic performance was limited, with sensitivity ranging from 0.05 to 0.55, specificity 0.78-0.99, PPV 0.31-0.50, and NPV 0.48-0.61 across models, with models consistently over-flagging concerns. INTERPRETATION: None of the evaluated LLMs were sufficiently reliable for fully autonomous RoB assessment. DeepSeek v3 and ChatGPT o3 approximated human performance best on RoB 1, but RoB 2 rule-in and rule-out performance remained modest. Current use should be supervised, with possible application of LLMs for triage or as a second assessor. Major improvements in protocol retrieval, task-specific tuning, and calibrated thresholds, prospectively validated, are needed for safe stand-alone deployment. FUNDING: This study received no financial support.
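The metrics in the abstract (Cohen's κ for interobserver agreement; sensitivity, specificity, PPV, and NPV for diagnostic performance) follow standard definitions. A minimal sketch of how such figures are computed, with illustrative labels and counts that are not taken from the study itself:

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each rater's label marginals.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed proportion of items where both raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled independently
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)


def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV from a 2x2 confusion table,
    e.g. treating a 'high risk of bias' flag as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),  # rule-in: flagged among truly biased
        "specificity": tn / (tn + fp),  # rule-out: cleared among unbiased
        "ppv": tp / (tp + fp),          # flagged items that are truly biased
        "npv": tn / (tn + fn),          # cleared items that are truly unbiased
    }
```

The "over-flagging" pattern reported in the findings corresponds to a table with many false positives: specificity stays high only when truly low-risk trials dominate, while PPV drops toward the 0.31-0.50 range observed.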

Similar works