OpenAlex · Updated hourly · Last updated: 16 May 2026, 23:53

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Using a Large Language Model (ChatGPT‐4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions: Interrater Agreement With Human Reviewers

2025 · 2 citations · Cochrane Evidence Synthesis and Methods · Open Access
Open full text at the publisher

Citations: 2 · Authors: 10 · Year: 2025

Abstract

Background: Risk of bias (RoB) assessment is a highly skilled task that is time-consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task-specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non-task-specific Internet-scale training sets. They demonstrate human-like abilities and might be able to support tasks like RoB assessment.

Methods: Following a published peer-reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID-19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual- or cluster-randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review-trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review-trial pairs to estimate human-ChatGPT agreement for "Overall RoB" (primary outcome) and "RoB due to the randomization process", and ChatGPT-ChatGPT (intrarater) agreement for "Overall RoB". We used ChatGPT-4o (February 2025) throughout.

Results: < 0.001).

Conclusions: ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or "hallucinating". The estimated agreement for "Overall RoB" is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM-based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.
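The abstract does not name the agreement statistic used; interrater agreement on ordinal RoB judgments (low / some concerns / high) is often summarized with a weighted Cohen's kappa, which penalizes near-miss disagreements less than distant ones. A minimal sketch of a linearly weighted kappa, with hypothetical ratings standing in for the study's human and ChatGPT assessments:

```python
from collections import Counter

def linear_weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two raters on ordinal categories.

    Weight for a (i, j) disagreement is |i - j| / (k - 1), so adjacent
    categories count half as much as the most distant pair when k = 3.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Observed weighted disagreement across paired ratings
    obs = sum(abs(idx[a] - idx[b]) / (k - 1) for a, b in zip(rater_a, rater_b))
    # Expected weighted disagreement if the raters were independent,
    # computed from each rater's marginal category counts
    ca, cb = Counter(rater_a), Counter(rater_b)
    exp = sum(
        ca[x] * cb[y] / n * abs(idx[x] - idx[y]) / (k - 1)
        for x in categories for y in categories
    )
    return 1.0 - obs / exp

# Hypothetical example: 6 trials rated by a human reviewer and by ChatGPT
cats = ["low", "some concerns", "high"]
human   = ["low", "high", "some concerns", "low", "high",          "low"]
chatgpt = ["low", "high", "low",           "low", "some concerns", "low"]
print(round(linear_weighted_kappa(human, chatgpt, cats), 3))  # → 0.625
```

Kappa is 1.0 for perfect agreement and 0.0 for chance-level agreement; the category names and ratings above are illustrative, not data from the study.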
