OpenAlex · Updated hourly · Last updated: 16 May 2026, 23:53

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Using a Large Language Model (ChatGPT‐4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions: Interrater Agreement With Human Reviewers

2025 · 2 citations · Cochrane Evidence Synthesis and Methods · Open Access
Open full text at the publisher

Citations: 2 · Authors: 10 · Year: 2025

Abstract

Background: Risk of bias (RoB) assessment is a highly skilled task that is time-consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task-specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non-task-specific Internet-scale training sets. They demonstrate human-like abilities and might be able to support tasks like RoB assessment.

Methods: Following a published peer-reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID-19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual- or cluster-randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review-trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review-trial pairs to estimate human-ChatGPT agreement for "Overall RoB" (primary outcome) and "RoB due to the randomization process", and ChatGPT-ChatGPT (intrarater) agreement for "Overall RoB". We used ChatGPT-4o (February 2025) throughout.

Results: < 0.001).

Conclusions: ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or "hallucinating". The estimated agreement for "Overall RoB" is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM-based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.
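The abstract does not name the agreement statistic used; interrater agreement on ordinal RoB judgments (low / some concerns / high) is often summarized with a weighted Cohen's kappa, which penalizes near-miss disagreements less than distant ones. A minimal sketch of a linearly weighted kappa, with hypothetical ratings standing in for the study's human and ChatGPT assessments:

```python
from collections import Counter

def linear_weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two raters on ordinal categories.

    Weight for a (i, j) disagreement is |i - j| / (k - 1), so adjacent
    categories count half as much as the most distant pair when k = 3.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Observed weighted disagreement across paired ratings
    obs = sum(abs(idx[a] - idx[b]) / (k - 1) for a, b in zip(rater_a, rater_b))
    # Expected weighted disagreement if the raters were independent,
    # computed from each rater's marginal category counts
    ca, cb = Counter(rater_a), Counter(rater_b)
    exp = sum(
        ca[x] * cb[y] / n * abs(idx[x] - idx[y]) / (k - 1)
        for x in categories for y in categories
    )
    return 1.0 - obs / exp

# Hypothetical example: 6 trials rated by a human reviewer and by ChatGPT
cats = ["low", "some concerns", "high"]
human   = ["low", "high", "some concerns", "low", "high",          "low"]
chatgpt = ["low", "high", "low",           "low", "some concerns", "low"]
print(round(linear_weighted_kappa(human, chatgpt, cats), 3))  # → 0.625
```

Kappa is 1.0 for perfect agreement and 0.0 for chance-level agreement; the category names and ratings above are illustrative, not data from the study.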
