OpenAlex · Updated hourly · Last updated: 08.05.2026, 21:30

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large Language Models in Randomized Controlled Trials Design: Observational Study

2025 · 4 citations · Journal of Medical Internet Research · Open Access
Open full text at publisher

Citations: 4 · Authors: 7 · Year: 2025

Abstract

Background: Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored.

Objective: This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability and recruitment diversity and reduce failure rates, while maintaining clinical safety and ethical standards.

Methods: We conducted a noninterventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 registered studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by 2 independent clinical experts, who compared them to clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analysis using natural language processing (NLP)-based metrics, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-L, and Metric for Evaluation of Translation with Explicit ORdering (METEOR), for objective scoring of the corresponding LLM outputs. Qualitative assessments were performed using Likert scale ratings (1-3) for domains such as safety, clinical accuracy, objectivity or bias, pragmatism, inclusivity, and diversity.

Results: The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, LLMs showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). The NLP-based statistical analysis yielded average objective scores of BLEU=0.04, ROUGE-L=0.20, and METEOR=0.18 across LLM outputs. Qualitative evaluations showed that LLM-generated designs scored above 2 points and closely matched the original designs across all domains, indicating strong clinical alignment. Specifically, both original and LLM-based designs ranked similarly high in safety, clinical accuracy, and objectivity or bias in published RCTs. Moreover, LLM-based designs ranked noninferior to original designs in registered RCTs across multiple domains. In particular, LLMs enhanced diversity and pragmatism, which are key factors in improving RCT generalizability and addressing failure rates.

Conclusions: LLMs, such as GPT-4-Turbo-Preview, have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.
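For context on the objective scoring above, ROUGE-L compares a generated text against a reference via their longest common subsequence (LCS). The following is a minimal, illustrative sketch only, assuming simple whitespace tokenization and using hypothetical eligibility-criteria strings; the study itself would have used established NLP evaluation packages rather than this hand-rolled version:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: reference (ground-truth) vs. LLM-generated criterion.
reference = "adults aged 18 to 65 with type 2 diabetes"
generated = "adults 18 to 65 years with diabetes"
print(round(rouge_l(generated, reference), 2))  # LCS = 6 tokens -> 0.75
```

BLEU and METEOR follow the same candidate-versus-reference pattern but score n-gram precision (with a brevity penalty) and aligned unigram matches with stemming/synonymy, respectively, which is why the three metrics can diverge on the same output.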
