OpenAlex · Updated hourly · Last updated: April 27, 2026, 04:00

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Dual-Model LLM Ensemble via Web Chat Interfaces Reaches Near-Perfect Sensitivity for Systematic-Review Screening: A Multi-Domain Validation with Equivalence to API Access

2025 · 0 citations · medRxiv · Open Access
Open full text at publisher

Citations: 0
Authors: 7
Year: 2025

Abstract

Background: Prior work showed that state-of-the-art (mid-2025) large language models (LLMs) prompted with varying batch sizes can perform well on systematic review (SR) abstract screening via public APIs within a single medical domain. Whether comparable performance holds when using no-code web interfaces (GUIs), and whether results generalize across medical domains, remains unclear.

Objective: To evaluate the screening performance of a zero-shot, large-batch, two-model LLM ensemble (OpenAI GPT-5 Thinking; Google Gemini 2.5 Pro) operated via public chat GUIs across a diverse range of medical topics, and to compare its performance with an equivalent API-based workflow.

Methods: We conducted a retrospective evaluation using 736 titles and abstracts from 16 Cochrane reviews (330 included, 406 excluded), all published in May-June 2025. The primary outcome was the sensitivity of a pre-specified "OR" ensemble rule designed to maximize sensitivity, benchmarked against final full-text inclusion decisions (reference standard). Secondary outcomes were specificity, single-model performance, and duplicate-run reliability (Cohen's κ). Because the models saw only titles/abstracts while the reference standard reflected full-text decisions, the specificity estimates are conservative for abstract-level screening.

Results: The GUI-based ensemble achieved 99.7% sensitivity (95% CI, 98.3%-100.0%) and 49.3% specificity (95% CI, 44.3%-54.2%). The API-based workflow yielded comparable performance, with 99.1% sensitivity (95% CI, 97.4%-99.8%) and 49.3% specificity (95% CI, 44.3%-54.2%). The difference in sensitivity was not statistically significant (McNemar p=0.625) and met equivalence within a ±2-percentage-point margin (TOST p<0.05). Duplicate-run reliability was substantial to almost perfect (Cohen's κ: 0.78-0.93). The two models showed complementary strengths: Gemini 2.5 Pro consistently achieved higher sensitivity (94.5%-98.2% across single runs), whereas GPT-5 Thinking yielded higher specificity (62.3%-67.0%).

Conclusions: A zero-code, browser-based workflow using a dual-LLM ensemble achieves near-perfect sensitivity for abstract screening across multiple medical domains, with performance equivalent to API-based methods. Ensemble approaches spanning two model families may mitigate model-specific blind spots. Prospective studies should quantify workload, cost, and operational feasibility in end-to-end systematic review pipelines.
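The pre-specified "OR" ensemble rule described in the abstract keeps a record whenever either model votes to include it, trading specificity for sensitivity. The sketch below illustrates that rule and the sensitivity/specificity computation; the helper names and the toy vote lists are illustrative assumptions, not the authors' code or data.

```python
# Sketch of an "OR" ensemble rule for abstract screening:
# a record advances to full-text review if EITHER model includes it.
def ensemble_or(gpt5_vote: bool, gemini_vote: bool) -> bool:
    return gpt5_vote or gemini_vote

def sensitivity_specificity(preds, truths):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP),
    judged against the reference standard (full-text decisions)."""
    tp = sum(p and t for p, t in zip(preds, truths))
    tn = sum((not p) and (not t) for p, t in zip(preds, truths))
    fn = sum((not p) and t for p, t in zip(preds, truths))
    fp = sum(p and (not t) for p, t in zip(preds, truths))
    return tp / (tp + fn), tn / (tn + fp)

# Toy example (invented votes, not study data):
truth  = [True, True, False, False]   # full-text inclusion decisions
gpt5   = [True, False, False, True]   # hypothetical GPT-5 Thinking votes
gemini = [False, True, False, False]  # hypothetical Gemini 2.5 Pro votes

ensemble = [ensemble_or(g, m) for g, m in zip(gpt5, gemini)]
sens, spec = sensitivity_specificity(ensemble, truth)
# Here the ensemble recovers both true inclusions (sensitivity 1.0)
# while accepting one false positive (specificity 0.5).
```

Note how the OR rule can only raise sensitivity relative to either single model (any record caught by one model is kept), which is why the study pairs it with a conservative specificity interpretation.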

Topics

Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Biomedical Text Mining and Ontologies