OpenAlex · Updated hourly · Last updated: 09.05.2026, 05:45

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluation of six large language models for study identification in an obstetric systematic review

2026 · 0 citations · Minerva Obstetrics and Gynecology
Open full text at the publisher

Citations: 0
Authors: 6
Year: 2026

Abstract

INTRODUCTION: Identifying eligible studies is a foundational component of systematic reviews, requiring careful interpretation of complex inclusion and exclusion criteria. Given the rapid integration of large language models (LLMs) into evidence synthesis, their reliability in autonomously performing this task warrants timely evaluation. This study assessed whether general-purpose LLMs can accurately perform the study identification phase of a published systematic review in obstetrics using only predefined eligibility criteria. EVIDENCE ACQUISITION: Six publicly accessible LLMs were given a standardized prompt, without iterative refinement or human oversight, to identify eligible studies from a 2023 JAMA Network Open meta-analysis. Each model's output was compared with the 14 studies included in the reference review. Primary outcomes were precision, recall, F1 score, and hallucination severity. EVIDENCE SYNTHESIS: Claude 3.7 achieved the highest accuracy, correctly identifying 5 of 14 reference studies (precision 71.4%, recall 35.7%, F1 score 0.48). All other models performed substantially worse, with minimal variation in F1 scores (0.08 to 0.12), indicating a poor balance of precision and recall. Precision was generally inversely related to the number of false positives, and LLMs that returned more total studies tended to produce more hallucinations. CONCLUSIONS: Current general-purpose LLMs are unreliable for autonomous study identification in clinically relevant systematic reviews, and human oversight remains essential. The low F1 scores highlight major limitations in current LLMs' ability to accurately and comprehensively identify relevant studies. These findings underscore the need for fine-tuning and hybrid AI-human workflows before safe integration into evidence synthesis in obstetrics and gynecology.
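The reported metrics for Claude 3.7 can be reproduced from the abstract's figures. Note that the number of studies the model returned (7) is an inference from the stated precision of 71.4% (5/7 ≈ 0.714), not a number given in the abstract:

```python
# Sketch of the precision/recall/F1 arithmetic behind the reported values.
# true_positives and relevant come from the abstract; `returned` is inferred
# from the reported precision (5/7 ≈ 71.4%) and is an assumption.
true_positives = 5   # reference studies correctly identified
returned = 7         # inferred: total studies the model returned
relevant = 14        # studies included in the reference review

precision = true_positives / returned            # ≈ 0.714
recall = true_positives / relevant               # ≈ 0.357
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.2f}")
# → precision=0.714 recall=0.357 F1=0.48
```

The low F1 follows directly from the weak recall: even a perfectly precise model that finds only 5 of 14 studies cannot exceed an F1 of about 0.53.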

Topics

Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Maternal and fetal healthcare