This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses
49
Citations
11
Authors
2024
Year
Abstract
BACKGROUND: Systematic reviews are performed manually despite the exponential growth of scientific literature.
OBJECTIVE: To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer, for title and abstract screening in systematic reviews.
DESIGN: Diagnostic test accuracy study.
SETTING: Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations.
PARTICIPANTS: None.
MEASUREMENTS: A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened.
RESULTS: Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
LIMITATIONS: Time needed to fine-tune prompt. Retrospective nature of the study, convenient sample of 5 systematic reviews, and GPT performance sensitive to prompt development and time.
CONCLUSION: The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level.
PRIMARY FUNDING SOURCE: None.
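For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how sensitivity and specificity can be computed when a screener's include/exclude decisions are compared with reference decisions. It is not the authors' code: the names `ScreeningCounts`, `tally`, `sensitivity`, and `specificity` are hypothetical, and the eight toy decisions are invented for illustration only, not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScreeningCounts:
    """Confusion-matrix counts for include/exclude screening decisions."""
    true_pos: int   # reference includes, screener includes
    false_pos: int  # reference excludes, screener includes
    true_neg: int   # reference excludes, screener excludes
    false_neg: int  # reference includes, screener excludes


def tally(reference: Sequence[bool], predicted: Sequence[bool]) -> ScreeningCounts:
    """Count agreements and disagreements; True means 'include'."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    return ScreeningCounts(
        true_pos=sum(1 for r, p in pairs if r and p),
        false_pos=sum(1 for r, p in pairs if not r and p),
        true_neg=sum(1 for r, p in pairs if not r and not p),
        false_neg=sum(1 for r, p in pairs if r and not p),
    )


def sensitivity(c: ScreeningCounts) -> float:
    """Share of citations the reference included that the screener also included."""
    denom = c.true_pos + c.false_neg
    if denom == 0:
        raise ValueError("no reference inclusions; sensitivity is undefined")
    return c.true_pos / denom


def specificity(c: ScreeningCounts) -> float:
    """Share of citations the reference excluded that the screener also excluded."""
    denom = c.true_neg + c.false_pos
    if denom == 0:
        raise ValueError("no reference exclusions; specificity is undefined")
    return c.true_neg / denom


if __name__ == "__main__":
    # Invented toy decisions (hypothetical, not study data).
    reference = [True, True, True, False, False, False, False, False]
    predicted = [True, True, False, True, True, False, False, False]
    counts = tally(reference, predicted)
    print(f"sensitivity = {sensitivity(counts):.3f}")
    print(f"specificity = {specificity(counts):.3f}")
```

In a screening setting, a rule tuned for high sensitivity keeps almost every citation the reference would include, usually at the price of more false positives and therefore lower specificity, which is the trade-off the abstract reports between its two rules.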
Similar works
The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
2021 · 90,765 citations
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement
2009 · 83,089 citations
The Measurement of Observer Agreement for Categorical Data
1977 · 78,048 citations
Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement
2009 · 63,587 citations
Measuring inconsistency in meta-analyses
2003 · 62,236 citations
Authors
Institutions
- Inserm (FR)
- Université Paris Cité (FR)
- Sorbonne Université (FR)
- Université Sorbonne Paris Nord (FR)
- Sorbonne Paris Cité (FR)
- Assistance Publique – Hôpitaux de Paris (FR)
- Centre de Recherche Épidémiologie et Statistique (FR)
- Universität für Weiterbildung Krems (AT)
- RTI International (US)
- University of Freiburg (DE)
- Université Paris-Est Créteil (FR)
- Epidemiology in dermatology and evaluation of therapeutics
- Délégation Paris 5 (FR)