OpenAlex · Updated hourly · Last updated: 17 May 2026, 20:04

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing the methodologic quality of systematic reviews using generative large language models

2025 · 0 citations · Canadian Urological Association Journal · Open Access

Citations: 0 · Authors: 6 · Year: 2025

Abstract

INTRODUCTION: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs). METHODS: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicate, with differences adjudicated by a third expert. We created a customized generative artificial intelligence (generative pre-trained transformer [GPT]), "Urology AMSTAR 2 Quality Assessor," and used it to grade the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each of the SRs using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against human results. Internal validity among the three trials was computed. RESULTS: GPT had an overall congruence of 75%, with 77% in critical criteria and 73% in non-critical criteria when compared to human results. The average F1 score was 0.66. Internal validity was high, at 85% among the three iterations. GPT accurately assigned 89% of studies to the correct overall category. When given specific, step-by-step instructions, congruence of critical criteria improved to 91%, and overall quality assessment accuracy to 93%. CONCLUSIONS: GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.
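The metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: each AMSTAR 2 item rating is reduced to a boolean (criterion met or not met), the human consensus is treated as the reference standard, and "internal validity" is approximated as the share of items on which all runs agree, since the abstract does not define it. The function names `binary_metrics` and `run_agreement` are hypothetical.

```python
from typing import Dict, Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(human: Sequence[bool], model: Sequence[bool]) -> Dict[str, float]:
    """Compare model ratings with human reference ratings (assumed binary)."""
    if len(human) != len(model):
        raise ValueError("human and model ratings must have the same length")

    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h and m)          # both say "met"
    tn = sum(1 for h, m in pairs if not h and not m)  # both say "not met"
    fp = sum(1 for h, m in pairs if not h and m)      # model says "met", human does not
    fn = sum(1 for h, m in pairs if h and not m)      # human says "met", model does not

    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def run_agreement(runs: Sequence[Sequence[bool]]) -> float:
    """Share of items rated identically in every run (assumed consistency measure)."""
    items = list(zip(*runs))
    unanimous = sum(1 for ratings in items if len(set(ratings)) == 1)
    return _ratio(unanimous, len(items))


if __name__ == "__main__":
    # Made-up toy ratings for illustration only; not data from the study.
    human = [True, True, False, False, True, False]
    model = [True, False, False, True, True, False]
    print(binary_metrics(human, model))
    print(run_agreement([[True, True, False], [True, False, False], [True, True, False]]))
```

In practice, AMSTAR 2 items can also take intermediate responses such as "partial yes", so a real analysis would need an explicit rule for mapping those responses before computing sensitivity and specificity.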

Similar works