This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Computer assisted verbal autopsy: comparing large language models to physicians for assigning causes to 6939 deaths in Sierra Leone from 2019–2022

2025 · 2 citations · BMC Medicine · Open Access

Citations: 2
Authors: 11
Year: 2025

Abstract

BACKGROUND: Verbal autopsies (VAs) collect information on deaths occurring outside healthcare facilities in low- and middle-income countries to estimate causes of death (CODs) for use in epidemiological or planning studies. Physician coding of VAs, which focuses on the narrative of the death and past symptoms, is current best practice. Large language models (LLMs) such as GPT-5 make it possible to use the narrative portion of VAs to assign CODs. However, few if any robust comparisons of LLMs with physician coding exist.

METHODS: We analyzed 6,939 VA records from a random sample of deaths in Sierra Leone (2019-2022) to compare five models against physician-assigned CODs: three LLMs (GPT-3.5, GPT-4, GPT-5) and two based on symptom algorithms (InterVA-5, InSilicoVA). GPT models used narratives, whereas InterVA-5 and InSilicoVA relied on questionnaires. CODs were grouped into 19, 10, and 7 categories for adult, child, and neonatal deaths, respectively. We used cause-specific mortality fraction (CSMF) accuracy to assess population-level agreement and partial chance-corrected concordance (PCCC) to assess individual-level agreement, both measured against the standard of physician coding. We stratified analyses by age group because CODs vary among neonates, children, and adults.

RESULTS: Overall, GPT-5 outperformed all other models (PCCC = 0.71), followed by GPT-4 (0.61), GPT-3.5 (0.56), InSilicoVA (0.44), and InterVA-5 (0.44). GPT-5 achieved the highest performance for adult (0.68), child (0.71), and neonatal (0.65) deaths. Across ages, performance increased from 1 month to 14 years and declined from 15 to 69 years. GPT-5, GPT-4, GPT-3.5, and InSilicoVA achieved the highest PCCC in 14, 7, 7, and 2 of the 30 CODs, respectively. At the population level, GPT-5 achieved the highest CSMF accuracy (0.9), while the other models performed comparably to one another (0.74-0.79).

CONCLUSIONS: GPT models and InSilicoVA each performed best for specific CODs at the individual level. GPT models demonstrated improvements over the InterVA-5 and InSilicoVA models. This study provides foundational evidence for integrating LLM and algorithmic models with physician coding to improve the quality of VA data.
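
The two agreement metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly used definitions of CSMF accuracy and of chance-corrected concordance with a single predicted cause per death (PCCC with k = 1). The abstract does not say how the authors computed them, so this is not their implementation. The function names, the cause labels, and the four-death toy dataset are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def csmf_accuracy(reference, predicted, causes):
    """Population-level agreement between two lists of cause labels.

    Assumes at least two causes, so the denominator is non-zero.
    """
    n = len(reference)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Total absolute difference between reference and predicted cause fractions.
    abs_error = sum(abs(ref_counts[c] - pred_counts[c]) / n for c in causes)
    # Largest possible error: every death assigned to the rarest reference cause.
    min_ref_fraction = min(ref_counts[c] / n for c in causes)
    return 1 - abs_error / (2 * (1 - min_ref_fraction))


def chance_corrected_concordance(reference, predicted, causes):
    """Individual-level agreement, corrected for chance, averaged over causes."""
    chance = 1 / len(causes)  # probability of a correct random guess
    per_cause = {}
    for cause in causes:
        preds = [p for r, p in zip(reference, predicted) if r == cause]
        if not preds:
            continue  # cause absent from the reference labels
        sensitivity = preds.count(cause) / len(preds)
        per_cause[cause] = (sensitivity - chance) / (1 - chance)
    overall = sum(per_cause.values()) / len(per_cause)
    return overall, per_cause


# Hypothetical toy data: four deaths, three causes (not from the study).
causes = ["malaria", "stroke", "injury"]
physician = ["malaria", "malaria", "stroke", "injury"]
model = ["malaria", "stroke", "stroke", "injury"]

accuracy = csmf_accuracy(physician, model, causes)
overall_ccc, ccc_by_cause = chance_corrected_concordance(physician, model, causes)
print(f"CSMF accuracy: {accuracy:.2f}")
print(f"Chance-corrected concordance: {overall_ccc:.2f}")
```

Under these definitions a model can reproduce the population cause distribution fairly well while still misclassifying individual deaths, which is why the two metrics are reported separately.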

Similar works