OpenAlex · Updated hourly · Last updated: 17.05.2026, 03:33

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Multi-LLM Disagreement as a Scalable Detector of Human Annotation Errors in Structured Data from Clinical Free-Text

2026 · 0 citations · medRxiv · Open Access

0 citations · 7 authors · year 2026

Abstract

Objective: Structured extraction from clinical free-text depends on human annotators whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review.

Methods: Multiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs matching the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates.

Results: Human error rates rose as LLM agreement fell, from 0% at scores 3–4 to 76% at score 0. The lowest-agreement stratum comprised only 6.5% of annotations yet concentrated an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987–0.994) and AUC-PR of 0.893 (95% CI 0.851–0.929) for error detection.

Discussion: Multi-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review.

Conclusion: Multi-LLM disagreement provides a scalable quality-control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR-compliant; its language- and task-agnostic design supports application across clinical domains.
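The agreement score described in the Methods section is simple to state: for each human-annotated value, count how many of the independent LLM extractions match it, then flag low-scoring annotations for review. A minimal sketch of that scoring step is shown below; the variable names and the example labels are hypothetical and not taken from the study.

```python
def agreement_score(human_label, llm_labels):
    """Count how many independent LLM extractions match the human annotation.

    With four models, the score ranges from 0 (all models disagree with
    the human) to 4 (all models agree); lower scores indicate annotations
    that should be prioritized for expert review.
    """
    return sum(1 for label in llm_labels if label == human_label)


# Hypothetical example: four model outputs for one structured variable
human = "sigmoid colon"
llm_outputs = ["sigmoid colon", "sigmoid colon", "descending colon", "sigmoid colon"]

score = agreement_score(human, llm_outputs)
print(score)  # 3 of 4 models agree -> lower review priority
```

In a review workflow, annotations would be binned by score and the low-agreement strata (e.g. scores 0–1) routed to an adjudicator first, mirroring the stratified oversampling the abstract describes.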
