OpenAlex · Updated hourly · Last updated: 05.05.2026, 17:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Natural Language Processing Methods Automate Molecular Marker Extraction From Glioma Pathology Reports

2026 · 0 citations · Neurosurgery

Citations: 0 · Authors: 16 · Year: 2026

Abstract

BACKGROUND AND OBJECTIVES: Molecular markers such as isocitrate dehydrogenase (IDH) and alpha-thalassemia/mental retardation syndrome X-linked (ATRX) status are essential for glioma classification and treatment planning, but their manual extraction from pathology reports creates significant research bottlenecks. This study evaluated 3 Natural Language Processing approaches with increasing computational complexity: deterministic Regular Expressions (RegEx), statistical Term Frequency-Inverse Document Frequency (TF-IDF) with logistic regression, and contextual deep learning Bidirectional Encoder Representations from Transformers (BERT). We address whether more intensive approaches provide sufficient performance benefits over simpler approaches in computational pathology research.
METHODS: We analyzed pathology reports from 404 patients with glioma at Institution A and 197 at Institution B for external validation. IDH analysis included 399 (Institution A) and 193 (Institution B) patients; ATRX analysis included 361 and 130 patients, respectively. All approaches underwent identical preprocessing steps, including text normalization, terminology standardization, and context extraction. Performance was evaluated using standard classification metrics and memory usage benchmarks on internal and external validation data sets.
RESULTS: Simpler approaches outperformed more intensive approaches on external validation. For IDH, RegEx achieved near-perfect accuracy (99%, area under the curve [AUC] 1.000) and TF-IDF performed exceptionally (94.2%, AUC 0.984), while BlueBERT underperformed (85.2%, AUC 0.934). For ATRX, RegEx achieved perfect accuracy (100%, AUC 1.000) and TF-IDF maintained high accuracy (98.0%, AUC 0.998), outperforming BERT-large (84.6%, AUC 0.931). BERT-based approaches required 1825-1953 MB of memory vs RegEx (0.82-5.52 MB) and TF-IDF (17.27-34.89 MB).
CONCLUSION: Simple Natural Language Processing approaches effectively automate molecular marker extraction from pathology reports with near-perfect accuracy while requiring minimal computational resources. This enables expanded sample sizes in retrospective studies, multi-institutional analyses of rare molecular subgroups, and accelerated biomarker research. Future work will focus on validation across larger data sets, infrastructure integration, and expansion to additional molecular markers.
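
The abstract does not reproduce the authors' patterns, preprocessing rules, or benchmarking setup. Purely as an illustration of what a deterministic RegEx approach with text normalization could look like, here is a minimal Python sketch. The status vocabulary, the 40-character context window, and the names `normalize`, `make_pattern`, and `extract_status` are all assumptions made for this example and are not taken from the paper. The `tracemalloc` call only tracks Python-level allocations and is not necessarily how the study measured memory.

```python
import re
import tracemalloc
from typing import Optional

# Hypothetical status vocabulary (an assumption, not the paper's term list).
# Keys are surface forms after normalization; values are canonical labels.
STATUS_TERMS = {
    "mutant": "mutant",
    "mutated": "mutant",
    "wildtype": "wildtype",
    "wild type": "wildtype",
    "retained": "retained",
    "lost": "lost",
    "loss": "lost",
}


def normalize(text: str) -> str:
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    text = text.lower().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


def make_pattern(marker: str) -> "re.Pattern[str]":
    """Marker name, a short context window, then a status term."""
    # Longest terms first so "wild type" is tried before shorter alternatives.
    terms = sorted(STATUS_TERMS, key=len, reverse=True)
    statuses = "|".join(re.escape(t) for t in terms)
    # The window stops at '.' or ';' so a match stays within one clause.
    return re.compile(rf"\b{re.escape(marker)}\d?\b[^.;]{{0,40}}?\b({statuses})\b")


def extract_status(report: str, marker: str) -> Optional[str]:
    """Return a canonical status label for the marker, or None if not found."""
    match = make_pattern(marker.lower()).search(normalize(report))
    return STATUS_TERMS[match.group(1)] if match else None


# Made-up report text, for illustration only.
report = "Immunohistochemistry: IDH1 (R132H): mutant. ATRX - retained nuclear expression."

tracemalloc.start()
result = {m: extract_status(report, m) for m in ("IDH", "ATRX")}
_, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
tracemalloc.stop()

# result == {"IDH": "mutant", "ATRX": "retained"}
print(result, f"peak traced memory: {peak / 1e6:.3f} MB")
```

A sketch like this ignores negation ("no evidence of IDH mutation"), equivocal results, and institution-specific phrasing, which a production extractor would have to handle and which is where validation on an external data set, as described in the abstract, becomes important.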

Similar works