This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Session 11: <em>Methods Towards A Distant Supervision Paradigm for Clinical Information Extraction: Creating Large Training Datasets for Machine Learning</em>
Citations: 0
Authors: 4
Year: 2018
Abstract
Background: In the era of big data, a large number of clinical narratives exist in electronic health records. Automatic extraction of key variables from clinical narratives has facilitated many aspects of healthcare and biomedical research. Conventional approaches are based on rule-based natural language processing (NLP) techniques that rely on expert knowledge and exhaustive human effort in designing rules. Recently, machine learning has shown large performance gains over conventional NLP approaches. Despite these improvements, large manually labeled training datasets are crucial building blocks of conventional machine learning methods and key enablers of recent deep learning methods. However, large training datasets are not always readily available and are usually expensive to obtain from human annotators. The problem is more acute in the clinical domain, where the Health Insurance Portability and Accountability Act (HIPAA) rules out methods such as crowdsourcing and annotators are required to be medical experts.
Method: In this paper, we propose a distant supervision paradigm for clinical information extraction. In this paradigm, rule-based NLP algorithms are used to automatically generate large labeled training datasets. Machine learning models are subsequently trained on these distant labels with word embedding features.
Results: We study the effectiveness of the proposed framework on two clinical information extraction tasks: the i2b2 smoking status extraction shared task and a fracture extraction task at our institution. We tested three prevalent machine learning models, namely Convolutional Neural Networks (CNN), Support Vector Machine, and Random Forest.
Conclusion: The experimental results show that the proposed distant supervision paradigm is effective for the machine learning models to learn rules towards the gold standard from distant labels. Moreover, the machine learning models trained on the distant labels generated by a rule-based NLP algorithm could perform better than the NLP algorithm given sufficient data. Additionally, we showed that CNN was more sensitive to the data size than the conventional machine learning models and that all the tested machine learning methods were viable options for the distant supervision paradigm.
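The paradigm described in the abstract (rules generate labels, a model is then trained on those labels using word embedding features) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the regular-expression rules, the hand-written 2-dimensional embedding table, the example notes, and the function names are all hypothetical. A nearest-centroid classifier stands in for the CNN, Support Vector Machine, and Random Forest models named in the abstract so that the example needs only the Python standard library.

```python
import re

# Hypothetical rules standing in for a rule-based NLP algorithm.
SMOKING = re.compile(r"\b(smokes|smoker|tobacco use)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(denies|never|no)\b", re.IGNORECASE)


def rule_label(text):
    """Distant label: 1 (smoker) if a smoking term appears without negation."""
    return 1 if SMOKING.search(text) and not NEGATION.search(text) else 0


# Hypothetical toy word embeddings; real ones would be learned from a corpus.
EMBEDDINGS = {
    "smokes": (1.0, 0.0), "smoker": (1.0, 0.1), "tobacco": (0.9, 0.0),
    "cigarettes": (0.9, 0.1),  # near the smoking terms, but not in the rules
    "denies": (0.0, 1.0), "never": (0.0, 0.9), "no": (0.1, 1.0),
}


def embed(text):
    """Average the embeddings of known tokens (zero vector if none known)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(2))


def train_centroids(texts):
    """Label unannotated texts with the rules, then average features per label."""
    grouped = {}
    for text in texts:
        grouped.setdefault(rule_label(text), []).append(embed(text))
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))
        for label, vecs in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest to the text's features."""
    x = embed(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


# Unannotated notes (invented examples); no human labels are used anywhere.
NOTES = [
    "Patient smokes daily",
    "Current smoker with cough",
    "Patient denies tobacco use",
    "Never smoker",
    "No tobacco use reported",
]
CENTROIDS = train_centroids(NOTES)

if __name__ == "__main__":
    unseen = "Reports two packs of cigarettes daily"
    print("rule:", rule_label(unseen), "model:", predict(CENTROIDS, unseen))
```

In this toy setup the rules miss the unseen note because it contains none of the rule keywords, while the model can still place it near the smoker centroid through the embedding of "cigarettes". This only illustrates the mechanism by which a model trained on distant labels might generalize beyond the rules; it says nothing about the performance reported in the paper.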