This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Tuning a neural information retrieval system in medical domains with limited data: Development Study (Preprint)
Citations: 0
Authors: 4
Year: 2022
Abstract
<sec> <title>BACKGROUND</title> Pathology reports contain key information about the patient’s diagnosis as well as important gross and microscopic findings. These information-rich clinical reports are an invaluable resource for clinical studies, but extracting and analyzing data from such unstructured text is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they usually require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. An automated data extraction method for pathology reports that does not require a large training corpus would therefore be of significant value and utility. </sec> <sec> <title>OBJECTIVE</title> To develop a language model-based neural information retrieval system that can be trained on small data sets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions. </sec> <sec> <title>METHODS</title> We developed a neural information retrieval system that can be trained successfully on small text corpora and validated it by training it to automatically answer two predefined questions given the text of a renal pathology report: 1) “What kind of rejection does the patient show?” and 2) “What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?” First, following the conventionally recommended procedure for developing domain-specific models, we further pre-trained Clinical BERT, a previously proposed medical language model, on our text corpus of 3.4K renal transplant pathology reports (1.5M words) using masked language modeling, obtaining “Kidney BERT”. Second, we hypothesized that this conventional pre-training procedure fails to capture the intricate vocabulary of narrow technical domains.
We therefore created extended Kidney BERT (“exKidneyBERT”) by extending the tokenizer of Clinical BERT with six technical keywords from our corpus (which we determined were missing from the original tokenizer vocabulary) and then repeating the pre-training procedure. Third, to further improve performance, all three models were fine-tuned with information retrieval (IR) heads tailored to the two questions of interest. </sec> <sec> <title>RESULTS</title> For the first question, on rejection, the word-level overlap ratio of exKidneyBERT (83.3% for antibody-mediated rejection [ABMR] and 79.2% for T-cell mediated rejection [TCMR]) exceeded that of both Clinical BERT and Kidney BERT (both 46.1% for ABMR and 65.2% for TCMR). For the second question, on IFTA, the exact match rate of exKidneyBERT (95.8%) exceeded that of Kidney BERT (95.0%) and Clinical BERT (94.7%). </sec> <sec> <title>CONCLUSIONS</title> We developed exKidneyBERT, a high-performing model for automatically extracting information from renal pathology reports. More broadly, we found that when working in domains with highly specialized vocabulary, extending the vocabulary of the BERT tokenizer is essential for improving model performance; otherwise, pre-training (especially on small corpora) is ineffective. In our case, pre-training BERT language models on kidney pathology reports improved model performance even though the training corpus was much smaller than the corpora normally used to train language models. </sec> <sec> <title>CLINICALTRIAL</title> <p /> </sec>
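The tokenizer-extension step described in the Methods can be illustrated with a toy example. The sketch below implements the greedy longest-match-first segmentation scheme used by WordPiece tokenizers (the subword scheme underlying BERT) and shows why a domain keyword missing from the vocabulary gets fragmented into subword pieces, whereas adding it to the vocabulary preserves it as a single token. The vocabulary entries and the example word "tubulitis" are hypothetical illustrations, not the paper's actual six keywords.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation, as used by BERT.

    Subword pieces that continue a word are prefixed with '##'.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:  # no vocabulary piece matches: the word is unknown
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical base vocabulary that lacks the domain keyword "tubulitis":
base_vocab = {"tubul", "##itis", "grade", "rejection"}
print(wordpiece_tokenize("tubulitis", base_vocab))      # → ['tubul', '##itis']

# Extending the vocabulary keeps the keyword as one token. In Hugging Face
# Transformers this corresponds to tokenizer.add_tokens([...]) followed by
# model.resize_token_embeddings(len(tokenizer)) before further pre-training.
extended_vocab = base_vocab | {"tubulitis"}
print(wordpiece_tokenize("tubulitis", extended_vocab))  # → ['tubulitis']
```

A single embedding per intact keyword gives the model one consistent representation to refine during pre-training, rather than forcing it to reassemble the term's meaning from subword fragments on every occurrence.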