This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Contrastive text embeddings with effective sample mining for enhanced disease diagnosis
0
Citations
9
Authors
2025
Year
Abstract
Disease diagnosis is a pivotal task in Clinical Decision Support (CDS) that aids physicians in differential diagnosis, yet it still faces challenges in achieving high precision and improving clinical adaptability. Deep learning models based on pre-trained transformers, especially Large Language Models (LLMs) and text embedding models, bring opportunities to construct advanced disease diagnosis models, but each kind of model has its own difficulties. LLM-based disease diagnosis models have not shown reliable performance on generated diagnoses, and physicians find it difficult to trace the original evidence behind those diagnoses. Text embedding models can address these issues, but general text embedding models are trained on open-domain corpora whose distribution differs from that of authentic clinical notes, and the resulting domain-specific semantic misalignment causes inaccurate ranking. In this paper, we build Disease Diagnoser on a hybrid information retrieval architecture consisting of an augmented retriever and an augmented reranker. We propose an approach that constructs DD-retriever and DD-reranker, respectively, through contrastive text embeddings with Effective Sample Mining, addressing the domain-specific semantic misalignment during contrastive learning and thereby enhancing diagnostic accuracy. Specifically, Effective Sample Mining provides high-quality positive and negative samples for model fine-tuning, augmenting contrastive learning. We define the Semantic Target to improve the capability of a text embedding model to identify positive samples during contrastive learning. Extensive experiments demonstrate that Disease Diagnoser outperforms the best-performing SOTA LLM by 12.9%, 22.9% and 24.8% on top-3, top-5 and top-10 diagnostic accuracy, respectively. Our approach is validated to generalize to any hospital, which can use its private annotated clinical notes to construct its own disease diagnosis model.
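The top-3, top-5 and top-10 figures above are top-k diagnostic accuracies. As a minimal sketch, assuming the common definition (a case counts as correct when its reference diagnosis appears among the first k ranked candidates; the abstract does not spell out the exact definition used), the metric could be computed as follows. The function name `top_k_accuracy` and the toy data are illustrative and not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of cases whose gold diagnosis is among the first k ranked candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same number of cases")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, label in zip(ranked, gold) if label in candidates[:k])
    return hits / len(gold)


# Toy rankings (hypothetical): one ranked candidate list per clinical note.
ranked = [
    ["pneumonia", "bronchitis", "asthma"],
    ["migraine", "tension headache", "sinusitis"],
    ["gastritis", "peptic ulcer", "pancreatitis"],
]
# Hypothetical reference diagnoses, one per clinical note.
gold = ["bronchitis", "sinusitis", "appendicitis"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, gold, k):.3f}")
```

In this toy setup accuracy can only stay equal or rise as k grows, because a larger k admits more candidates per case.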
Additionally, we construct PMC-Patients-DD, a new public clinical note dataset with ground truth, specifically designed for disease diagnosis related tasks. This dataset is available to researchers in the field of disease diagnosis to facilitate further research.
• We propose the Effective Sample Mining approach, which mines exact positive samples and hard negative samples to provide high-quality medical positive and negative samples, addressing the domain-specific semantic misalignment during contrastive learning and thereby enhancing the accuracy of disease diagnosis. Exact Positive Mining aims to mine high-quality positive samples through an LLM's information extraction and web search, while Hard Negative Mining randomly generates a number of negative samples and selects those that are neither completely irrelevant to nor highly semantically related to the positive samples.
• We define the Semantic Target, a vector extracted from a query or explicitly associated with the query and used to mine the exact positive sample. It establishes a high semantic correlation in disease diagnosis between a query and the exact positive sample, thereby improving the capability of a text embedding model to identify positive samples during contrastive learning.
• We construct the PMC-Patients-DD dataset, a new public clinical note dataset with ground truth, specifically designed for disease diagnosis related tasks. We have integrated the disease names mined from PMC-Patients with their corresponding patient summaries to create a new dataset of 29470 patient summaries annotated with diagnoses covering 1145 types of diseases. This dataset is available to researchers in the field of disease diagnosis to facilitate further research.
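The Hard Negative Mining step described above (draw random candidates, then keep those that are neither irrelevant nor too close to the positive sample) can be illustrated with a similarity band filter. This is only a sketch under stated assumptions: the use of cosine similarity, the thresholds `low` and `high`, the function names, and the toy 2-dimensional vectors are all hypothetical, and the paper's actual selection criterion may differ.

```python
import math
import random
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_hard_negatives(
    positive: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    low: float = 0.3,   # assumed lower bound: below this a candidate is "irrelevant"
    high: float = 0.8,  # assumed upper bound: above this it is "too similar"
    n_draw: Optional[int] = None,
    seed: int = 0,
) -> List[str]:
    """Randomly draw candidates, then keep those whose similarity to the positive is in [low, high]."""
    rng = random.Random(seed)
    names = sorted(candidates)
    size = len(names) if n_draw is None else min(n_draw, len(names))
    drawn = rng.sample(names, size)
    return sorted(n for n in drawn if low <= cosine(positive, candidates[n]) <= high)


# Toy embeddings (hypothetical); a real system would use text embedding vectors.
positive = [1.0, 0.0]
candidates = {
    "near_duplicate": [0.99, 0.05],     # almost identical to the positive: rejected
    "related": [0.6, 0.8],              # cosine 0.6: kept as a hard negative
    "somewhat_related": [0.5, 0.866],   # cosine about 0.5: kept
    "unrelated": [0.0, 1.0],            # cosine 0.0: rejected as irrelevant
}

selected = mine_hard_negatives(positive, candidates)
print(selected)
```

The band keeps negatives that are informative for contrastive fine-tuning while filtering out likely false negatives (near duplicates of the positive) and trivially easy ones.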
Similar works
"Why Should I Trust You?"
2016 · 14,315 citations
A Comprehensive Survey on Graph Neural Networks
2020 · 8,685 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Artificial intelligence in healthcare: past, present and future
2017 · 4,411 citations