This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Contrastive text embeddings with effective sample mining for enhanced disease diagnosis

2025 · 0 citations · Array · Open Access

Abstract

Disease diagnosis, a pivotal task in Clinical Decision Support (CDS) that aids physicians in differential diagnosis, faces challenges in achieving high precision and improving clinical adaptability. The development of deep learning models based on pre-trained transformers, especially Large Language Models (LLMs) and text embedding models, brings opportunities to construct advanced disease diagnosis models, but each kind of model faces its own challenges. LLM-based disease diagnosis models do not yet generate reliable diagnoses, and physicians have difficulty tracing the original evidence behind the generated diagnoses. Text embedding models can address these challenges, but general text embedding models are trained on open-domain corpora whose distribution differs from that of authentic clinical notes, and the resulting domain-specific semantic misalignment causes inaccurate ranking. In this paper, we build Disease Diagnoser on a hybrid information retrieval architecture consisting of an augmented retriever and an augmented reranker. We propose an approach that constructs DD-retriever and DD-reranker, respectively, through contrastive text embeddings with Effective Sample Mining, addressing the domain-specific semantic misalignment during contrastive learning and thereby enhancing diagnostic accuracy. Specifically, Effective Sample Mining provides high-quality positive and negative samples for model fine-tuning, augmenting contrastive learning. We define Semantic Target to improve the capability of a text embedding model to identify positive samples during contrastive learning. Extensive experiments demonstrate that Disease Diagnoser outperforms the best-performing SOTA LLM by 12.9%, 22.9% and 24.8% on top-3, top-5 and top-10 diagnostic accuracy, respectively. Our approach is validated to generalize to any hospital, which can use its private annotated clinical notes to construct a specific disease diagnosis model.
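The results above are reported as top-3, top-5 and top-10 diagnostic accuracy. The abstract does not define the metric, so the following is only a minimal sketch under the common assumption that a case counts as correct when its ground-truth diagnosis appears among the first k ranked candidates. The function name `top_k_accuracy` and the toy disease lists are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int
) -> float:
    """Fraction of cases whose ground-truth diagnosis is in the top-k candidates.

    Assumption: this hit-at-k definition is the usual one; the paper's exact
    evaluation protocol is not stated in the abstract.
    """
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(1 for cands, gold in zip(ranked, truth) if gold in cands[:k])
    return hits / len(truth)


# Hypothetical toy rankings, one list of candidate diagnoses per clinical note.
ranked = [
    ["pneumonia", "bronchitis", "asthma", "tuberculosis"],
    ["migraine", "tension headache", "sinusitis", "meningitis"],
    ["gastritis", "peptic ulcer", "pancreatitis", "appendicitis"],
]
truth = ["asthma", "meningitis", "gastritis"]

for k in (1, 3, 4):
    print(k, top_k_accuracy(ranked, truth, k))
```

In this toy data, only the third case is correct at k=1, the first and third at k=3, and all three at k=4.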
Additionally, we construct PMC-Patients-DD, a new public clinical note dataset with ground truth, specifically designed for disease diagnosis related tasks. This dataset is available to researchers in the field of disease diagnosis to facilitate further research.

• We propose the Effective Sample Mining approach, which mines exact positive samples and hard negative samples to provide high-quality medical positive and negative samples, addressing the domain-specific semantic misalignment during contrastive learning and thereby enhancing the accuracy of disease diagnosis. Exact Positive Mining mines high-quality positive samples through an LLM's information extraction and web search, while Hard Negative Mining randomly generates a number of negative samples and selects those that are neither completely irrelevant to nor highly semantically related to the positive samples.
• We define Semantic Target, a vector extracted from a query or explicitly associated with the query, which is used to mine the exact positive sample. It establishes a high semantic correlation in disease diagnosis between a query and the exact positive sample, thereby improving the capability of a text embedding model to identify positive samples during contrastive learning.
• We construct the PMC-Patients-DD dataset, a new public clinical note dataset with ground truth, specifically designed for disease diagnosis related tasks. We have integrated the disease names mined from PMC-Patients with their corresponding patient summaries to create a new dataset of 29,470 patient summaries annotated with diagnoses covering 1,145 types of diseases. This dataset is available to researchers in the field of disease diagnosis to facilitate further research.
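The Hard Negative Mining step described above keeps negatives that are neither completely irrelevant to nor highly related to the positive sample. One possible reading is a similarity band filter over randomly ordered candidates, sketched below. The cosine similarity measure, the `low` and `high` thresholds, the function names and the toy vectors are all assumptions made for illustration; the abstract does not specify how similarity is measured or where the cut-offs lie.

```python
import math
import random
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_hard_negatives(
    positive: Sequence[float],
    candidates: Sequence[Sequence[float]],
    low: float = 0.3,   # hypothetical lower bound: below this, "completely irrelevant"
    high: float = 0.8,  # hypothetical upper bound: above this, "highly related"
    n: int = 2,
    seed: int = 0,
) -> List[int]:
    """Return indices of up to n candidates whose similarity lies in [low, high]."""
    rng = random.Random(seed)
    order = list(range(len(candidates)))
    rng.shuffle(order)  # visit candidates in random order, mimicking random generation
    picked: List[int] = []
    for i in order:
        if low <= cosine(positive, candidates[i]) <= high:
            picked.append(i)
            if len(picked) == n:
                break
    return sorted(picked)


# Hypothetical 2-D embeddings standing in for real text embeddings.
positive = [1.0, 0.0]
candidates = [
    [1.0, 0.05],   # nearly identical to the positive: too similar, rejected
    [0.6, 0.8],    # similarity 0.6: inside the band
    [0.0, 1.0],    # similarity 0.0: irrelevant, rejected
    [1.0, 1.0],    # similarity about 0.71: inside the band
    [-1.0, 0.0],   # similarity -1.0: irrelevant, rejected
]
print(select_hard_negatives(positive, candidates))
```

In a real setting the vectors would come from the text embedding model being fine-tuned, and the band would need to be tuned on held-out data.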

Topics

Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education
