Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Ontology- and LLM-based data harmonization for federated learning in healthcare
2
Zitationen
8
Autoren
2026
Jahr
Abstract
Introduction: Semantic heterogeneity across electronic health records (EHRs) limits scalable and privacy-preserving analytics in healthcare. While federated learning (FL) enables collaborative modeling without sharing raw data, it requires consistent, ontology-aligned representations. We present an ontology- and large language model (LLM)-based data harmonization approach to support secure, interoperable FL workflows. Methods: We propose a general two-step pipeline for converting or annotating clinical text into a predefined target ontology format. First, candidate concepts are retrieved from the target vocabulary using embedding-based similarity search or ontology cross-references. Second, an LLM acts as a semantic validator, accepting or rejecting candidates based on explicit equivalence or subsumption criteria. The approach is ontology-agnostic and configurable; mapping to MONDO and HPO is demonstrated as a real-world use case. Final accepted mappings were evaluated against independent human expert assessment. Results: Across two clinical datasets, expert-LLM agreement reached up to 92%, with overall performance ranging from 78% to 91% depending on candidate-generation strategy. Retrieval alone was insufficient for reliable mapping, whereas LLM-based validation substantially improved precision while complementary retrieval strategies improved recall. Discussion: The proposed pipeline transforms ontology-based harmonization from a manual expert task into a reusable and configurable workflow suitable for federated healthcare research. By combining high-recall retrieval with LLM-based semantic adjudication, the approach enables scalable, privacy-preserving conversion of heterogeneous clinical text into standardized representations across domains.
Ähnliche Arbeiten
"Why Should I Trust You?"
2016 · 14.714 Zit.
Coding Algorithms for Defining Comorbidities in ICD-9-CM and ICD-10 Administrative Data
2005 · 10.546 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.940 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.543 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8.051 Zit.