
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Ontology- and LLM-based data harmonization for federated learning in healthcare

2026 · 2 citations · Frontiers in Digital Health · Open Access

Citations: 2 · Authors: 8 · Year: 2026

Abstract

Introduction: Semantic heterogeneity across electronic health records (EHRs) limits scalable and privacy-preserving analytics in healthcare. While federated learning (FL) enables collaborative modeling without sharing raw data, it requires consistent, ontology-aligned representations. We present an ontology- and large language model (LLM)-based data harmonization approach to support secure, interoperable FL workflows.

Methods: We propose a general two-step pipeline for converting or annotating clinical text into a predefined target ontology format. First, candidate concepts are retrieved from the target vocabulary using embedding-based similarity search or ontology cross-references. Second, an LLM acts as a semantic validator, accepting or rejecting candidates based on explicit equivalence or subsumption criteria. The approach is ontology-agnostic and configurable; mapping to MONDO and HPO is demonstrated as a real-world use case. Final accepted mappings were evaluated against independent human expert assessment.

Results: Across two clinical datasets, expert-LLM agreement reached up to 92%, with overall performance ranging from 78% to 91% depending on candidate-generation strategy. Retrieval alone was insufficient for reliable mapping, whereas LLM-based validation substantially improved precision while complementary retrieval strategies improved recall.

Discussion: The proposed pipeline transforms ontology-based harmonization from a manual expert task into a reusable and configurable workflow suitable for federated healthcare research. By combining high-recall retrieval with LLM-based semantic adjudication, the approach enables scalable, privacy-preserving conversion of heterogeneous clinical text into standardized representations across domains.
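
The two-step pipeline described in the abstract (candidate retrieval followed by semantic validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the vocabulary entries and their `EX:` identifiers are hypothetical placeholders rather than real MONDO or HPO records, the bag-of-words `embed` function stands in for a real embedding model, and `stub_validator` stands in for the LLM call by accepting a candidate only when the token sets match exactly.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical target vocabulary; IDs and labels are placeholders,
# not real MONDO/HPO records.
TARGET_VOCAB: Dict[str, str] = {
    "EX:0001": "type 2 diabetes mellitus",
    "EX:0002": "type 1 diabetes mellitus",
    "EX:0003": "hypertension",
}


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_candidates(term: str, vocab: Dict[str, str], top_k: int = 2) -> List[str]:
    """Step 1: rank vocabulary concepts by similarity and keep the top_k with score > 0."""
    query = embed(term)
    scored = sorted(
        ((cosine(query, embed(label)), cid) for cid, label in vocab.items()),
        reverse=True,
    )
    return [cid for score, cid in scored[:top_k] if score > 0]


def stub_validator(term: str, label: str) -> bool:
    """Step 2 stand-in for the LLM: accept only if the token sets are identical.

    A real validator would prompt an LLM with explicit equivalence or
    subsumption criteria and parse an accept/reject answer.
    """
    return set(term.lower().split()) == set(label.lower().split())


def harmonize(
    term: str,
    vocab: Dict[str, str],
    validator: Callable[[str, str], bool],
    top_k: int = 2,
) -> List[str]:
    """Run retrieval, then keep only the candidates the validator accepts."""
    candidates = retrieve_candidates(term, vocab, top_k)
    return [cid for cid in candidates if validator(term, vocab[cid])]


if __name__ == "__main__":
    term = "diabetes mellitus type 2"
    print(retrieve_candidates(term, TARGET_VOCAB))
    print(harmonize(term, TARGET_VOCAB, stub_validator))
```

In this toy setup, retrieval returns both diabetes entries for the query, and the validator step keeps only the matching one, which mirrors the division of labor in the abstract: retrieval supplies candidates, validation filters them. Passing the validator as a callable keeps the second step swappable, so the stub could be replaced by an actual LLM call without touching the retrieval code.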

Related works