Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Large Language Models for Synthetic Tabular Health Data: A Benchmark Study

2024·3 Zitationen·Studies in health technology and informaticsOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2024

Jahr

Abstract

Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. This is essential for diagnostic and treatment advancements. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLM models of the Pythia LLM Scaling Suite with varying model sizes ranging from 14M to 1B, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks to make predictions on the real-world data. Our findings indicate that as the number of parameters increases, LLM models outperform the reference GAN model. Even the smallest 14M parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLM models for synthetic tabular data generation.

Autoren

Institutionen

Bern University of Applied Sciences(CH)

Themen

Machine Learning in HealthcareTopic ModelingArtificial Intelligence in Healthcare and Education

Volltext beim Verlag öffnen

Large Language Models for Synthetic Tabular Health Data: A Benchmark Study

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen