A text-to-tabular approach to generate synthetic patient data using LLMs

Article

2025, 18 June

Publisher: IEEE Xplore
Authors: Margaux Tornqvist, Jean-Daniel Zucker, Tristan Fauvel, Nicolas Lambert, Mathilde Berthelot, Antoine Movschin

Abstract

Access to large-scale high-quality healthcare databases is key to accelerate medical research and make insightful discoveries about diseases. However, access to such data is often limited by patient privacy concerns, data sharing restrictions and high costs. To overcome these limitations, synthetic patient data has emerged as an alternative. However, synthetic data generation (SDG) methods typically rely on machine learning (ML) models trained on original data, leading to the data scarcity problem. We propose an approach to generate synthetic tabular patient data that does not require access to the original data, but only a description of the desired database. We harness prior medical knowledge and the in-context learning capabilities of large language models (LLMs) to perform zero-shot generation of realistic patient data, even in low-resource settings. We quantitatively evaluate our approach against state-of-the-art SDG Models, using fidelity, privacy, and utility metrics. Our results show that while LLMs may not match the performance of state-of-the-art models trained on the original data, they effectively generate realistic patient data with well-preserved clinical correlations. An ablation study highlights key elements of our prompt that contribute to the generation of high-quality synthetic patient data. This approach, which is easy to use and does not require original data or advanced ML skills, is particularly valuable for quickly generating custom-designed patient data, supporting project implementation, and providing educational resources.

Let’s bring science to impact together

Whether you’re interested in our work, looking to co-publish, or exploring to explore how
our insights can support your objectives, our team is here to connect.