1. Introduction
A significant challenge in healthcare research is the scarcity and often suboptimal quality of medical data, particularly in resource-limited regions. Limited infrastructure, funding, and research capacity in these areas hinder the collection of comprehensive patient datasets. Moreover, the prevalence of specific medical conditions may vary geographically, resulting not only in a lack of data for particular diseases but also in the presence of biases within the data. This disparity exacerbates the inequities between well-resourced healthcare institutions and those serving marginalized communities [1,2]. This uneven distribution of medical data intensifies the gap in healthcare research and medical innovation. Hospitals and research centers with access to extensive datasets can conduct thorough testing and develop effective treatments, while insufficient data constrain others from pursuing similar endeavors.
Addressing the scarcity of medical data requires innovative strategies. One approach involves inter-institutional data sharing, often hindered by stringent data privacy regulations. Data anonymization techniques are used to remove identifiers and standardize shared data to mitigate these restrictions [3]. However, these methods can introduce biases or distortions and may compromise data utility by removing or obscuring sensitive or unique information. Additionally, encrypted data sharing, while a widely used privacy-preserving solution, is not without its risks. Encrypted data are vulnerable to security threats, including man-in-the-middle attacks, key compromises, and the potential for re-identification when auxiliary information is available [4,5]. Such vulnerabilities highlight the inherent risks of transmitting real patient data, even when encrypted. These challenges underscore the importance of developing methods that avoid transmitting sensitive information entirely.
In recent years, Synthetic Data Generation (SDG) has emerged as a promising alternative, generating artificial patient data that mimics real data while preserving privacy. Synthetic data can augment existing datasets and facilitate research without compromising patient confidentiality [6,7]. Generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), offer powerful tools for SDG. GANs [8,9], while capable of generating diverse data, have limitations in handling mixed data types and specific feature values, particularly with imbalanced datasets. Conditional GANs, such as CTGAN [10], address these challenges using a conditional vector to specify desired labels. VAEs, on the other hand, provide a probabilistic framework for data generation, offering flexibility in handling complex data distributions. Among VAEs, a recent extension of TVAE [10] with a Bayesian Gaussian mixture model (BGM), VAE-BGM [11], demonstrates superior performance in generating high-quality synthetic tabular data. These models effectively capture the underlying structure and distribution of real-world data, enabling the creation of realistic, anonymized patient data for research and analysis. By leveraging generative artificial intelligence models, healthcare institutions can overcome the limitations of real-world data and accelerate medical research while protecting patient privacy. However, the effectiveness of these models and SDG hinges on the quality and quantity of the underlying real data. If an institution lacks sufficient data to generate reliable synthetic data, the process may require tailored approaches to minimize the impact of limited sample sizes [12].
Federated Learning (FL) [13] has emerged as a promising framework for collaborative Machine Learning (ML), particularly in scenarios where data privacy is a significant concern. While traditionally employed to enhance model performance and generalization, FL’s potential for improving SDG has garnered increasing attention. By enabling institutions to train models locally on their private datasets and aggregate the learned parameters, FL facilitates decentralized SDG. This approach not only protects data privacy but also leverages the diversity of data across institutions to augment model performance and generalization. FL can be particularly advantageous for institutions with limited data, as they can benefit from the collective intelligence of a more extensive network of institutions. Traditional FL techniques, such as FedSGD and FedAvg [14], are particularly effective when institutions have similar data distributions (i.e., independent and identically distributed (IID) data). However, data heterogeneity is prevalent in real-world medical contexts due to population differences and institution-specific practices. These disparities can lead to biased models if not adequately addressed [15,16]. Consequently, developing techniques that can effectively handle non-IID data is imperative for achieving the full potential of FL in medical research.
Researchers have explored various data-level techniques to mitigate the challenges posed by data heterogeneity in FL [17,18]. These techniques include private data processing (e.g., data collection, filtering, cleaning, and augmentation) and leveraging external data through knowledge distillation or unsupervised representation learning. For instance, Federated Distillation (FD) methods, such as Federated Augmentation (FAug) [19], provide innovative solutions that enhance the quality of SDG. FD enables more flexible knowledge transfer between clients and the server, surpassing the limitations of only sharing model parameters. FAug, in particular, tackles data heterogeneity by generating synthetic data to augment local datasets, allowing them to resemble IID distributions. Another notable method, Astraea [20], collects local data distributions and performs data augmentation based on global distributions to alleviate imbalance. By rearranging the training of clients based on KL divergence, Astraea ensures that local models are trained on more representative data. However, it is essential to emphasize that all these methods, along with other emerging techniques such as those in [21,22], are primarily designed for supervised learning contexts. These approaches focus on improving model training and parameter optimization by leveraging labeled data, offering significant insights into addressing data heterogeneity in supervised tasks using FL. However, they do not directly tackle the challenges associated with SDG. Thus, while FL remains a promising framework for SDG in heterogeneous environments, tailored strategies are required to address the distinct challenges of training a deep generative model that takes advantage of the data held by different data centers.
Building upon the existing literature, this paper proposes to address data heterogeneity in FL specifically for SDG. While this method could be compared to data augmentation, we move beyond its capabilities by focusing on generating entirely new synthetic patient data rather than transforming existing samples. Traditional data augmentation typically takes an existing dataset and applies modifications, such as rotation, scaling, or cropping in images, to create additional, varied instances within the same data collection. This transformation approach increases the dataset’s diversity but does not introduce fundamentally new information, as it only reuses the original samples. In contrast, SDG, as applied in this study, involves creating entirely new, artificial patient records that replicate the underlying patterns of real data without directly mirroring specific records. Instead of relying solely on local data augmentation or knowledge distillation, we explore the potential of sharing locally generated synthetic patients among participating institutions. By leveraging the collective knowledge and diverse data distributions across the federation, we hypothesize that Synthetic Data Sharing (SDS) can enhance the quality and representativeness of generated data for all institutions, particularly those with limited or biased datasets. SDS offers several advantages: (1) it can help institutions with insufficient data benefit from the more diverse and representative synthetic data generated by others; (2) it can improve the ability of models to generalize to unseen real-world scenarios by exposing them to a broader range of synthetic patient data; and (3) it can reduce the computational burden by reusing models with minimal retraining once synthetic patients have been generated. This method parallels Domain Randomized Search (DRS), a meta-learning approach where models are trained across tasks to generalize across diverse domains [23]. Similar to DRS, where data from multiple tasks are aggregated to improve generalization, our method uses synthetic patient data from different nodes to address the issue of data heterogeneity in FL. By sharing synthetic patients, institutions with low-quality data can benefit from the more diverse data generated at other nodes, potentially improving the overall model performance. This approach could be seen as meta-learning within the FL framework, where aggregated synthetic data helps balance the disparities between nodes.
Our research contributes to the field of FL by proposing and evaluating an SDG model within a heterogeneous data environment.
This paper is structured as follows. Section 2 outlines our methodology for the proposed SDG model within the FL framework. Section 3 presents the experimental setup, details the datasets used, and analyzes our results. Finally, Section 4 summarizes our findings, discusses the implications of our research, and proposes future research directions.
2. Materials and Methods
2.1. VAE-BGM Model
A novel approach to synthetic tabular data generation is introduced in [11], which integrates a BGM within the framework of a VAE. This approach addresses the limitations observed in existing models like CTGAN and TVAE [10]. While these earlier models demonstrate strong performance for certain data types, the VAE-BGM model offers superior results, particularly in capturing the complexity of real-world tabular data.
The model’s core innovation is the use of a Gaussian mixture model (GMM) to model the VAE’s latent space. More specifically, the model leverages a BGM, a type of GMM that offers greater flexibility. Unlike traditional GMMs, which require a pre-specified number of components, the BGM allows the model to automatically determine the appropriate number of components. This flexibility is essential for accurately capturing the complexity of real-world data, as it allows the model to adapt dynamically to the underlying data distribution. By integrating the BGM, the model avoids the restrictive assumption of a purely Gaussian latent space, which is common in models like TVAE. Instead, the BGM enables the model to handle more complex, non-Gaussian latent structures. This is achieved through a Dirichlet process that adjusts the number of Gaussian components in the mixture, allowing the model to adapt to the specific data characteristics without requiring manual specification. As a result, the VAE-BGM model provides a more nuanced and accurate latent representation, making it particularly effective for handling complex tabular datasets where simple Gaussian assumptions are insufficient.

In addition to improving the latent space representation, the model excels in handling mixed data types, including continuous and discrete features. By permitting various differentiable distributions for individual features, the model ensures that the specific characteristics of different data types are preserved during the data generation process. This makes the VAE-BGM particularly suitable for applications in healthcare, where datasets often contain diverse information, ranging from binary indicators to continuous measurements.
Another key advantage of this approach is its ability to generate synthetic data that better reflects the marginal and joint distributions of the original data. Traditional VAEs are constrained by the Kullback–Leibler (KL) divergence term in the loss function, which enforces a Gaussian prior in the latent space and limits the model’s ability to capture more complex data structures. Integrating a GMM into the already learned latent space overcomes this limitation, allowing for a more accurate sampling process that reflects the true diversity of the data. This enhancement leads to the generation of synthetic data that resembles the real data more closely and retains crucial feature correlations, improving its utility for downstream ML tasks.
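As a concrete illustration of this idea, the following sketch (our own illustration, not the implementation of [11]; it assumes scikit-learn and a matrix `latent_codes` of encoder outputs saved beforehand) fits a Dirichlet-process BGM to an already learned latent space, letting the data determine how many mixture components are effectively used:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# latent_codes: (n_samples, latent_dim) matrix of encoder outputs (hypothetical file).
latent_codes = np.load("latent_codes.npy")

# A Dirichlet-process prior lets the mixture "switch off" unnecessary components:
# n_components is only an upper bound, not the number that will effectively be used.
bgm = BayesianGaussianMixture(
    n_components=20,                                   # upper bound on K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
)
bgm.fit(latent_codes)

# Components with negligible mixing weight are effectively pruned by the prior.
effective_k = np.sum(bgm.weights_ > 1e-2)
print(f"Effective number of Gaussian components: {effective_k}")
```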
The architecture of the proposed model follows the typical VAE design, consisting of an encoder and a decoder. The encoder learns a latent representation $\mathbf{z}$ of the input data $\mathbf{x}$. This latent representation is assumed to be Gaussian. The encoder aims to learn a variational distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ that is as close as possible to the true posterior distribution $p(\mathbf{z} \mid \mathbf{x})$. This is achieved by maximizing the Evidence Lower Bound (ELBO), a lower bound on the marginal log-likelihood of the data, defined in [11] as

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{KL}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right), \qquad (1)$$

where $D_{KL}(\cdot \,\|\, \cdot)$ represents the KL divergence. The derivation of the ELBO is critical for understanding the VAE framework. A detailed step-by-step derivation is provided in Appendix A, where it is demonstrated that $\mathcal{L}(\theta, \phi; \mathbf{x})$ in Equation (1) coincides with the ELBO expression in Equation (A5).
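For illustration, the negative ELBO of Equation (1) can be written as a PyTorch loss under the common assumptions of a standard Gaussian prior and a Gaussian (mean-squared-error) reconstruction term; this is a generic sketch rather than the exact loss used in [11]:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO for a VAE with a standard Gaussian prior.

    x         : original batch of inputs
    x_recon   : decoder reconstruction of x
    mu, logvar: parameters of the Gaussian variational posterior q(z|x)
    """
    # Reconstruction term E_q[log p(x|z)], approximated here with an MSE loss
    # (a Gaussian likelihood with fixed variance); other feature-wise likelihoods
    # can be substituted for discrete columns.
    recon = F.mse_loss(x_recon, x, reduction="sum")

    # KL(q(z|x) || N(0, I)) has a closed form for two Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Minimizing this quantity maximizes the ELBO of Equation (1).
    return recon + kl
```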
On the other hand, the decoder learns the conditional distribution $p_\theta(\mathbf{x} \mid \mathbf{z})$ to generate realistic data points from the latent space. To improve the flexibility of the latent space representation, the BGM is applied to the learned latent space $\mathbf{z}$, refining it into $\mathbf{z}'$. The BGM models the latent space as originating from a mixture of $K$ Gaussian distributions, each characterized by a mean vector $\boldsymbol{\mu}_k$, a covariance matrix $\boldsymbol{\Sigma}_k$, and a mixing coefficient $\pi_k$. This allows for a more complex and multi-modal representation of the latent space, enabling the model to capture intricate data distributions. The probability density of a point $\mathbf{z}$ in the latent space is defined as follows:

$$p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(\mathbf{z} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),$$

where $\mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the probability density function of a multivariate Gaussian for each of the $K$ components. The expectation–maximization algorithm estimates these parameters, giving the model a more flexible latent representation that captures richer structures than a single Gaussian and, in turn, more complex data distributions.
Figure 1 illustrates the schematic process of the VAE-BGM model.
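The generation step can then be sketched as follows, assuming `bgm` is the fitted mixture from the earlier snippet and `decoder` is the trained VAE decoder (the full pipeline in [11] would also invert any feature-wise preprocessing applied before training):

```python
import torch

def generate_synthetic(bgm, decoder, n_samples=1000):
    """Draw latent points from the fitted Bayesian Gaussian mixture and decode them."""
    # Sample z ~ p(z) = sum_k pi_k N(z | mu_k, Sigma_k) instead of a single Gaussian prior.
    z_samples, _ = bgm.sample(n_samples)          # returns (samples, component_labels)
    z = torch.as_tensor(z_samples, dtype=torch.float32)

    # Map latent samples back to data space with the decoder.
    with torch.no_grad():
        synthetic = decoder(z)
    return synthetic.numpy()
```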
Given its ability to handle complex data distributions and mixed data types, and to generate high-quality synthetic data, the VAE-BGM model presents a compelling approach to synthetic tabular data generation. For these reasons, we have adopted it as the generative model for our research.
2.2. FL Integration in SDG
Traditionally, ML models are trained in a centralized way, where all data are gathered in a single location. However, such centralization raises significant security concerns, especially in domains such as healthcare, where personal data are involved. FL offers a decentralized approach to ML, enabling the training of a shared model across multiple institutions (nodes) without the need to centralize their data. This paradigm is rooted in three core principles:
Distributed Data: Training data are partitioned across various clients, preserving data locality and privacy.
Privacy Preservation: FL mitigates privacy concerns by training models locally on each node and sharing only model updates rather than raw data.
Model Aggregation: Model updates from all nodes are aggregated to create a global model that captures the knowledge from distributed data sources.
This study simulates an FL environment comprising multiple nodes to generate synthetic data. In the context of SDG, using FL leverages decentralized data to create synthetic datasets that maintain the statistical properties of real-world data while protecting the privacy of the individuals involved. SDG has been widely explored in isolated settings, but challenges remain when considering data scarcity or heterogeneity across different institutions or geographical regions. This is where FL enters as a potential solution: by leveraging decentralized datasets in a privacy-preserving way, FL allows institutions to collaborate and generate synthetic data without transferring real patient records. In an FL context, SDG can be conducted across multiple distributed nodes with local, sensitive datasets. Rather than centralizing data, FL allows each node to train a generative model (VAE-BGM in this study) locally. The model parameters (not the data itself) are then shared with a central server, where they are aggregated to form a global generative model. This global model can then generate synthetic data that encapsulate the diverse statistical properties of data from all participating nodes.
2.3. Information Aggregation Techniques
In the proposed FL framework, we explore two techniques to train the VAE-BGM models across scarce and heterogeneous data environments: FedAvg and SDS. This section will explain both algorithms, detailing their mechanisms and how they address the challenges of non-IID data distribution across nodes. Comparing these two methods aims to clarify their respective advantages in improving SDG under the constraints of FL.
2.3.1. Federated Averaging
FedAvg, introduced by [14], is a foundational method in FL and serves as a baseline in this study to evaluate the performance of advanced approaches such as SDS.
The process begins with initializing a shared model architecture, such as the VAE-BGM, that is consistent across all nodes to ensure compatibility during aggregation. Each node $i$ trains this model locally on its dataset, producing updated parameters $\theta_i$. These local updates are then sent to a central server, where they are aggregated using a weighted averaging approach. The contribution of each node is proportional to its data size $n_i$, and the global model is updated as

$$\theta_{\text{global}} = \sum_{i=1}^{N} \frac{n_i}{n}\, \theta_i,$$

where $n = \sum_{i=1}^{N} n_i$ represents the total number of samples across all $N$ nodes. The updated global model is then distributed back to the nodes for further refinement in iterative rounds until a stopping criterion, such as convergence or a predefined number of iterations, is met. This iterative process allows FedAvg to combine distributed knowledge effectively while preserving data privacy.
Figure 2 illustrates the overall FedAvg process, highlighting the interaction between local model training and global aggregation.
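For reference, the aggregation step can be sketched as a weighted average over PyTorch state dictionaries; this is an illustrative snippet with a hypothetical helper name rather than the exact code used in our experiments:

```python
import copy
import torch

def fedavg_aggregate(local_states, sizes):
    """Weighted average of local parameter sets: theta_global = sum_i (n_i / n) * theta_i."""
    total = float(sum(sizes))
    global_state = copy.deepcopy(local_states[0])

    for key in global_state.keys():
        # Accumulate each node's parameters weighted by its share of the total samples.
        global_state[key] = torch.stack([
            (n_i / total) * state[key].float()
            for state, n_i in zip(local_states, sizes)
        ]).sum(dim=0)
    return global_state

# Usage (hypothetical): broadcast global_state back to every node, load it with
# model.load_state_dict(global_state), and repeat for the desired number of rounds.
```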
2.3.2. Synthetic Data Sharing
FedAvg has proven effective in many FL settings but can face challenges when applied to non-IID data [16,25]. Non-IID data are associated with scenarios where the data stored across different nodes are highly heterogeneous, leading to biased models or poor convergence. Differences in data distributions across nodes can result in variations in the local generative models. This can complicate the aggregation process, making it challenging to generate synthetic data that accurately represent the combined dataset.
Our proposal, SDS, is a technique that can address this issue: sharing synthetic patients generated locally at each node. We leverage the SDG model we intend to train at each node to generate data and enhance model performance when data are non-IID. This approach draws upon the meta-learning paradigm, specifically DRS, as introduced by [23], which approximates Model-Agnostic Meta-Learning (MAML). Meta-learning, often described as “learning to learn” [26], focuses on training algorithms that can generalize efficiently across tasks. By learning from related tasks, meta-learning models can rapidly adapt to new tasks with minimal data, addressing scenarios where large datasets are unavailable. MAML [27], one of the most significant meta-learning methods, focuses on finding a set of initial parameters $\theta$ that can be fine-tuned with minimal data for new tasks. MAML optimizes task-specific and meta-parameters through bi-level optimization, allowing rapid adaptation to new, unseen tasks. However, this bi-level optimization presents significant computational complexity, making it less suitable in environments constrained by resources or data. In contrast, DRS approximates MAML’s generalization goal while reducing computational demands. Instead of performing a bi-level optimization, DRS aggregates data across tasks and trains the model directly on this aggregated dataset, resulting in a more resource-efficient solution. DRS thus achieves the same goal as MAML, generalization across tasks, by simplifying the learning process through direct training on aggregated data.

Let a task instance $\mathcal{T}_i$ represent a tuple containing a dataset $\mathcal{D}_i$ and its corresponding loss function $\mathcal{L}_i$. Solving this task entails finding the optimal task-specific parameters $\theta_i^{*}$ that minimize the loss $\mathcal{L}_i$ for the particular dataset $\mathcal{D}_i$. Thus, the parameters $\theta_i$ of each SDG model are optimized according to the following equation:

$$\theta_i^{*} = \arg\min_{\theta}\; \mathcal{L}_i\!\left(\theta;\; \mathcal{D}_i^{\text{real}} \cup \bigcup_{j \in \mathcal{N} \setminus \{i\}} \mathcal{D}_j^{\text{syn}}\right),$$

where $\mathcal{D}_i^{\text{real}}$ denotes the real, local dataset of node $i$, and $\mathcal{D}_j^{\text{syn}}$ signifies the synthetic datasets shared by the other nodes $j \in \mathcal{N} \setminus \{i\}$, with $\mathcal{N} \setminus \{i\}$ denoting all nodes except node $i$. This method optimizes the model by minimizing the loss across the aggregated data, including the real, local data from node $i$ and the synthetic data from the other nodes. This approach mitigates the negative impact of non-IID data by leveraging synthetic data from multiple nodes to create a more representative and diverse training set. The ability of SDS to aggregate synthetic patient data from different nodes aligns with meta-learning principles, allowing the model to generalize effectively across varying data distributions and enhancing model convergence in FL environments.

A point worth emphasizing is the suitability of DRS over MAML in scenarios with a limited number of tasks (nodes). FL environments typically involve fewer data providers than more generalized ML setups. This small number of nodes can lead to situations where DRS, by aggregating synthetic data from these nodes, outperforms MAML. The reasoning behind this is grounded in the computational efficiency of DRS: unlike MAML, which requires bi-level optimization over multiple tasks, DRS aggregates data across tasks in a single round of optimization, making it less computationally intensive [23].
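In code, this objective amounts to training each node’s generative model on its real data concatenated with the synthetic datasets received from the other nodes. The sketch below is a simplified illustration with a hypothetical `model_factory` returning a VAE-BGM-like object exposing `fit` and `sample`; the real pipeline would also align preprocessing across nodes:

```python
import pandas as pd

def train_with_sds(node_id, real_data, shared_synthetic, model_factory, n_generate=1000):
    """Train node `node_id`'s generative model on its real data plus the synthetic
    data shared by every other node, following the SDS objective above."""
    # Synthetic datasets D_j^syn received from all nodes j != node_id.
    external = [df for j, df in shared_synthetic.items() if j != node_id]

    # Aggregated training set: D_i^real  U  { D_j^syn }_{j != i}.
    augmented = pd.concat([real_data, *external], ignore_index=True)

    # Fit the local generative model (hypothetical interface) on the augmented dataset.
    model = model_factory()
    model.fit(augmented)

    # The retrained model can in turn generate synthetic patients to share back.
    return model, model.sample(n_generate)
```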
Thus, SDS provides the FL model with a richer and more diverse dataset, improving the quality and representativeness of the generated synthetic data, particularly in nodes where the data are scarce or biased. Unlike traditional parameter aggregation methods, SDS directly introduces additional information from other nodes, potentially improving convergence and mitigating the negative effects of data heterogeneity. The SDS process comprises the following three steps:
Local SDG: Each node initializes its local VAE-BGM model and trains it until synthetic data are generated based on the learned latent representation of the model. This aligns with the DRS strategy of generating data across domains (nodes) to capture domain-specific features.
Synthetic Data Sharing and Aggregation: Similar to DRS, synthetic data from each node are shared with other nodes. This aggregated data forms a more diverse and representative training dataset, mitigating the effects of data heterogeneity.
Model Training with Augmented Data: Each node trains its local VAE-BGM model using the augmented dataset, including real and synthetic data. This process, akin to DRS’s task-based aggregation, leverages the diversity in the shared synthetic data to improve model performance. The training continues until the model converges.
This approach improves convergence and mitigates the negative impact of non-IID data by leveraging the diversity of synthetic data from multiple nodes. Sharing synthetic data instead of generative models further optimizes the system, making SDS a highly efficient option for complex FL scenarios.
Figure 3 depicts the explained process.
FedAvg and SDS can operate without a central server in this revised approach, allowing direct communication between nodes. However, a notable distinction lies in the number of communication rounds necessary for model convergence. FedAvg typically requires multiple parameter updates and aggregation rounds to achieve optimal accuracy. This iterative process involves continuous communication between nodes, which can be a limitation in bandwidth-constrained environments. In contrast, SDS could theoretically be executed in a single communication round. Although this study does not apply a single-round strategy, the potential for such an approach exists. By sharing synthetic patient data only once, significant improvements in model performance could be obtained, particularly in scenarios where communication resources are limited. This single-round communication would drastically reduce overhead compared to FedAvg, which depends on numerous rounds of sharing and aggregating model parameters. Additionally, because SDS shares synthetic data rather than the generative model itself, including the decoder and the BGM-derived parameters, it enhances communication efficiency compared to FedAvg. Thus, SDS offers a more scalable and bandwidth-efficient solution for real-world FL applications, especially when communication restrictions are a significant concern.
4. Conclusions
This research underscores the effectiveness of FL for SDG in healthcare, particularly in addressing the challenges posed by heterogeneous and scarce data distributions. By employing VAE-BGM models across diverse medical datasets, this study demonstrates that SDS consistently outperforms traditional approaches like FedAvg and isolated training in both IID and non-IID scenarios. A key strength of SDS lies in its ability to expose nodes to diverse synthetic samples, effectively approximating a more IID-like environment even in non-IID settings. This results in significant advantages for generating high-quality synthetic data, as reflected in lower values, and supports robust model performance across nodes. Clinical utility validation confirms the practicality of synthetic data generated using SDS, achieving comparable accuracy to real data in downstream tasks. In non-IID environments, SDS proves particularly robust, addressing the challenges of unevenly distributed data among institutions by leveraging the diversity of synthetic samples to enhance representativeness and mitigate the negative effects of heterogeneity. In contrast, FedAvg demonstrates limited improvements in these scenarios, often failing to match SDS’s effectiveness, particularly in nodes with constrained data availability or skewed distributions.
These findings highlight the potential of sharing synthetic data within FL frameworks. By fostering data diversity and reducing the disparities between data-rich and data-poor nodes, SDS enables improved model generalization and supports collaborative research without exposing sensitive patient information. This approach not only bridges gaps in data accessibility and quality but also sets a foundation for advancing medical research and innovation in under-resourced regions. Future work should continue exploring the role of synthetic data in FL, focusing on increasingly heterogeneous and imbalanced data distributions to further validate and refine the methodology.
Future research should explore optimizing FL architectures for even more complex data types and more extensive networks of institutions, further refining the integration of SDG with FL to maximize efficiency and scalability. In particular, exploring SDG in low-sample settings as in [12] could be highly beneficial. This approach, which integrates meta-learning (like DRS) and transfer learning techniques into SDG, could be adapted to augment FL environments, where leveraging knowledge from previously trained models or similar tasks could significantly enhance the quality of synthetic data in low-sample nodes. Furthermore, while the VAE-BGM model has already been compared with state-of-the-art tabular generative models (CTGAN and TVAE), future research should investigate additional architectures tailored for tabular data, further validating its performance. Promising approaches such as TabDDPM [34], a diffusion-based generative model designed for tabular data, could be evaluated to enhance the quality of synthetic data generation and improve the overall effectiveness of SDS. Expanding our evaluation to include datasets from other domains, such as financial markets [35] or sustainable energy [36], would demonstrate the broader applicability of SDS. Additionally, focusing on increasingly heterogeneous and imbalanced data distributions can further validate and refine the methodology. While this study varied the distribution of a single feature (BMI) across nodes, a more realistic setup could involve modifying multiple features to better emulate extreme heterogeneity. However, such modifications may limit direct comparisons with techniques like FedAvg, which rely on consistent feature sets across nodes. Exploring these scenarios independently of FedAvg could provide deeper insights into SDS’s performance under extreme variability. Exploring other data types, such as imaging or sequential data, could provide new opportunities to extend the methodology to domains requiring diverse data modalities. Incorporating these data types would further validate the flexibility and robustness of SDS in addressing challenges across a wide range of applications. Lastly, addressing privacy risks must be a future line of research on this topic. Investigating techniques to mitigate privacy risks associated with FL, such as differential privacy [37] or homomorphic encryption [38], can help protect sensitive patient data while enabling collaborative training. By pursuing these research directions, we can continue advancing the FL field for SDG in healthcare and develop more robust and effective methods for generating high-quality synthetic data.