1. Introduction
Deep Learning (DL) has proven fruitful for biomedical prediction tasks, but the risk of overfitting remains due to the limited size of most biomedical datasets. To mitigate overfitting in such low-data regimes, researchers often synthesize data using generative DL algorithms, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) [1,2]. However, the majority of such applications are geared towards classification tasks. For example, Chen et al. [3] review image synthesis for medical prediction tasks and note that most studies train a GAN for each class. Such methods are not directly applicable to regression tasks, but they nonetheless indicate the utility of GANs for data augmentation.
Likewise, researchers have demonstrated success in biomedical tasks with VAEs. For example, Doncevic and Herrmann [4] developed a VAE architecture with an interpretable latent space and decoder for medical applications. It enables the perturbation of input features to reveal changes in the activation of hidden nodes, thereby simulating the effects of genetic changes on a resulting phenotype as well as supporting drug response predictions [4]. Similarly, Papadopoulos and Karalis [5] employed a VAE framework to synthesize clinical study patient samples. Their results showed that including the synthetic data provides greater statistical power than using the original dataset alone [5].
Biomedical datasets often contain multiple modalities, such as genomics, imaging, clinician notes, peripheral blood tests, and audio recordings. A major limitation of current multi-modal models is that they often cannot make use of records with missing modalities [6]. Historically, predictive medicine models would simply discard records that lacked any of the desired modalities. Discarding such records drastically reduces the available sample size, which degrades both the quality of the predictions and the generalizability of the model.
Novel DL methods for modality translation on medical datasets exist in the literature, but such methods are highly task-specific to imaging modalities. As discussed by Armanious et al. [7], most methods focus on translating between computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET). In addition to being limited to these imaging types, they employ specialized architectures to account for the motion or jitter often seen in medical imaging. CycleGAN is another popular method for image-to-image translation; it trains two generators and two discriminators with a cycle-consistency loss [8]. Sandfort et al. [9] successfully applied CycleGAN to transform contrast CT images into non-contrast CT images and used the augmented dataset to improve segmentation performance. However, CycleGAN is primarily beneficial for color- or texture-type transformations, as it was originally developed to translate between two views of the same modality. Therefore, existing imaging-centric methods, including MedGAN [7] and CycleGAN [8], are not directly applicable to the important case of tabular data modalities that carry distinct information.
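For reference, the cycle-consistency term of [8] penalizes the round-trip reconstruction error of the two generators, $G: X \to Y$ and $F: Y \to X$, restated here from the original paper:

```latex
\mathcal{L}_{\mathrm{cyc}}(G,F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\lVert F(G(x)) - x \rVert_1\right]
+ \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\lVert G(F(y)) - y \rVert_1\right]
```

Because this objective only enforces that translations invert each other within a shared visual domain, it offers no mechanism for mapping between structurally distinct tabular modalities.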
Recently, Yang et al. [10] proposed a model that substantially surpasses CycleGAN in aligning single-cell RNA-seq and ATAC-seq data. They trained Autoencoders (AEs) for the modalities of interest and aligned the AE latent spaces through adversarial training with an additional discriminator network. They then paired the modality A encoder with the modality B decoder to align single-cell data. Similarly, Zhou et al. [11] trained an encoder for modality A along with a decoder for modality B to impute data in a cancer survival prediction task. However, these works considered neither stable training methods to regularize the Autoencoder latent spaces nor the integrated oversampling of synthetic data.
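To make this alignment strategy concrete, below is a minimal sketch of adversarial latent-space alignment in the spirit of [10]; all module names, dimensions, and the specific loss formulation are illustrative assumptions, not the authors' implementation:

```python
# Sketch of adversarial latent-space alignment (in the spirit of [10]).
# Two encoders map their own modality into a shared latent dimension; a
# discriminator is trained to guess the source modality of each latent vector,
# while the encoders are trained to fool it. All names/sizes are hypothetical.
import torch
import torch.nn as nn

LATENT_DIM = 64
enc_a = nn.Sequential(nn.Linear(1000, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
enc_b = nn.Sequential(nn.Linear(500, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
disc = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def alignment_losses(x_a: torch.Tensor, x_b: torch.Tensor):
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    ones, zeros = torch.ones(len(x_a), 1), torch.zeros(len(x_b), 1)
    # Discriminator step: label modality A latents 1 and modality B latents 0.
    d_loss = bce(disc(z_a.detach()), ones) + bce(disc(z_b.detach()), zeros)
    # Encoder step: swap the labels so the two latent distributions merge.
    g_loss = bce(disc(z_a), zeros) + bce(disc(z_b), ones)
    return d_loss, g_loss  # minimized by separate optimizers in alternation
```

Once the latent spaces are aligned in this way, pairing enc_a with a decoder for modality B yields the cross-modal translation used in [10,11].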
The ability to extend GAN- and VAE-based translation to different modalities and data types is essential to improving DL for predictive medicine. Here, we present data augmentation for Cross-Modal Variational Autoencoders (DACMVA), which builds upon the aforementioned related works by using VAEs to translate between different data modalities, including the critically important, but often neglected, tabular data types common in medicine. Specifically, DACMVA takes advantage of modality A to impute samples of modality B and vice versa. Such cross-modal imputation is particularly advantageous when there is a large imbalance in the sample counts between the two modalities. In addition, DACMVA carries the outcome value associated with the modality A sample over to the imputed modality B sample, circumventing the issue of imputing a continuous label. DACMVA demonstrates that regularized latent spaces in VAEs result in improved imputation quality over deterministic AEs. Furthermore, since prior related works have not integrated oversampling, its inclusion is another benefit of DACMVA.
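As a minimal sketch (assuming trained, latent-aligned networks enc_a and dec_b; all names are hypothetical placeholders), the imputation and label carry-over step amounts to the following:

```python
import torch

@torch.no_grad()
def impute_b_from_a(x_a, y_a, enc_a, dec_b):
    """Impute a modality-B sample from a modality-A sample and reuse its label."""
    z = enc_a(x_a)        # encode into the shared, aligned latent space
    x_b_hat = dec_b(z)    # decode with the modality-B decoder
    return x_b_hat, y_a   # carry over the continuous outcome label unchanged
```

For a VAE encoder, z would typically be the posterior mean (or a sample drawn from the posterior) rather than a deterministic code.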
In summary, this work presents DACMVA, a novel DL pipeline for data augmentation with Cross-Modal Variational Autoencoders. DACMVA demonstrates superior performance in the task of cancer survival prediction using tabular gene expression data. The key contributions of this study are the following:
DACMVA proposes a pipeline for training Variational Autoencoders (VAEs) for modalities A and B with aligned latent spaces. It incorporates strategies to improve stability during VAE training, thereby enabling regularized latent spaces.
DACMVA oversamples the imputed samples with hyperparameters that control the imputed batch size, a loss threshold for selecting imputed samples, and a weight for the loss on the imputed batches (a minimal sketch of this scheme follows this list). This flexible and tunable framework integrates with the cross-modal imputation method for simple, but effective, oversampling augmentation.
The role of the adversarial training strategy proposed by Yang et al. [10] for aligning the latent spaces is empirically investigated. In particular, we determine whether the augmentations created by adversarially trained Autoencoders significantly improve performance in the prediction task. Adversarial training incurs additional computational cost and training-stability challenges, which may be prohibitive for large, high-dimensional datasets; this analysis is therefore informative for many applications.
The novel DACMVA framework was applied across multiple augmentation methods for cancer survival prediction. To our knowledge, this study is the first to investigate the roles of oversampling and adversarial loss in data augmentation for cancer survival prediction. The results illustrate the ability of DACMVA to improve model predictions on multi-modal tabular biomedical data with continuous labels: the framework generates high-quality imputations and significantly improves task performance both on the full dataset and in low-data regimes.
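The following is a minimal sketch of the oversampling scheme referenced in the contributions above; the pool object, its sample method, and all default values are hypothetical placeholders rather than the paper's implementation:

```python
def training_step(model, task_loss_fn, real_batch, imputed_pool,
                  imp_batch_size=32, loss_threshold=0.1, imp_weight=0.5):
    """One task-training step augmented with a filtered, re-weighted imputed batch."""
    x, y = real_batch
    loss = task_loss_fn(model(x), y)
    # Draw imputed samples along with their recorded imputation losses.
    x_imp, y_imp, imp_losses = imputed_pool.sample(imp_batch_size)
    keep = imp_losses < loss_threshold   # retain only high-quality imputations
    if keep.any():
        loss = loss + imp_weight * task_loss_fn(model(x_imp[keep]), y_imp[keep])
    return loss
```

The three keyword arguments mirror the three hyperparameters named above: the size of the imputed batch, the loss threshold used to filter imputations, and the weight applied to the loss on the imputed batch.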
The presentation of DACMVA and its application to cancer survival prediction is organized as follows: Section 2 provides an overview of the DACMVA framework and details the methodology and the training procedure. Section 3 presents the results for the imputation quality and the performance in the multi-modal cancer survival prediction task using both the original dataset and the low-data regime setting. Finally, the conclusions and future directions for DACMVA are summarized in Section 4.