1. Introduction
Omics is the study of comprehensive datasets from a wide spectrum of biological materials, including genomes, transcriptomes, and proteomes. As such, omics is crucial in clinical diagnosis, drug development, and precision medicine, as it deals with the discovery of disease-specific genetic information and structural variation. With advances in high-throughput technologies, such as the BeadChip array and next-generation sequencing technologies, gene expression, isoform expression, and DNA methylation data have been mass-produced [
1,
2]. Although these omics data contain various pieces of information, certain sets exhibit association [
3,
4]. For instance, DNA methylation is associated with gene expression, and gene expression is associated with protein levels [
5].
While initial omics analyses utilized a single data type [
6,
7,
8,
9,
10], the advancement of computer technology and subsequent rapid progress in information-processing capacity has enabled the integrated analysis of various omics data (multi-omics), from genomes and transcriptomes to proteomes [
11,
12,
13]. Indeed, multi-omics is an emerging strategy to develop various models for target discovery, clinical diagnosis, disease subtyping, etc. [
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25].
Advanced technologies, such as machine learning (ML), have shown promise for application in multi-omics research [
26,
27]. ML enables the consideration of interactions across diverse datasets, allowing for the utilization of a broader range of information and enabling higher levels of accuracy. In this context, applying ML to multi-omics research offers opportunities to gain profound insights into diseases by analyzing interactions among omics from various perspectives. Active research is focused on multi-omics analyses with ML in brain disease, diabetes, cancer, and cardiovascular disease [
28,
29,
30,
31]. Nevertheless, the large number of feature dimensions relative to the number of samples in multi-omics analyses negatively impacts ML [
1]. Up to 20,000 protein-coding genes and 60,000 long non-coding RNAs have been identified based on GENCODE v7 [
32]. In addition, the most widely used DNA methylation platform enables the analysis of 96% of approximately 480,000 CpG sites [
33]. Using such high-dimensional data in ML could lead to overfitting (a consequence of the curse of dimensionality) and reduce model performance [
16].
To resolve this issue, dimensionality reduction based on domain knowledge, dimensionality reduction algorithms, and deep learning generative models are typically used in ML [
34,
35,
36,
37]. In particular, generative models learn how to approximate the distribution of data. The variational auto-encoder (VAE) [
38] is a generative model that extracts features of input data, stores them in latent vectors, and generates new data similar to the input data through latent vectors. Given that latent vectors can describe the input data well, dimensionality reduction based on VAE has been applied to various omics analyses [
37,
39].
Despite showing promise, VAE-based models have limitations. First, in multi-omics analyses, missing data from a given sample can prevent its use in learning, and a reduced sample size can lead to poor performance of learning algorithms. However, obtaining sufficient sample sizes is challenging in real-world scenarios due to difficulties in patient enrollment. Second, when omics data collected from different institutions vary in type, the performance of models is limited due to poor data integration, as all samples must possess the required feature values demanded by the ML model. Consequently, related studies have utilized large-scale open data, such as The Cancer Genome Atlas (TCGA)’s Pan-Cancer Atlas (PanCan) [
40]. However, when only a portion of omics data types are available, such as in the prospectively collected data from the Korean Sarcoma Biobank Innovation Consortium (SBIC), existing integrated analytical methods cannot be applied. Therefore, a method for efficiently integrating various omics data sources while preserving the independent features of each aspect is necessary [
41].
Integrating incomplete and complete data for analysis can lead to increased availability of samples and thus expectations of improved performance. Cao and Gao [
42] proposed GLUE (Graph-Linked Unified Embedding), a modular framework for integrating unpaired single-cell multi-omics data and inferring regulatory interactions simultaneously. By leveraging biological knowledge to explicitly model inter-layer regulatory interactions connecting hierarchical functional spaces with a knowledge-based graph (“guidance graph”), this model integrates and analyzes omics data from various sources, effectively capturing diverse cell states. Du et al. [
43] used missing masks to learn conditional distributions of unseen patterns and features. By explicitly learning the conditional distributions of specific masked features (or forms) when unmasked features (or modalities) were provided, they conducted an integrated analysis of omics data and achieved high accuracy in cross-domain translation tasks.
This study proposes a novel artificial intelligence (AI) model and learning strategies that can effectively utilize incomplete data frequently found in omics datasets for pan-cancer. In this paper, “complete data” refers to that in which all omics data to be utilized exist. For example, when using mRNA gene expression, mRNA isoform expression, and DNA methylation in multi-omics, the case where all three omics data types exist can be called “complete data”. “Incomplete data” refers to data in which some of the omics data to be utilized are missing. In the same multi-omics example as that used for “complete data”, a case in which DNA methylation data do not exist can be referred to as “incomplete data” (
Figure 1).
Thus, within the current study, a single AI model capable of handling complete and incomplete data was designed. To allow for the input of samples with missing omics data in the AI model, we selected the padding strategy to replace missing omics data with 0 (
Figure 2). For the omics data (mRNA gene expression, mRNA isoform expression, and DNA methylation), samples with all column (gene symbol) values of 0 did not appear in the actual data distribution (
Figures S1–S3). It was thus hypothesized that the AI model would recognize such values in the absence of actual data. Additionally, an AI model was designed to infer missing omics data by learning from the partial input omics data. If the omics data generated for the missing data are similar to the actual values, they can replace single data points that do not exist in multi-omics. In addition, if the generated omics data are used for embedding, the resulting distribution will resemble that of the embedding based on actual data. Finally, the utility of the novel approach against previous methods that disallow the use of incomplete data was validated. To this end, a phenotype prediction function was added to the model, and its performance was evaluated.
The proposed model consists of two key components: a VAE-based multi-omics generative model that can learn the genetic patterns of tumors based on different omics data and an expanded classification model that can predict cancer phenotypes. Comprehensive analyses validated the versatility of the model. In particular, data for a few samples or omics data with missing parts, such as in the Korean SBIC data, can be applied for learning. Furthermore, the newly developed method enables the generation of virtual genomic data that resemble actual genomic data based on inferences of missing omics data.
2. Materials and Methods
2.1. Data Source
2.1.1. TCGA Pan-Cancer (PanCan)
The open TCGA PanCan database integrates mutation data for over 10,000 samples of 33 cancer types. The TCGA PanCan data are widely used in precision medicine research, including in training AI models to discover cancer diagnostic markers.
This study used biospecimens and data from the SBIC for Biomedical and Healthcare Research. All PanCan data were downloaded from the University of California Santa Cruz (UCSC) Xena platform [
44]. Data were integrated after the final data preprocessing steps for each data type, and min–max normalization was applied to obtain values between 0 and 1.
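As a concrete illustration, the min–max scaling step can be sketched as follows (a minimal sketch; per-feature scaling and the handling of constant columns are assumptions, as the paper does not specify them):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each column (feature) of x to the [0, 1] range.

    Per-feature scaling is assumed here; columns with zero range are
    left at 0 to avoid division by zero.
    """
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    denom = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / denom

# Toy 3-sample x 2-feature matrix
m = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = min_max_normalize(m)
```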
2.1.2. Korea Biobank Network-SBIC
The SBIC, as part of the Korea Biobank Project, aims to construct a national registry to support research on rare cancers and promote translational research by integrating and accumulating data for various human-related materials and subsequently generating mutation data for sarcoma (SARC). The SBIC data were used in this study in collaboration with the consortium project to construct the Korean Sarcoma Biobank.
For use as input to the proposed model, the SBIC data, like the PanCan data, were integrated after the final data preprocessing steps, and min–max normalization was applied to obtain values between 0 and 1.
2.2. Data Types
2.2.1. mRNA Gene Expression Data
Gene expression is the process by which information from a gene is used to synthesize a functional gene product that produces end-products, namely protein or non-coding RNA, ultimately affecting cell phenotype. Many studies employ gene expression data to assess tumor states and for molecular profiling [
37,
45]. The PanCan mRNA gene expression data contain 60,498 gene symbols related to 10,535 samples; the SBIC mRNA gene expression data contain 25,268 gene symbols pertaining to 45 samples. In this study, we included only the 10,496 PanCan samples with phenotype data, as the proposed model performs a supervised learning task for the classification of cancer phenotypes. In addition, to amalgamate the PanCan and SBIC data, new integrated data were generated utilizing the 23,191 gene symbols shared between the two datasets. The fragments per kilobase of transcript per million (FPKM) values for the respective data were log-transformed using RNA-seq by expectation maximization (RSEM; log2(FPKM + 0.001)).
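The transform can be illustrated in a few lines (the pseudocount of 0.001 is as stated above; the vectorized NumPy form is an assumption):

```python
import numpy as np

def log_transform_fpkm(fpkm: np.ndarray) -> np.ndarray:
    """log2(FPKM + 0.001); the pseudocount keeps zero-expression values finite."""
    return np.log2(fpkm + 0.001)

# FPKM of 0 maps to log2(0.001), roughly -9.97
vals = log_transform_fpkm(np.array([0.0, 1.0]))
```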
2.2.2. mRNA Isoform Expression Data
Gene isoforms are mRNAs produced from the same locus with distinct transcription start sites, protein-coding DNA sequences, and/or untranslated regions that can alter gene function [
46]. Recent studies have shown that several gene isoforms directly function in tumor transformation and growth [
47]. The PanCan mRNA isoform expression data contain 197,045 gene symbols for 10,535 samples, and the SBIC mRNA isoform expression data contain 25,268 gene symbols for 45 samples. From the PanCan data, only 10,495 samples with phenotype data were included, and the 23,191 gene symbols shared with the SBIC dataset were used. The FPKM values for the respective data were log-transformed using RSEM (log2(FPKM + 0.001)).
2.2.3. DNA Methylation Data
DNA methylation is an epigenetic mechanism in which a methyl group is transferred to the C5 position of cytosine to form 5-methylcytosine. DNA methylation regulates gene expression by recruiting proteins involved in gene inhibition or by inhibiting the binding of transcription factors to DNA. Given that DNA methylation in a specific region can be directly related to oncogene expression, it is currently widely studied as a major indicator and target of cancer [
48]. The PanCan DNA methylation data contain 396,065 methylation IDs for 9639 samples; the SBIC data do not include DNA methylation data. From the PanCan data, only the 9602 samples with phenotype data were retained. In addition, methylation IDs containing NaN values were excluded, leaving 269,273 IDs. The respective data were downloaded from UCSC Xena. The DNA methylation data were reformatted from the original methylation-ID-based structure to a gene-symbol-based structure using the metadata. The beta value for each gene symbol was the mean value across its methylation IDs.
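The reformatting step, averaging beta values over the methylation IDs mapped to each gene symbol, might look as follows in pandas (the probe IDs, gene symbols, and values here are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical beta-value matrix: rows = methylation IDs, columns = samples,
# plus a metadata mapping from methylation ID to gene symbol.
beta = pd.DataFrame(
    {"sample_A": [0.2, 0.4, 0.9], "sample_B": [0.1, 0.3, 0.8]},
    index=["cg0001", "cg0002", "cg0003"],
)
id_to_gene = pd.Series({"cg0001": "TP53", "cg0002": "TP53", "cg0003": "BRCA1"})

# Drop probes containing missing values, then average beta values per gene symbol.
gene_beta = beta.dropna().groupby(id_to_gene).mean()
```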
2.2.4. Phenotype Data
The phenotype data comprised the combination (11,369 cases) of the samples in the PanCan and SBIC datasets, including various parameters from retrospective data, e.g., cancer type, sample type, primary site, tumor event, sex, race, and stage. The phenotypes used in the model learning and result analysis were cancer type, primary site, and sample type.
The cancer types investigated were adrenocortical carcinoma, bladder urothelial carcinoma, breast invasive carcinoma (BRCA), cervical squamous cell carcinoma, endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, lymphoid neoplasm diffuse large B-cell lymphoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma, acute myeloid leukemia, brain lower grade glioma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, SARC, skin cutaneous melanoma, stomach adenocarcinoma (STAD), testicular germ cell tumors, thyroid carcinoma, thymoma, uterine corpus endometrial carcinoma (UCEC), uterine carcinosarcoma, and uveal melanoma.
The primary sites investigated were adrenal gland, bile duct and gallbladder, bladder, bone marrow, brain, breast, cardia, cerebrum, cervix uteri, cortex of adrenal gland, esophagus, eye, head and neck, intestine, kidney, liver, lung, lymph nodes, oral, ovary, pancreas, peritoneum, pleura, prostate gland, skin, stomach, subcutaneous and soft tissues, testis, thymus, thyroid gland, tongue, and uteri.
The sample types investigated were additional–new primary, metastatic, primary blood-derived cancer–peripheral blood, primary tumor, recurrent tumor, and solid tissue normal types.
2.3. Padding
Three types of omics data were applied: mRNA gene expression, mRNA isoform expression, and DNA methylation data. Some samples exhibited only one or two types of omics data (
Table 1).
The padding strategy was applied for data processing and inference on each sample. For each omics data type, the padding set was defined as the difference between the union of samples across all omics data types and the set of samples for that omics data type. The padding value was unified as 0. The size of the sample union across all omics data was 11,369, and padding data were generated for 1034 samples missing gene expression, 1035 samples missing isoform expression, and 1973 samples missing methylation beta values.
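The padding-set computation described above amounts to simple set differences; a toy sketch (sample IDs are hypothetical):

```python
# Samples present in each omics data type (toy example; the real data have
# 11,369 samples in the union).
gene_expr = {"s1", "s2", "s3"}
isoform_expr = {"s1", "s2"}
methylation = {"s1", "s3"}

union = gene_expr | isoform_expr | methylation

# The padding set for each omics type is the union minus that type's samples.
padding_sets = {
    "gene_expr": union - gene_expr,
    "isoform_expr": union - isoform_expr,
    "methylation": union - methylation,
}
# Missing omics vectors for these samples are zero-filled before training.
```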
2.4. Model Structure
The proposed model was structured based on semi-supervised learning, with a predictive model added to the generative model. The multi-omics generative model identifies tumor genetic patterns from several omics data types and embeds them in a low-dimensional vector. The expanded classification model infers the phenotype of the given cancer type. In addition, for samples with one or more missing omics data types, padding is added so that embeddings can be generated flexibly, even in the absence of specific omics data, by reference to the others.
2.5. Implementation Details
The proposed model was designed using PyTorch Lightning (version 1.6.3) [
49], which provides an advanced interface for PyTorch (version 1.11.0) [
50]. Overall, the model comprises linear neural networks with the rectified linear unit activation function and layer normalization. The hyperparameters were set to their optimal values within the search range (
Table 2).
The multi-omics generative model is structured as a modified VAE with three layers: the Expanded Encoder, the Concat Layer, and the Expanded Decoder. The Expanded Encoder layer comprises M encoders, where M is the number of omics data types used in learning. The input size of each encoder equals the number of features in the corresponding omics data type, and each encoder is composed of layers that create vectors with sizes of 1000, 500, and 100. Two roles are assigned to the Concat Layer. First, it concatenates the vectors generated by the Expanded Encoder layer into a single vector, outputting the final encoding vector. Second, it performs the reparameterization of the VAE, outputting the latent vector. The size of the latent vector, the final output of this layer, is 100. The Expanded Decoder layer comprises layers that take the latent vector as input to generate vectors with sizes of 500 and 1000, followed by a layer that generates a vector whose size matches the number of features of each omics data type. The final output of each decoder serves to reconstruct the original omics data.
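A minimal sketch of this architecture in PyTorch may help fix the shapes (the placement of activation and layer normalization, and the use of separate linear heads for the mean and log-variance, are assumptions; the paper specifies only the layer sizes):

```python
import torch
import torch.nn as nn

class MultiOmicsVAE(nn.Module):
    """Sketch: one encoder per omics type (features -> 1000 -> 500 -> 100),
    a Concat Layer that merges the M encodings and reparameterizes into a
    100-dim latent vector, and one decoder per omics type
    (100 -> 500 -> 1000 -> features)."""

    def __init__(self, omics_dims):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d, 1000), nn.ReLU(), nn.LayerNorm(1000),
                nn.Linear(1000, 500), nn.ReLU(), nn.LayerNorm(500),
                nn.Linear(500, 100),
            )
            for d in omics_dims
        )
        m = len(omics_dims)
        self.to_mu = nn.Linear(100 * m, 100)
        self.to_logvar = nn.Linear(100 * m, 100)
        self.decoders = nn.ModuleList(
            nn.Sequential(
                nn.Linear(100, 500), nn.ReLU(), nn.LayerNorm(500),
                nn.Linear(500, 1000), nn.ReLU(), nn.LayerNorm(1000),
                nn.Linear(1000, d),
            )
            for d in omics_dims
        )

    def forward(self, xs):
        # Concat Layer: merge per-omics encodings, then reparameterize.
        h = torch.cat([enc(x) for enc, x in zip(self.encoders, xs)], dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return [dec(z) for dec in self.decoders], mu, logvar

# Toy forward pass with three omics types of 50, 60, and 70 features
model = MultiOmicsVAE([50, 60, 70])
xs = [torch.zeros(4, 50), torch.zeros(4, 60), torch.zeros(4, 70)]
recons, mu, logvar = model(xs)
```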
The expanded classification model comprises several models, each with a single-layer structure. The embedding data are used to predict the phenotype data, and the learning outcome is transferred to the generative model.
2.6. Learning Strategies
Three criteria should be met for the proposed model to attempt learning. First, the model should perform even in the absence of certain omics data in the learning sample. Second, the model should be able to make inferences even in cases of missing omics data. Third, the concurrent use of incomplete data should contribute to enhanced performance. For this, the parameter optimization steps in the multi-omics generative model and expanded classification model were defined in the learning strategies.
2.6.1. Multi-Omics Generative Model
The multi-omics generative model has a VAE-based design. The VAE is a deep neural network that can detect manifolds in high-dimensional raw data to generate useful features for other operations, such as classification and regression. In the VAE structure, the input data x pass through the encoder to produce the latent vector z containing the feature data, and z passes through the decoder, generating output data similar to the input data x. The loss of the multi-omics generative model, as in general VAE structures, is the sum of the reconstruction loss, which measures the difference between the input data x and the data g reconstructed by the decoder, and the regularization loss, which constrains the latent vector z, sampled via reparameterization, to follow a normal distribution.
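Written out, the loss described above is the standard VAE objective (a sketch in conventional notation, with q(z | x) denoting the encoder distribution and g the decoder reconstruction):

```latex
\mathcal{L}(x) =
  \underbrace{\lVert x - g \rVert^{2}}_{\text{reconstruction loss}}
  \; + \;
  \underbrace{D_{\mathrm{KL}}\!\big( q(z \mid x) \,\Vert\, \mathcal{N}(0, I) \big)}_{\text{regularization loss}}
```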
The learning strategy of the multi-omics generative model differs from that of general VAEs. The former applies the padding technique to utilize as much information as possible, even in the absence of certain omics data. With padding, missing data are assigned values of 0, and reconstructing these placeholder values may hinder model learning. Hence, the loss attributed to padding was excluded from the calculation of the reconstruction loss by using a filter that differentiates actual data from padding data. In addition, the final loss from the expanded classification model, which infers cancer phenotypes, was included so that the reconstruction of the original data could incorporate the learning outcome for phenotype data.
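The filtered reconstruction loss can be sketched as a masked mean squared error (the use of MSE as the reconstruction criterion is an assumption; the paper specifies only that padded entries are excluded):

```python
import numpy as np

def masked_reconstruction_loss(x, recon, is_real):
    """Mean squared error computed only over actual (non-padding) entries.

    `is_real` is a boolean filter that is False for padded omics features,
    so zero-filled placeholders contribute nothing to the loss.
    """
    diff = (x - recon) ** 2
    return float(diff[is_real].mean()) if is_real.any() else 0.0

x = np.array([[1.0, 2.0], [0.0, 0.0]])       # second sample is all padding
recon = np.array([[1.0, 1.0], [5.0, 5.0]])
mask = np.array([[True, True], [False, False]])
loss = masked_reconstruction_loss(x, recon, mask)
```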
2.6.2. Expansion of the Classification Model
The expanded classification model comprises three models that predict three parameters (cancer type, primary site, and sample type) with a linear single-layer structure in each model. Each classification model receives the latent vector from the multi-omics generative model as the input to predict the phenotype. Instead of the categorical cross-entropy loss commonly used for general classification models, we calculated the focal loss [
51], which can improve the class imbalance issue. The final loss of the expanded classification model was calculated as the sum of the losses of each classification model.
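The focal loss of Lin et al. down-weights well-classified examples; a minimal multi-class sketch (the focusing parameter gamma = 2 is the standard default, and the class-balancing alpha term is omitted here; the values used in the paper are set by hyperparameter search):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Multi-class focal loss, -(1 - p_t)^gamma * log(p_t), averaged over samples."""
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

# A confident correct prediction is down-weighted far more than a hard one.
probs = np.array([[0.9, 0.1], [0.4, 0.6]])
easy = focal_loss(probs[:1], np.array([0]))  # p_t = 0.9
hard = focal_loss(probs[1:], np.array([1]))  # p_t = 0.6
```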
2.7. Performance Evaluation
The classification labels used in the proposed model were phenotypes with a class imbalance problem, i.e., substantial differences in the amount of data among classes. Hence, the F1 score for each class was calculated, and the weighted mean according to the data percentage per class was used as the total F1 score (weighted F1 score) to ensure accurate evaluation of model performance. The weighted F1 score was calculated using Scikit-learn [
52]. In addition, to verify that the proposed model could make inferences in cases of missing omics data, certain omics data in samples with complete omics data were replaced with padding data, and the reconstructed data were compared with the actual data using the mean absolute error. We used the Scikit-learn library for t-SNE calculations and the Bokeh [
53] library for visualization.
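The weighted F1 computation maps directly onto Scikit-learn's f1_score with average="weighted", which averages per-class F1 scores with weights proportional to each class's support; a toy example:

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 0 has three samples, class 1 has one.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Per-class F1: class 0 -> 0.8, class 1 -> 2/3; supports 3 and 1.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```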
4. Discussion
Embedding techniques typically extract task-dependent significant relationships in high-dimensional feature spaces and use them in downstream analysis. Many previous studies have relied on independent embedding techniques [
34,
35,
36,
37]; however, they could utilize only a limited number of samples for biomedical classification and prediction tasks. To address this issue, we investigated the rationale of embedding by adding virtual omics data that resemble actual omics data.
In this study, the utility of the newly developed method over conventional methods that disallow the use of incomplete data was validated through two tasks using embedding data: phenotype classification and the generation of virtual omics data that resemble actual omics data based on inferences on missing omics data. In the phenotype classification task, a performance comparison experiment was conducted between the existing multi-omics model, which can learn only from complete data, and the proposed model, which can also learn from incomplete data. The average performance of the models was measured via K-fold cross-validation (K = 5). The comparison covered the “111” model trained on complete data only, the “OUR” model additionally trained on incomplete data, and the “110” and “001” models trained on partial omics data only. As a result, the “OUR” model, trained with incomplete data, had the highest average performance for each phenotype (“OUR” cancer type: 0.9549, primary site: 0.9212, sample type: 0.9712) (
Figure 3). Additionally, we compared the performance after training the proposed model using additional omics data. The proposed model was trained on three omics datasets (mRNA gene expression, mRNA isoform expression, and DNA methylation), while the additional training model was trained on four omics datasets (mRNA gene expression, mRNA isoform expression, DNA methylation, and miRNA mature strand expression). The average performance of the models was measured through K-fold cross-validation (K = 5). There was no significant performance difference between the two models (“OUR” cancer type: 0.9549, primary site: 0.9212, sample type: 0.9712; “OUR+” cancer type: 0.9586, primary site: 0.9175, sample type: 0.9717).
In terms of performance improvement, the proposed model did not show any significant improvement compared to the baseline model (a model that only trains on complete data). However, notably, our proposed model could learn robustly even in the absence of complete data. To demonstrate this, we conducted performance comparison experiments for extreme conditions for which complete data are rarely available.
Figure 4 shows that the proposed model could maintain classification performance, whereas the conventional model that can learn only from complete data exhibited a sharp performance drop when complete data were nearly absent from the training data (99.90%). Herein, “99.90%” denotes the dataset in which 99.9% of the complete data in the training data were replaced with incomplete data. As a result, only three complete samples remained in the training data, preventing the existing model from training properly. Conversely, by introducing the padding strategy, the proposed model could proceed with learning and produce meaningful results even in extreme situations lacking complete data.
Figure 5 shows the results of measuring the average similarity between the virtual omics data and the actual omics data, generated through the baseline model (“111”), which learned only from complete data, and the proposed model (“OUR”), which also learned from incomplete data. For all three actual omics data types, the proposed model generated virtual omics data with higher similarity than the baseline model (baseline model—mRNA gene expression: 0.9436, mRNA isoform expression: 0.9387, DNA methylation: 0.9653; proposed model—mRNA gene expression: 0.9699, mRNA isoform expression: 0.9533, DNA methylation: 0.9656).
The proposed model is characterized by its ability to generate high-quality virtual omics data for missing omics data. This function cannot be performed in the basic model structure.
Figure 6 depicts these characteristics. Incomplete data (“011”, “101”, and “110”), generated by removing specific omics data from complete data (“111”), were set as the input data of the model. Measuring the similarity between the actual omics data and the virtual omics data generated from incomplete data revealed a high average cosine similarity across the three omics data types of 0.9646 (mRNA gene expression: 0.9704, mRNA isoform expression: 0.9555, DNA methylation: 0.9677).
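The similarity measure used here is the standard cosine similarity between an actual omics vector and its generated counterpart; a minimal sketch:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between an actual and a virtual omics vector."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors score 1.0; orthogonal vectors score 0.0.
a = np.array([1.0, 2.0, 3.0])
same = cosine_similarity(a, a)
orth = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```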
Next, we performed visualizations to understand whether the proposed model can efficiently integrate different omics data sources by maintaining complementarity across omics data types and reducing the noise of partially independent features. Through t-SNE, we visualized the formation of clusters for three phenotypes (cancer type, primary site, and sample type) (
Figure 7,
Figures S4 and S5). Here, we captured the formation of multiple clusters of specific cancer types, such as BRCA, UCEC, KIRC, and STAD (
Figure 7). This can be explained as distortion due to the formation of separate clusters depending on the presence or absence of padding data. To solve this distortion, we validated the effectiveness of virtual omics data in the virtual omics creation task. To this end, we replaced the padding with virtual omics data generated by model training and retrained the model. The embeddings generated from the retrained model were dimensionally reduced with t-SNE, and the formation of clusters for three phenotypes (cancer type, primary site, and sample type) was visualized. Similar to the previous case, the embedding data confirmed cluster formation for the three phenotypes (
Figure 8,
Figures S18 and S19). Additionally, compared to the previous visualization (
Figure 7), certain data (e.g., BRCA, UCEC, KIRC, and STAD) that formed multiple clusters for the same carcinoma converged, indicating that the cluster distortion had been resolved (
Figure 8).
Finally, we determined whether the virtual omics data generated by the proposed model differed from the actual omics data in the distribution within the latent space. To this end, the integration of actual and virtual data for a single-omics data type was visualized via t-SNE, and the difference in the distribution within the latent space was compared. Based on the cancer type, the virtual omics data formed similar clusters to those for the actual omics data in the latent space (
Figure 9).
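A t-SNE projection of embeddings like those described can be produced with Scikit-learn (the perplexity and other parameters here are illustrative; the paper does not report the values used, and the 100-dimensional inputs below are random stand-ins):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 100-dimensional latent embeddings of 30 samples.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 100))

# Perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
```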
In summary, the novel approach developed in this study maximizes the use of multi-omics. For samples with incomplete omics data, learning outcomes from certain omics data were integrated with the learning data from samples with complete information, and the consequent embedding exhibited high quality and flexibility. Furthermore, the embedding led to the generation of virtual omics data that resembled actual omics data.
This study has certain limitations. First, our strategies were developed and validated for only a limited number of tasks, namely cancer phenotype prediction, primary site prediction, and sample type prediction. However, with additional development, the strategy can be easily extended to other classification and prediction tasks, such as target identification, identification of tissue origin and specific gene expression signatures, and multimodal predictions. In addition, other experimental data types, such as single-cell RNA-seq and competing endogenous RNA regulation, can eventually be included. However, as evident from the comparison experiment of training with additional omics data, simply adding data is insufficient to precisely analyze the relationships between omics data types and to enhance performance. Second, the choice of primary site labeling (32 classes from over 100 labels) should be investigated, since assigning a single primary site label to a sample can be challenging. However, this was not the focus of this study. In future studies, advanced techniques such as attention mechanisms and graph neural networks can be utilized to comprehensively analyze the relationships between patients or omics, thus allowing weighted contributions from each data type to be reflected by global virtual omics data.
5. Conclusions
A model was developed with the embedding of SARC data for the Korean population from the SBIC (toward the construction of a national registry for rare cancers). However, owing to the inadequate number of samples in the Korean SBIC for learning by an AI model, publicly available cancer-related data in TCGA, including various genomic data types (from transcriptomes to epigenomes), were used. The SBIC data of the KBN (Korea Biobank Network) contain more limited data types, such as mRNA gene expression data and mRNA isoform expression data.
For an AI model, previous research has established that multi-omics data, compared to single-omics data, can more accurately capture the genetic features of cancer. Nevertheless, multi-omics data are often incomplete, limiting AI model application. For instance, the Korean SBIC data are incomplete; therefore, only partial omics data from TCGA could be used. Considering the vast array of genetic data related to cancer, using only parts of large-scale open databases in cases of incomplete data is a major limitation.
The newly proposed approach maximizes the utility of incomplete data. After integrating the PanCan and SBIC data, values for mRNA gene expression, mRNA isoform expression, and DNA methylation were used as the input data. Samples with missing omics data were handled by padding. While the padding strategy standardizes the data specification for AI model learning, it does not affect the learning process itself. For instance, when a given sample lacks DNA methylation data, the padding is not included in calculating the reconstruction loss for DNA methylation. Through this series of steps, a highly expandable embedding model that maximizes the use of omics data was designed, and the generated embedding exhibited high quality and flexibility.
We validated that our proposed learning strategy maintains high classification performance and responds robustly to data sparsity to address the issues associated with incomplete omics data. The results of this study are expected to prove valuable for broad omics research. The novel methodology is also expected to be applicable across all AI-based domains.
However, the VAE architecture used in our study may struggle to capture key features within complex or high-dimensional distributions, such as omics data. This difficulty arises because assuming a simple distribution, such as a Gaussian distribution in the latent space, may not adequately reflect the complexity of the data distribution. Additionally, the latent space learned through the VAE may be intricately entangled, which can complicate interpreting the model’s data generation and prediction processes.
To address these issues, we plan to design a model that can better capture the complexity of data distributions by utilizing a transformer-based VAE architecture. Furthermore, we intend to incorporate a process that allows researchers to understand and interpret the model’s outputs by visualizing the data generation and cancer phenotype prediction processes through attention visualization.