1. Introduction
Cancer is a generic term given to a collection of related diseases that can arise in every part of the organism. Typically, there is an increased division rate, dysregulation in growth and the capacity to spread into surrounding tissues and eventually on distant tissues. The latter process is known as metastasis, and is the main cause of death by cancer [
1,
2]. Entropy can be defined as the measure of the average uncertainty of a single random variable in bits, whereas the differential entropy is the entropy of a continuous random variable with an important characteristic that it only depends on the probability density of the random variable [
3]. Unlike discrete entropy, the differential entropy can be negative [
3]. Differential entropy can be conceived as the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence, low entropy implies that the random variable is confined to a small effective volume and high entropy indicates that the random variable is widely dispersed [
3]. Entropy can be used as a descriptive and comprehensive measure of multivariate variability especially when data are non-Gaussian, since it can capture higher-order statistics and information content of the data [
4].
Liver cancer is one of the main causes of cancer-related deaths worldwide [
5]. Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer and the third most common cause of cancer-related deaths [
6,
7]. Its incidence and mortality continue to rise, chronic viral hepatitis and cirrhosis being some risk factors. Early screening has a survival benefit for patients. Available methods for HCC screening are radiographic, but unfortunately diagnosis is often made at advanced states of the disease, when effectiveness of treatment has poor prognosis. Sorafenib is the recommended treatment with patients in advanced stages, but due to its side effects it is difficult for the patient to tolerate it [
8]. It is a disease with 10 to 20% recurrence even in patients with liver transplantation, which is the most successful treatment [
6]. There was a previous work in which entropy was used to detect an early warning in HCC, where the transition occurs at a very early stage of HCC [
9].
Pancreatic cancer is one of the deadliest types of cancer. It was reported as the fourth cause of death cancer-related in developed countries in 2012. The GLOBOCAN estimations of new cases were 94,700 and 92,800 and estimated deaths were 93,100 and 91,300, in males and females, respectively. In 2015 for ductal adenocarcinoma, the most common type of pancreatic cancer, 367,000 new cases were diagnosed, and 359,000 patients died that same year. A pancreatic adenoma is followed by a neoplasm which leads to a carcinoma [
10]. For diagnosis, there is currently no proper biomarker with enough specificity and sensitivity, yet the detection of the earliest stages of pancreatic cancer are urgently needed to improve the outcomes of resection, which is so far, the best treatment. Surgery is the only option for cure but only 10–15% of the newly diagnosed patients are eligible [
11]. Due to the resistance of pancreatic cancer to therapy, even with resection, most of the patients will relapse and eventually succumb to the disease [
12].
Lung cancer has the highest mortality rate. The rate of survival of patients with lung cancer is less than 5% after five years, and it tends to metastasize, thereby early diagnosis is important. Squamous cell carcinoma represents approximately 30% of all the cases [
13,
14]. Usually the stages of carcinogenesis are: squamous metaplasia → dysplasia → carcinoma in situ → squamous cell carcinoma of lung [
15,
16]. Squamous cell carcinoma (SCC) of the lung is the most common histologic subtype of non-small cell lung cancer (NSCLC). It accounts for 400,000 new cases annually worldwide, with cigarette smoking as the principal risk factor. For advanced stages, the standard care consists of a platinum-based doublet as a palliative systemic therapy [
17]. Diagnosis is made by histological analysis of small biopsies or cytological specimens as fine needle aspirates or bronchial brushings. There are new promising therapies that improve patient’s survival [
14].
Melanoma is only one among various dermatological cancers, but it is the principal cause of death due to skin cancer, accounting for approximately 80%, and with an increasing incidence in the last years with sun exposure as the main risk factor [
18,
19]. All melanomas originate from melanocytes which represents a minority of the cell population within the basilar epidermis, but it can also be found in hair follicles and other tissues. Melanocytes provide the pigment melanin to their neighboring keratinocytes. Pigment production is stimulated by UV radiation-induced DNA damage to keratinocytes [
20]. The gold standard for diagnosis is histopathological assessment [
19]. There is an ideal model of melanoma carcinogenesis ranging from benign naevi → dysplastic naevi → melanoma in situ → invasive melanoma [
20].
Teschendorff et al. [
21] proposed signaling entropy as a novel approach for analyzing and interpreting omics data, such as discriminating cells according to their differentiation potential and cancer status. In other works, some driver genes were found to be associated with reductions in network entropy [
9,
21,
22]. More recently, Brehme et al. [
23] carried out an analysis in chronic myeloid leukemia using entropy dynamics and separated the progression stages. They revealed an important difference in the chronic phase (CP) which allowed to separate it into two phases: “early” from “late” CP [
23]. That same year Park et al. [
24], using an entropy-based distance metric, were able to successfully measure the intratumor heterogeneity and propose it as a useful tool to characterize it at the RNA level using transcriptome and network information.
Cancer research has a wide variety of approaches, mainly for therapeutics usage, ranging from analysis of massive data, such as the genome or transcriptome, to more detailed analyses as of single genes or proteins that are involved in cancer hallmarks [
25]. A major challenge is the opportune detection of the early stages of the disease, which is the most desirable scenario, with better options for patients treatment and an improved outcome as seen in pancreatic cancer and lung cancer, where cancer is commonly detected in the last stages of the disease [
12,
13,
14]. Finding new therapeutics targets are important due to cancer resistance to therapies and to have more repertoires of targets that could be modulated by immunotherapy, miRNAs therapy, gene therapy, or other treatments [
26,
27].
In general, cancer progression can be divided into three states [
9]: a normal state, a pre-disease state (or a critical state), and a disease state. In the normal state, the disease is under control (immune system) and dynamically it has high resilience and robustness to perturbations. The pre-disease state is defined as the limit of the normal state, which occurs before the imminent phase transition point is reached, but it has low resilience and robustness due to its dynamical structure [
9]. The disease state represents a seriously deteriorated stage possibly with high resilience and robustness, where the system usually finds it difficult to recover or return to the normal state even after treatment, which contrasts with the pre-disease state. Therefore, it is crucial to detect the pre-disease state to prevent qualitative deterioration and to further elucidate its molecular mechanism. In cancer, it is a daunting task to predict a pre-disease state because the state of the system may change little before the bifurcation point or the critical transition is reached. There may be slight differences between the normal and pre-disease. The detection of early-warning signals can involve a myriad of genetic factors. There are leading networks in critical transitions, which make the first move from a normal state to a disease state [
9]. The leading network is the first subnetwork that breaks down the limit of a normal state to move into a disease state, which means that they are clearly related to the causal or driving genes in a disease network, in contrast to the differential gene expression that results from the disease. Therefore, identifying the leading networks during a critical transition can signal the emergence of a pre-disease state to make the early diagnosis on the disease, and help to disentangle the mechanisms of disease initiation and progression at the network level. The leading networks are dynamical signals that herald the pre-disease state, rather than the disease state detected by the traditional static biomarkers.
Herein, we contend that multivariate entropy is a useful filter for detecting driver genes. Therefore, we calculated the multivariate entropy of the gene expression profiles in four types of cancer, to wit, pancreatic cancer, melanoma, HCC, and squamous cell carcinoma of the lung. For these cancers, we also constructed their corresponding protein-protein interaction (PPI) networks considering the disease stages. We calculated the multivariate entropy for the local networks of PPI from which we estimated the average entropy of PPI networks. In general, the reliable identification of local leading networks and pre-disease stages is not easy to achieve with noisy data and a small number of samples. The identification of biomarkers and the critical states may be inaccurate. In this work, we validated our proposed biomarkers using different statistical tests, such as double filter for the differentially gene expression, from which we selected the genes for constructing the local networks. The Wilcoxon rank sum test was used to test the statistical differences in the average network entropy between the pre-disease and diseased states with the normal state. We searched the biological function of those genes whose entropy changed and some of them are already considered potential therapeutic targets, for example see [
28]. With our analyses, we also found new potential targets whose biological functions are relevant to the normal cell function. We successfully identified the most relevant genes involved in the carcinogenic processes of the four types of cancer with the help of entropic changes in local networks. Their corresponding proteins could be used as potential targets for treatments and as biomarkers of cancer.
2. Materials and Methods
Four series of raw transcriptomic data of the carcinogenesis process were retrieved from the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI). The first one: hepatocellular carcinoma with GEO accession: GSE6764, which has 75 tissue samples (platform: Affymetrix Human Genome U133 Plus 2.0 Array [
29]). The second: melanoma, GEO accession: GSE4587, with 18 samples (platform: Affymetrix Human Genome U133 Plus 2.0 Array [
30]). The third: pancreatic cancer, GEO accession: GSE19650, with 22 samples (platform: Affymetrix Human Genome U133 Plus 2.0 Array [
31]). The fourth: squamous cell carcinoma of the lung, GEO accession: GSE33479, with 122 samples (platform: Agilent-014850 Whole Human Genome Microarray 4 × 44 K G4112F).
We used the R software with Limma package [
32] according to its manual and the Bioconductor Manual to preprocess and process all the transcriptomic data. Only melanoma and squamous cell carcinoma of the lung were retrieved preprocessed (background corrected, normalized) from GEO. The Limma package was used for differentially gene expression analysis obtaining the respective adjusted
p-value for False Discovery Rate (
FDR) [
32,
33]. Differential gene expression analysis was made between each stage and the normal one. Fold changes were also calculated, and we used a double filter to select the differentially expressed genes. Criteria consist in selecting genes with an adjusted
p-value < 0.05 and Fold Change > 1.5 [
32,
33,
34,
35]. To create the networks of Protein-Protein interactions (PPIs) of each cancer, we used the APID database [
36] which provides protein interaction data for a wide variety of species with a controlled quality using PPI found by experimental evidence. We used data set from quality level 1 (all known interactions). Herein, we used the
Homo sapiens data which were processed and cleaned using the Cytoscape Software version 3.2 [
37]. Cleanup consisted in the deletion of data of other species based on its taxonomic tag, and in the deletion of duplicated edges and nodes. The proteins retrieved from the database were the ones detected by the double filter applied to the differentially gene expression but due to the lack of information of the Human interactome, we only retrieved at least 50% of the proteins coded by the genes detected by the double filter analysis. Then, we obtained the first neighbors for each node in the network using Rcy3 [
38] which allows a connection between R and Cytoscape. The result was used to create the
local networks of each node (protein). We tested the hypothesis that the distribution of expression levels across all the selected genes for the four cancers followed a normal or a log-normal distribution (
Appendix A) [
39]. The density function of the multivariate normal distribution is given by [
3,
38]:
where
x is a random vector,
μ is the mean vector, and Σ is the covariance matrix.
In each stage, the local networks data were matched with their respective gene expression of each sample to create matrices of local networks where the genes expression were the rows and each column a sample. For each local network matrix, a covariance matrix was calculated and then we applied Equation (2) to obtain the multivariate entropy of each local network. Each sample is a set of vectors and so on. The expression level of the genes are elements of the vectors.
We also group nodes (proteins) to create subnetworks. Based on the maximum and minimum calculated entropy values of the healthy states, we established the limit from where a preset range was applied as follows: HCC in ranges of 10 units resulting in 11 subnetworks, Melanoma in ranges of three units resulting in eight subnetworks, Pancreatic in ranges of five units resulting in nine subnetworks, Squamous cell lung carcinoma in ranges of 10 units resulting in 19 subnetworks. The starting point was the 0 unit. This healthy-generated groups were kept in the four types of cancer resulting in that each successive stage has the same groups with same nodes with a unique variation in their entropy values. To calculate the entropy for each local network, we consider the entropy of multivariate normal distribution given by [
3,
40]:
where
is the determinant of the covariance matrix;
X is the random variable (set of vectors);
n is the size of the covariance matrix.
Entropy of a local network is calculated with Equation (3), where a single value is calculated from the data with all genes within the same local network, and colored rectangles (
Supplementary Materials) represent the entropy change for the
local network.
The average network entropy,
H(
t) is given by:
where
n is the number of nodes in the network (differentially expressed genes) and
hi(
t) is the entropy of the local network
i. Equation (3) is the same as the one used in [
9], but with a negative sign to obtain positive values of entropy and to have units of information in bits.
Gene expression entropies (Figure 2, Tables 1–4) were calculated using all selected genes by the double filter to create a single matrix for every stage in each cancer, without using the local networks data. The level expression of the genes are elements of the vectors. From the matrix, a covariance matrix was calculated and then Equation (2) was applied to obtain the gene expression entropies.
For each entropy calculation, the following number of samples in each stage were used: HCC: seven samples; melanoma: two samples; pancreatic cancer: three samples; squamous cell lung cancer: 12 samples. The identification was made by looking at the color changes in the nodes. Colors represent a range of entropy values, so if a node has a change in color from one to another we record it.
2.1. Statistical Analysis of Local Network Entropy
Wilcoxon Rank Sum test was used to compare the results of local network entropy in each stage of the carcinogenic process versus its respective control for the four types of cancers (see
Appendix A). Calculations were made with the R package. This probe is a non-parametric statistical test to compare two population medians, and it allows to determine if two probes have significant differences.
2.2. Local Networks
Multivariate entropy of gene expression profiles is influenced only by gene expression (Figure 2). Multivariate entropy of local networks of PPIs is also influenced by gene expression with the addition of another level of complexity: the PPIs of each node. Entropy values of local networks will be influenced by which nodes constitute the local network and their respective gene expression values (
Figure 1)
3. Results
Herein we illustrate how entropy changes throughout the distinct stages of the carcinogenesis process in the four types of cancer. We used Equation (2) to obtain the global change in entropy using the gene expression of the selected genes by the double filter criteria. We found that each cancer stage displays a characteristic entropic value when compared with pre-cancerous stages.
We observe that the last part of the carcinogenesis processes, the gene expression possesses the largest positive entropy than all previous stages with the only exception of melanoma in situ. The latter could be ascribed to different pathways and inherent heterogeneities among the different regions of the body where biological carcinogenesis of melanoma ensues (
Figure 2 and
Table 1,
Table 2,
Table 3 and
Table 4) [
20]. In
Figure 2a we observed variations in entropy values throughout all carcinogenic process of Melanoma, with the most evident change occurring at in situ stage in which its entropy value reaches a minimum. This change can be used as a first cancer early warning. Notice in
Figure 2b that in HCC carcinogenic process the entropy values in pre-cancerous stages were decreasing until a sudden change in the transition point from a pre-cancerous to cancer stage. This change is in agreement with a first cancer early warning [
9]. In
Figure 2c we observe for pancreatic cancer that entropy increases as the carcinogenic process progresses. A sudden change occurs between intraductal papillary-mucinous adenoma and intraductal papillary-mucinous neoplasm stages that are not cancer stages and therefore we need to be careful to talk about a cancer early warning. In
Figure 2d we observe that for squamous cell lung carcinoma, entropy is increasing gradually in the precancerous stages. The last stages correspond to cancer and they appear to have a greater change in entropy with an obvious change in entropy between carcinoma in situ and the squamous cell carcinoma of the lung. The change between severe dysplasia and carcinoma in situ could be a cancer early warning.
To find out which genes had major changes in every stage and how they are related between them, we constructed a protein-protein network for each carcinogenic process and assigned their respective local entropies for each stage. Then we calculated the average network entropy using the local entropy values and built groups based in health controls and for each stage of the carcinogenic process. The PPIs networks of melanoma are displayed in
Figure 3. The PPIs networks with detail of melanoma, HCC, pancreatic cancer and squamous cell carcinoma of the lung are displayed in
Supplementary Materials (Figures S1–S26) The entropy values were ranked and graded by colors. Notice that color variations in each group correspond to variations in entropy.
The calculated network entropy of PPIs and the entropy values calculated from expression data are positive (
Figure 2 and
Figure 4). The average network entropy of local networks of melanoma and HCC exhibit a concave pattern in its entropy values of only gene expression, although the magnitudes are greater for HCC than for melanoma possibly due to the number of samples in each case. The interesting behaviors of the calculated average network entropy are seen in
Figure 4c,d. In
Figure 4c of the pancreatic carcinogenic process, we can identify an early warning due to a sudden change observed between the non-cancer to cancer stage which is statistically significant by Wilcoxon rank sum test with continuity correction (
p-value = 4.537 × 10
−7, see
Appendix A), albeit this was not observed with its entropy values from only gene expression. Something similar occurs with
Figure 2d, in which there is a smooth tendency of increasing entropy as the process progresses, but this pattern is not the same with the one observed in
Figure 3d. In this case, there is an evident variation among stages with three major changes in metaplasia with an increased entropy value, carcinoma in situ stage with a slight decrease compared with SQ in which entropy decreases drastically and this change is statistically significant (calculated
p-value = 0.0001274, see
Appendix A). The melanoma carcinogenic process (
Figure 3a) also denotes variations. There is an important change, melanoma in situ stage increases its entropy and the following stages decrease it and these changes are statistically significant (
p = 2.2 × 10
−16, see
Appendix A). No statistically significant changes for HCC were found (not shown). Dunnett’s test were applied for each cancer with no statistically significant results (
Appendix A,
Table A5,
Table A6,
Table A7 and
Table A8). For these cases, the average network entropy is useful as a first observation, whereas the most important data come from the local network entropy. Local network entropy values permit us to dissect each cancer to visualize the most important variations at each stage.
Some genes like CYP2C9, FDX1, MUT, VAMP4, IL33, EMP2, DENND4A have drastic changes in its entropy during the transition from atypical nevi to melanoma in situ, as observed in
Figure S10 and S11 (see
Supplementary Materials). In the case of IL33, which is implicated in maturation of Th2 cell and the activation of MPK signaling pathway through IL1RL1/ST2 receptor and this pathway improves cell proliferation [
41]. Previous studies tried to associate this protein with cancer promotion, but different results were found [
28]. CYP2C9 polymorphisms were associated with colorectal cancer risk [
42] and may influence breast cancer [
43]. EMP2 is a protein implicated in progression and survival in endometrial cancer, and recently it was proposed as a possible oncoprotein [
44]. We highlight that there are not studies in cancer for FDX1, MULT, VAMP4 and DENND4A genes or its protein products.
We found some genes in the transition of HGDLT to VECH in HCC like SOX6, ASPH, UBAP2L, CEP41. SOX6 encodes a transcription factor with a key role in developmental processes and it has been associated to HCC progression by its decreased progression [
45]. UBAP2L was associated with the metastatic ability in some HCC cell lines via SNAIL1 [
46]. Recently ASPH was suggested as a potential biomarker in gliomas [
47]. There are no studies of the role of CEP41 in cancer. In pancreatic cancer, the transition between IOIPMN to IPMC includes the following genes: FAR1, CEACAM1, HCCS. CEACAM1 has 11 different splice variants, as reported in some studies in vivo, and restoration of its expression abolishes oncogenicity of tumor cell lines, but when it is expressed de novo it increases the risk of metastasis [
48] CEACAM1 has also been proposed as a potential biomarker for breast cancer [
49]. There are no studies for HCCS and FAR1 in cancer. In the transition of squamous severe dysplasia to carcinoma in situ, the genes UBET2 PIH1D2, KIF23 showed a change in their entropy. KIF23 is a protein essential for cytokinesis in Rho-mediated signaling. Its overexpression is associated with lung cancer cell growth and has been suggested as a novel therapeutic target for patients with advanced lung cancer and primary lung tumors [
50]. A recent study showed that a knockdown of UBET2 induced an inhibition in the progression of gastric cancer in vivo and in vitro via WNT signal pathway [
51]. There are no studies of PIH1D2 in cancer. A summary of the proposed genes for each cancer is shown in
Table 9.
4. Discussion
In this work, we have succeeded in the identification of the most relevant genes involved in carcinogenic processes of four types of cancer with the help of entropic changes in local networks. We are validating the use of multivariate entropy by testing that the distributions of gene expressions can follow either a normal distribution (SCC of the lung and melanoma) or a log-normal distribution (HCC and pancreatic cancer) (
Appendix A).
We found that cancer entropic values from the average network entropy were lower than the observed values in healthy controls, which agrees with previous analysis [
21,
22]. The entropic values of the networks at the final stages for all examined cancers fully comply with the latter observation. However, not all pre-advanced cancer stages followed this behavior as we observed in melanoma in situ where entropy is higher than the control. Overall, entropy values correlate to each cancer stage. Interestingly, entropy values from only gene expression have a different behavior than those from the PPIs. In some cases, they could look like a mirror image of each other as is the case of HCC, but this is not always the case as was observed in pancreatic cancer or squamous cell carcinoma of the lung. As we mentioned previously, the differential entropy can be negative [
3].
Our analysis of the carcinogenic processes allowed us to identify initial stages of the four types of cancer at which entropy changed with respect the control. A similar early warning measured with entropy for the average network was observed in HCC but this change was not statistically significant [
9]. In pancreatic cancer, however, we found that the sudden change in average network entropy is statistically significant.
We also characterized each stage by local network entropy of its genes, which permitted to identify the most important genes in each stage and the ones in the transition between them. We illustrated the transitions from pre-cancer to cancer stage in which some of the found genes have been previously reported and even proposed to be early biomarkers in cancer by experiments in vivo and/or in vitro [
47,
49]. We also proposed new genes as biomarkers and as potential therapeutic targets (
Table 1) that have not been previously reported: for melanoma: FDX1 (essential for the synthesis of various steroid hormones [
41]), MUT (involved in degradation of several amino acids, odd-chain fatty acids and cholesterol via propionyl-CoA to the tricarboxylic acid cycle [
41]), VAMP4 (involved in the pathway that functions to remove an inhibitor of calcium-triggered exocytosis during the maturation of secretory granules [
41]), DENND4A (promotes the exchange of GDP to GTP, converting inactive GDP-bound Rab proteins into their active GTP-bound form [
41]); for HCC: CEP41 (required during ciliogenesis for tubulin glutamylation in cilium [
41]); for pancreatic cancer: FAR1 (catalyzes the reduction of saturated and unsaturated C16 or C18 fatty acyl-CoA to fatty alcohols [
41]) and HCCS (stress-activated component of a protein kinase signal transduction cascade and regulates the JNK and p38 pathways [
41]); for squamous cell carcinoma of the lung: PIH1D2 (exhibits a Ral GTPase binging which means a selectively interacting and non-covalently with Ral protein [
41]). Our proposed genes as potential biomarkers or therapeutic targets for melanoma and pancreatic cancer must be taken with caution due to their small sample sizes. Due to its biological relevance, we hope that our results of local networks strongly inspire further experimental work for testing the proposed genes as biomarkers and/or therapeutic targets. In the
Supplementary Materials, we provide entropic values for all the local networks of genes from healthy stage to all stages of cancer. This work was focused only in protein coding genes and the PPIs of its products, that represent less than 1.5% of the human genome [
52]. There is a wide field of cancer research such as non-coding-DNA/RNA, single nucleotide polymorphisms, copy number variations, and epigenetic factors such as methylation and acetylation, that could lead us to a better understanding of the dynamics of cancer diseases.