Symmetry and Complexity in Gene Association Networks Using the Generalized Correlation Coefficient

Ospina, Raydonal; Xavier, Cleber M.; Esteves, Gustavo H.; Espinheira, Patrícia L.; Castro, Cecilia; Leiva, Víctor

doi:10.3390/sym16111510


by Raydonal Ospina ^1,2, Cleber M. Xavier ^3, Gustavo H. Esteves ^4, Patrícia L. Espinheira ^1,2, Cecilia Castro ^5,* and Víctor Leiva ^6

^1 Departamento de Estatística, LInCa, Universidade Federal da Bahia, Salvador 40170-110, Brazil
^2 Departamento de Estatística, CASTLab, Universidade Federal de Pernambuco, Recife 50670-901, Brazil
^3 Departamento de Estatística e Ciências Atuariais, Universidade Federal de Sergipe, São Cristóvão 49107-230, Brazil
^4 Departamento de Estatística, Universidade Estadual da Paraíba, Campina Grande 58429-500, Brazil
^5 Centre of Mathematics, Universidade do Minho, 4710-057 Braga, Portugal
^6 Escuela de Ingeniería Industrial, Pontificia Universidad Católica de Valparaíso, Valparaíso 2362807, Chile
^* Author to whom correspondence should be addressed.

Symmetry 2024, 16(11), 1510; https://doi.org/10.3390/sym16111510

Submission received: 23 September 2024 / Revised: 16 October 2024 / Accepted: 5 November 2024 / Published: 11 November 2024

(This article belongs to the Special Issue Symmetry and Asymmetry in Nonlinear Systems)


Abstract:

High-dimensional gene expression data cause challenges for traditional statistical tools, particularly when dealing with non-linear relationships and outliers. The present study addresses these challenges by employing a generalized correlation coefficient (GCC) that incorporates a flexibility parameter, allowing it to adapt to varying levels of symmetry and asymmetry in the data distribution. This adaptability is crucial for analyzing gene association networks, where the GCC demonstrates advantages over traditional measures such as Kendall, Pearson, and Spearman coefficients. We introduce two novel adaptations of this metric, enhancing its precision and broadening its applicability in the context of complex gene interactions. By applying the GCC to relevance networks, we show how different levels of the flexibility parameter reveal distinct patterns in gene interactions, capturing both linear and non-linear relationships. The maximum likelihood and Spearman-based estimators of the GCC offer a refined approach for disentangling the complexity of biological networks, with potential implications for precision medicine. Our methodology provides a powerful tool for constructing and interpreting relevance networks in biomedicine, supporting advancements in the understanding of biological interactions and healthcare research.

Keywords:

asymmetry; bioinformatics; gene expression analysis; high-dimensional data; non-linear associations; robust statistical methods

MSC:

62H20; 92D10

1. Introduction

Biomedical informatics, an interdisciplinary field at the intersection of biology, data science, and medicine, plays a critical role in deciphering complex molecular interactions, thereby driving advancements in medical diagnostics and treatments. A core challenge in this field is the analysis of high-dimensional gene expression data, which frequently present complexities that conventional statistical methods fail to adequately capture [1,2,3,4]. Specifically, widely used correlation measures, such as Kendall, Pearson, and Spearman coefficients, often struggle to account for the presence of non-linear relationships and the influence of outliers in this type of data [5,6,7,8]. Moreover, gene expression data are commonly asymmetrically distributed, adding another layer of complexity to their analysis [9]. Several statistical methods have been developed to handle the high variability and asymmetry observed in gene expression data.

Methods such as variance stabilizing transformations, including the generalized log-normal (glog-normal) distribution [10], have been applied to genomic contexts [11,12]. Additionally, robust correlation measures like the percentage bend and skipped correlations [13,14], as well as the maximal information coefficient [15], have shown promise in capturing both linear and non-linear dependencies in biological data. Machine learning techniques and advanced clustering algorithms have contributed to the analysis of high-dimensional datasets [7,8,16,17,18,19,20]. Despite the mentioned studies, no single method has fully addressed the wide array of problems inherent to bioinformatics data.

A promising solution to these problems is the generalized correlation coefficient (GCC) [21], which introduces a flexibility parameter capable of adapting to various levels of data complexity. Unlike traditional correlation measures, the GCC can smoothly transition between characteristics of both the Pearson and Spearman correlation coefficients, providing a powerful balance of sensitivity and robustness. This makes the GCC particularly effective in capturing a broader range of linear and non-linear relationships within gene association networks. These networks, often referred to as relevance networks (RNs), provide a framework for visualizing gene interactions where high correlations are represented as edges between genes [22]. The adaptability of the GCC makes it a strong candidate for constructing such networks.

Although the GCC has demonstrated potential, to the best of our knowledge it has not yet been applied to gene association networks, a domain where traditional correlation methods often fall short, particularly when handling outliers and deviations from normality [21,23]. Such features are common in bioinformatics data, underscoring the need for more robust methods [24].

The main objective of this study is to extend the application of the GCC to gene association networks, overcoming the limitations of conventional correlation techniques. To achieve this objective, we refine existing computational methods and theoretical developments.

We employ robust estimators for the GCC based on U-statistics, valued for their resilience in complex data. We also develop Fisher-consistent estimators with flexible parameters, enhancing adaptability across various data structures. To ensure the reliability of these estimators, particularly in large-sample cases, we incorporate advanced techniques such as the delta method.

The present work broadens the applicability of the GCC within biomedical informatics, with a focus on analyzing high-dimensional biological data, such as gene expression profiles. By refining the existing methods and extending them to new contexts, we aim to enrich the statistical toolkit available for analyzing complex biological datasets [3,25,26]. As a result, our study represents a step forward in applying correlation analysis to areas such as genomics [27], health sciences, and epidemiology, tackling key challenges like asymmetry in the data distribution and non-linear dependencies.

The remainder of this article is organized as follows. Section 2 explores the theoretical advancements and computational developments of the GCC, focusing on its application to high-dimensional biological data. In Section 3, we conduct a comprehensive simulation study to evaluate the performance of the proposed estimators under various scenarios. In Section 4, practical applications of the GCC are presented, particularly in the construction of RNs using gene expression data. Finally, Section 5 summarizes our key findings and discusses the broader implications of this work, with an emphasis on future directions for biomedical research and statistical methodology.

2. Advancements and Applications of the Generalized Correlation Coefficient

In this section, we present the theoretical framework of the GCC and recent advancements that demonstrate its effectiveness in addressing challenges in the analysis of complex data, particularly in bioinformatics.

2.1. Theoretical Foundations and Developments of the Generalized Correlation Coefficient

Quantifying relationships between variables is essential in biological systems, especially through measures of association. Correlation coefficients serve as fundamental tools to determine the strength of the association between two gene expression profiles, providing key insights in various biological contexts [28,29,30,31,32,33,34,35,36,37]. These measures typically exhibit positive values when high (or low) values of one variable correspond with high (or low) values of another and negative values when high values of one variable correspond with low values of another.

For a pair of random variables (X, Y) with a joint cumulative distribution function (CDF) F and finite second-order moments, the population Pearson correlation coefficient is defined as

ρ = \frac{Cov (X, Y)}{σ_{X} σ_{Y}} = \frac{E [(X - μ_{X}) (Y - μ_{Y})]}{\sqrt{E [{(X - μ_{X})}^{2}] E [{(Y - μ_{Y})}^{2}]}},

where $Cov (X, Y)$ is the covariance, $σ_{X}^{2} = Var [X]$ and $σ_{Y}^{2} = Var [Y]$ are the variances, and $μ_{X} = E [X]$ and $μ_{Y} = E [Y]$ are the expected values of X and Y, respectively. This coefficient quantifies the linear dependence between X and Y.

The sample Pearson correlation coefficient is stated as

r_{P} = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}},

(1)

where

x_{i}, y_{i}

are the sample observations and

\bar{x}, \bar{y}

are the sample means of X and Y, respectively. If X and Y follow a bivariate normal distribution, denoted as

(X, Y) \sim N_{2} (μ_{X}, μ_{Y}; σ_{X}^{2}, σ_{Y}^{2}; ρ)

, then the sample correlation coefficient presented in (1) is the maximum likelihood (ML) estimate of the population Pearson correlation coefficient

ρ

. This estimator is consistent, meaning that as the sample size n increases,

r_{P}

converges in probability to the population correlation

ρ

.

However, when the assumption of bivariate normality is violated, several issues arise. Outliers can disproportionately influence

ρ

, leading to misleading conclusions about the relationship between X and Y. Furthermore, in the presence of non-linear relationships,

ρ

may underestimate the true strength of association, as it captures only linear dependencies [38,39]. Non-parametric alternative measures, such as the Kendall tau and Spearman rank correlation, are less sensitive to outliers and better capture monotonic relationships, making them more robust in such scenarios [40]. These measures provide more reliable estimates and reflect the true relationships in the data, particularly when dealing with biological variables that exhibit complex dependencies.

The Kendall correlation coefficient [41,42] is expressed as

r_{K} = \frac{2}{n (n - 1)} \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} sign ((x_{i} - x_{j}) (y_{i} - y_{j})),

where

sign (\cdot)

is the sign function that assigns a value of 1 to positive differences, −1 to negative differences, and 0 when there is no difference. Its population version is given by

ρ_{K} (F) = E_{F} [sign ((X_{1} - X_{2}) (Y_{1} - Y_{2}))],

with

E_{F}

representing the expectation with respect to the CDF F. For a bivariate normal distribution, whose CDF is denoted by

Φ_{2}

, the relationship is established as

ρ_{K} (Φ_{2}) = \frac{2}{π} \arcsin (ρ),

which coincides with the Pearson correlation only for

ρ \in \{- 1, 0, 1\}

and differs from it otherwise [43].

The Spearman rank correlation coefficient [44] is also less sensitive to outliers, providing a robust estimate of the correlation. Denote

H (t) = P_{F} (X \leq t)

and

G (t) = P_{F} (Y \leq t)

as the marginal CDFs of X and Y, respectively. The Spearman correlation coefficient is formulated as

ρ_{S} (F) = Cor (H (X), G (Y)) = 12 E_{F} [H (X) G (Y)] - 3,

and its sample estimate is given by

r_{S} = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)},

where $d_{i}$ is the difference between the rank of $x_{i}$ among the x-values and the rank of $y_{i}$ among the y-values; the expression is exact when there are no ties.
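The sample formulas for the Kendall and Spearman coefficients translate directly into code. The following is a minimal sketch, assuming NumPy is available and the data contain no ties; the function names `kendall_tau`, `ranks`, and `spearman_rho` are illustrative and are not taken from the article.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall coefficient r_K: average concordance sign over all pairs i < j."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)  # all index pairs with i < j
    return float(np.mean(np.sign((x[i] - x[j]) * (y[i] - y[j]))))

def ranks(v):
    """Ranks 1..n of the entries of v (assumes no ties)."""
    v = np.asarray(v, dtype=float)
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(1, len(v) + 1)
    return r

def spearman_rho(x, y):
    """Spearman coefficient r_S from the rank-difference formula (assumes no ties)."""
    d = ranks(x) - ranks(y)
    n = len(d)
    return float(1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1)))
```

Both functions depend on the data only through orderings, so applying an increasing transformation to either variable leaves their values unchanged.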

The preceding discussion shows that correlation measures behave differently under different distributional scenarios, which complicates their use in statistical analysis. In the empirical study of biological systems, particularly in gene expression data analysis, datasets are often prone to noise and measurement errors. Accurately estimating correlations in such systems is critical to prevent misinterpretations in both biological inference and statistical conclusions.

In response to these challenges, the GCC emerges as a highly robust tool for analyzing complex biological data. What sets the GCC apart is its heightened sensitivity to both linear and non-linear patterns, making it a far more versatile alternative to traditional metrics like the Pearson and Kendall coefficients [21]. By seamlessly adapting to the nuances of high-dimensional data and demonstrating remarkable resilience to outliers, the GCC provides a more precise and comprehensive measure of association, especially in complex biological datasets. This adaptability stems from the key role played by the function

g_{γ} (z) = sign (z) {| z |}^{γ}

, where

γ \in [0, 1]

, which underpins the definition of the GCC. This function modulates how differences between variables are weighted based on both magnitude and sign, allowing the GCC to dynamically adjust its sensitivity. As a result of this flexibility, the GCC is capable of capturing a wide spectrum of correlation structures, ranging from purely linear to highly intricate non-linear dependencies. This adaptability is achieved through three key population parameters that define the functional form of the GCC, each reflecting a different aspect of the relationship between variables, indicated as follows:

$θ_{1} (F) = E_{F} [g_{γ} (X_{i} - X_{j}) g_{γ} (Y_{i} - Y_{j})]$ —which captures the mutual influence between differences in pairs of variables X and Y.
$θ_{2} (F) = E_{F} [g_{γ}^{2} (X_{i} - X_{j})]$ —which quantifies the squared influence of differences in the variable X.
$θ_{3} (F) = E_{F} [g_{γ}^{2} (Y_{i} - Y_{j})]$ —which measures the squared influence of differences in the variable Y.

From these parameters, the GCC is formally defined as

ρ_{γ} (F) = \frac{θ_{1} (F)}{\sqrt{θ_{2} (F) θ_{3} (F)}},

(2)

where the parameter $γ$ modulates the degree of similarity between the GCC and traditional correlation coefficients. Specifically, when $γ = 1$, $g_{1} (z) = z$ and $ρ_{γ} (F)$ coincides with the Pearson correlation coefficient, capturing linear relationships. When $γ = 0$, $g_{0} (z) = sign (z)$, so that $θ_{2} (F) = θ_{3} (F) = 1$ for continuous variables and $ρ_{γ} (F)$ reduces to the Kendall rank correlation, which depends only on the ordering of the observations.

2.2. Practical Implementations and Computational Refinements of GCC

Building on the theoretical foundation of the GCC, advanced computational methods have been developed to apply its principles effectively in practice. A key innovation is the creation of an estimator for

ρ_{γ} (F)

based on U-statistics, which provides a robust statistical approach commonly used to construct estimators that are resilient to outliers and irregularities in data [45].

The estimator for

ρ_{γ}

is presented as

{\tilde{ρ}}_{γ} = \frac{U_{γ, X Y}}{\sqrt{U_{γ, X X} U_{γ, Y Y}}},

(3)

where

U_{γ, X Y}

,

U_{γ, X X}

, and

U_{γ, Y Y}

are U-statistic estimators corresponding to the parameters

θ_{1} (F)

,

θ_{2} (F)

, and

θ_{3} (F)

, respectively, as defined in (2). These estimators are computed as

\begin{matrix} U_{γ, X Y} & = \frac{2}{n (n - 1)} \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} g_{γ} (X_{i} - X_{j}) g_{γ} (Y_{i} - Y_{j}), \\ U_{γ, X X} & = \frac{2}{n (n - 1)} \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} g_{γ}^{2} (X_{i} - X_{j}), \\ U_{γ, Y Y} & = \frac{2}{n (n - 1)} \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} g_{γ}^{2} (Y_{i} - Y_{j}) . \end{matrix}

Each of these U-statistics is an unbiased estimator of its corresponding parameter, so (3) is a natural plug-in estimator of $ρ_{γ} (F)$ that does not require a parametric model. The use of U-statistics supports consistency and resilience to non-normal distributions and outliers, making the estimator particularly suitable for biological datasets, which frequently exhibit such features.
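The U-statistic estimator in (3) can be sketched in a few lines. The code below is an illustrative implementation assuming NumPy, with hypothetical function names `g` and `gcc_u`; it enumerates all pairs i < j directly, which costs O(n²) memory and is meant for clarity rather than speed.

```python
import numpy as np

def g(z, gamma):
    """Signed power g_gamma(z) = sign(z) * |z|**gamma."""
    return np.sign(z) * np.abs(z) ** gamma

def gcc_u(x, y, gamma):
    """U-statistic estimator of the GCC, as in (3), for a fixed gamma in [0, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)   # all index pairs with i < j
    gx = g(x[i] - x[j], gamma)
    gy = g(y[i] - y[j], gamma)
    # Averaging over the n(n-1)/2 pairs equals the factor 2 / (n (n - 1)) times the sum.
    u_xy = np.mean(gx * gy)
    u_xx = np.mean(gx ** 2)
    u_yy = np.mean(gy ** 2)
    return float(u_xy / np.sqrt(u_xx * u_yy))
```

With `gamma=1` the function returns the sample Pearson coefficient, and with `gamma=0` and no ties it returns the Kendall coefficient, which gives a convenient sanity check for any implementation.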

Further computational refinements include an explicit formulation for

ρ_{γ}

in the case of a bivariate normal distribution, with CDF denoted as

Φ_{2}

[23]. This distribution describes the joint behavior of two normally distributed random variables with a specified correlation

ρ

. The explicit formulation for the GCC in this context is given by

W (γ, ρ) = K (γ) ρ {}_{2}F_{1} (\frac{1}{2} (1 - γ), \frac{1}{2} (1 - γ); \frac{3}{2}; ρ^{2}),

(4)

where

K (γ) = \frac{2 {(Γ (γ / 2 + 1))}^{2}}{\sqrt{π} Γ (γ + 1 / 2)},

{}_{2}F_{1} (a, b; c; x)

denotes the Gaussian hypergeometric function [46], and

Γ

is the gamma function.

Thus, the expression

W (γ, ρ)

presented in (4) quantifies how the GCC

ρ_{γ} (Φ_{2})

deviates from the traditional Pearson correlation coefficient

ρ

when

ρ \neq 0

. Specifically,

W (γ, ρ)

adjusts the weighting of linear versus non-linear relationships based on the parameter

γ

. As

γ

changes, the GCC adapts to capture different types of dependencies, making it more versatile than traditional correlation coefficients.
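As a quick check on (4), the two endpoint values of γ can be worked out by hand. The sketch below reads the constant as K(γ) = 2 Γ(γ/2 + 1)² / (√π Γ(γ + 1/2)) and uses Γ(1) = 1, Γ(1/2) = √π, and the standard identity arcsin(ρ) = ρ ₂F₁(1/2, 1/2; 3/2; ρ²); it is an illustrative verification rather than part of the original derivation.

```latex
\begin{aligned}
γ = 1: \quad & K(1) = \frac{2\, Γ(3/2)^{2}}{\sqrt{π}\, Γ(3/2)} = \frac{2\, Γ(3/2)}{\sqrt{π}} = 1,
  \qquad {}_{2}F_{1}\left(0, 0; \tfrac{3}{2}; ρ^{2}\right) = 1,
  \qquad W(1, ρ) = ρ, \\
γ = 0: \quad & K(0) = \frac{2\, Γ(1)^{2}}{\sqrt{π}\, Γ(1/2)} = \frac{2}{π},
  \qquad W(0, ρ) = \frac{2}{π}\, ρ\, {}_{2}F_{1}\left(\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{3}{2}; ρ^{2}\right) = \frac{2}{π} \arcsin(ρ).
\end{aligned}
```

The first line recovers the Pearson correlation, and the second agrees with the relationship between the Kendall and Pearson coefficients under the bivariate normal model given in Section 2.1.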

To ensure that the GCC estimator is Fisher-consistent, meaning that the estimator, viewed as a functional, returns the target parameter when evaluated at the assumed model, an inverse transformation of $W (γ, ρ)$ with respect to $ρ \in [- 1, 1]$ is applied, keeping $γ$ fixed, with $W$ as described in (4). For each fixed $γ$, this transformation keeps the estimator consistent for its target under the model.

Let

r_{Q}

denote the correlation estimator corresponding to

ρ_{Q} (F)

. In the normal model

Φ_{2}

, it is established that all considered correlation estimators asymptotically follow a normal distribution, that is, we have that

\sqrt{n} (r_{Q} - ρ_{Q} (Φ_{2})) \to N (0, AV (ρ_{Q} (Φ_{2}), Φ_{2})),

where

AV (R, F)

represents the asymptotic variance, defined as

E_{F} [IF {((X, Y), R, F)}^{2}]

, with

IF ((x, y), R, F)

being the influence function of the statistical functional R at the CDF F [39].

For the bivariate normal CDF

Φ_{2}

and any correlation

ρ

in the range

[- 1, 1]

, the asymptotic variances for different correlation measures are established. For the Pearson correlation coefficient, the asymptotic variance is given by

AV (ρ (Φ_{2}), Φ_{2}) = {(1 - ρ^{2})}^{2},

as demonstrated in [47,48].

For the Kendall correlation coefficient, the asymptotic variance is established as

AV (ρ_{K}^{*} (Φ_{2}), Φ_{2}) = (1 - ρ^{2}) (\frac{π^{2}}{4} - \arcsin^{2} (ρ)),

as discussed in [43]. The variances for the Spearman correlation

ρ_{S} (Φ_{2})

and the GCC

ρ_{γ} (Φ_{2})

are detailed in [21,49].

Given the Fisher consistency of the Pearson (

r_{P}

) and Spearman (

r_{S}

) estimators under a CDF

Φ_{2}

, we apply two Fisher-consistent estimators for

ρ_{γ} (Φ_{2})

for a fixed value of

γ

, defined as

{\hat{ρ}}_{γ} = W (γ, r_{P}) \quad \text{(ML estimator)},

(5)

{\bar{ρ}}_{γ} = W (γ, 2 \sin (\frac{π}{6} r_{S})) \quad \text{(Spearman-based estimator)} .

(6)

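The derivative of $W (γ, ρ)$ with respect to $ρ$, denoted by $w (γ, ρ)$, is obtained by differentiating (4) term by term, which gives

w (γ, ρ) = \frac{\partial W (γ, ρ)}{\partial ρ} = K (γ) [{}_{2}F_{1} (\frac{1}{2} (1 - γ), \frac{1}{2} (1 - γ); \frac{3}{2}; ρ^{2}) + \frac{1}{3} {(1 - γ)}^{2} ρ^{2} {}_{2}F_{1} (\frac{(3 - γ)}{2}, \frac{(3 - γ)}{2}; \frac{5}{2}; ρ^{2})] .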

Using the delta method [50], we derive the asymptotic distributions of these estimators for a fixed value of

γ

as

\sqrt{n} ({\hat{ρ}}_{γ} - ρ_{γ}) \to N (0, w {(γ, ρ)}^{2} AV (ρ (Φ_{2}), Φ_{2}))

and

\sqrt{n} ({\bar{ρ}}_{γ} - ρ_{γ}) \to N (0, w {(γ, ρ)}^{2} AV (ρ_{S} (Φ_{2}), Φ_{2})),

confirming that both

{\hat{ρ}}_{γ}

and

{\bar{ρ}}_{γ}

exhibit asymptotic normality, with their variances scaled by

w {(γ, ρ)}^{2}

. This demonstrates the effectiveness of these estimators in approximating

ρ_{γ}

under

Φ_{2}

.
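The closed form (4), the plug-in estimators (5) and (6), and the delta-method standard error of the ML estimator can be sketched together. The code below is an illustration under several assumptions that are not part of the article: the hypergeometric function is evaluated by a truncated power series (adequate for |ρ| < 1, slow near ±1), the constant is read as K(γ) = 2 Γ(γ/2 + 1)² / (√π Γ(γ + 1/2)), the derivative `w` is obtained by differentiating (4) term by term, ranks are computed assuming no ties, and all function names are hypothetical.

```python
import math
import numpy as np

def hyp2f1_series(a, b, c, x, tol=1e-15, max_terms=100_000):
    """Truncated Gauss hypergeometric series; intended for |x| < 1."""
    term = total = 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

def K(gamma):
    return 2.0 * math.gamma(gamma / 2 + 1) ** 2 / (math.sqrt(math.pi) * math.gamma(gamma + 0.5))

def W(gamma, rho):
    """GCC under the bivariate normal model, as in (4)."""
    a = 0.5 * (1.0 - gamma)
    return K(gamma) * rho * hyp2f1_series(a, a, 1.5, rho * rho)

def w(gamma, rho):
    """Derivative of W with respect to rho (term-by-term differentiation of (4))."""
    a = 0.5 * (1.0 - gamma)
    x = rho * rho
    return K(gamma) * (hyp2f1_series(a, a, 1.5, x)
                       + (1.0 - gamma) ** 2 * x / 3.0 * hyp2f1_series(a + 1, a + 1, 2.5, x))

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Pearson correlation of the ranks (assumes no ties).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

def gcc_ml(x, y, gamma):
    """Estimator (5): W evaluated at the sample Pearson coefficient."""
    return W(gamma, pearson(x, y))

def gcc_spearman(x, y, gamma):
    """Estimator (6): W evaluated at the transformed Spearman coefficient."""
    return W(gamma, 2.0 * math.sin(math.pi * spearman(x, y) / 6.0))

def gcc_ml_se(x, y, gamma):
    """Delta-method standard error of (5), using AV = (1 - rho^2)^2 with rho replaced by r_P."""
    r = pearson(x, y)
    return abs(w(gamma, r)) * (1.0 - r * r) / math.sqrt(len(x))
```

A simple way to gain confidence in such a sketch is to compare `w` against a central finite difference of `W`, and to confirm that `W(1, rho)` returns `rho` while `W(0, rho)` returns `(2 / pi) * arcsin(rho)`.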

In summary, these refinements reinforce the theoretical foundations of the GCC, particularly for high-dimensional biological data. By ensuring Fisher consistency and leveraging robust statistical methods, the proposed estimators provide precise and reliable correlation analyses, which are critical in fields such as epidemiology, genomics, and health sciences. Having established the theoretical and practical foundations for the GCC, we proceed to evaluate its performance through a comprehensive simulation study.

3. Simulation Study

This section evaluates the performance of the GCC estimators under various simulated scenarios, providing empirical evidence of their efficacy and robustness in analyzing complex biological data, particularly gene expression profiles with prevalent non-linear dependencies and asymmetries.

3.1. Simulation Design

We design several simulation scenarios to reflect real-world conditions commonly encountered in gene expression data. We assess the performance of the estimators under various correlation structures, sample sizes, and contamination levels. Specifically, we consider the following cases:

Case 1: Standard bivariate normal distribution without contamination. Samples are drawn from a bivariate normal distribution $N_{2} (μ_{X}, μ_{Y}; σ_{X}^{2}, σ_{Y}^{2}; ρ)$ with means $μ_{X} = μ_{Y} = 0$, variances $σ_{X}^{2} = σ_{Y}^{2} = 1$, and correlation coefficients $ρ \in \{0, 0.3, 0.9\}$, representing no correlation, moderate correlation, and high correlation, respectively. This case evaluates the estimators under ideal conditions with no contamination and different correlation strengths.
Case 2: Bivariate normal distribution with shifted means. To assess the robustness of the estimators to location shifts, we generate samples from a bivariate normal distribution $N_{2} (- 0.5, 0.5; 1, 1; ρ)$ with the same correlation coefficients $ρ \in \{0, 0.3, 0.9\}$. This case evaluates the effect of mean shifts on the estimator performance.
Case 3: Bivariate normal distribution with increased variance. To investigate the impact of increased variability, we generate samples from the distribution $N_{2} (0, 0; σ_{X}^{2} = 4, σ_{Y}^{2} = 4; ρ)$, with variances four times greater than in the previous cases and the correlation coefficients remaining as $ρ \in \{0, 0.3, 0.9\}$. This case simulates scenarios with high variability in biological data.
Case 4: Contaminated bivariate normal distribution. In this case, we create a mixture consisting of 60% of a bivariate normal distribution with high correlation ($ρ = 0.9$) and 40% of a bivariate normal distribution with no correlation ($ρ = 0$). The mixture proportions are 60% and 40%, with correlation coefficients $ρ \in \{0.1, 0.5, 0.9\}$. This case evaluates the performance of the estimators in the presence of heterogeneous subpopulations with varying correlation patterns.
Case 5: Mixture of bivariate normal distributions. To simulate heterogeneous data commonly observed in gene expression analysis, we generate samples from a mixture of two bivariate normal distributions with different means and/or covariances. The mixture proportions considered are 10%, 30%, and 50%, with a weak correlation $ρ = 0.1$. This case evaluates the performance of the estimators when data arise from different subpopulations with distinct correlation patterns.

For each of the five cases, we evaluate the performance of the following estimators:

GCC estimator based on U-statistics ( ${\tilde{ρ}}_{γ}$ )—GCC-U—as defined in (3).
GCC estimator based on ML ( ${\hat{ρ}}_{γ}$ )—GCC-ML—as stated in (5).
Adjusted Spearman rank correlation coefficient ( ${\bar{ρ}}_{γ}$ )—adjusted Spearman—as presented in (6).

The contamination scenarios were specifically chosen to reflect conditions frequently observed in biological data, such as those in molecular biology and epidemiology. These scenarios generate asymmetries and provide a comprehensive representation of real-world challenges, simulating outliers and heavy-tailed distributions. Additional contamination settings were deemed unnecessary, as they would not contribute further insights beyond what is already observed under the tested conditions.

We conducted 5000 Monte Carlo replicates for each simulation scenario to ensure robust results and to provide reliable estimates of the behavior of the estimators. The sample sizes considered were

n \in \{10, 50, 100, 250, 500\}

, chosen to evaluate the performance of the estimators in both small-sample and large-sample settings. These sample sizes allowed us to assess the consistency and convergence properties of each estimator as n increases.

As discussed earlier, we evaluated the influence of the flexibility parameter

γ

at three values,

γ \in \{0, 0.5, 1\}

, which capture a range of behaviors from rank-based to linear dependencies. By varying both the sample size and the parameter

γ

, we assessed the performance of the estimators across different data complexities, including varying levels of correlation, non-linear dependencies, and robustness to outliers. This comprehensive evaluation provides valuable insights into how the estimators perform under diverse conditions typically encountered in the analysis of biological data, such as gene expression profiles.
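The structure of such a Monte Carlo experiment can be sketched for the normal cases with γ = 0, where the target has the closed form (2/π) arcsin(ρ) and the U-statistic estimator reduces to the Kendall coefficient. The code below is a simplified illustration assuming NumPy; the function names, the default of 300 replicates (instead of the 5000 used in the study), and the restriction to γ = 0 are choices made for this sketch, and performance is summarized by a root mean square error against the closed-form target.

```python
import numpy as np

def sample_bvn(n, rho, rng, mean=(0.0, 0.0), var=(1.0, 1.0)):
    """Draw n pairs from N2(mean; var; rho), as in the normal simulation cases."""
    sx, sy = np.sqrt(var)
    cov = [[var[0], rho * sx * sy], [rho * sx * sy, var[1]]]
    xy = rng.multivariate_normal(mean, cov, size=n)
    return xy[:, 0], xy[:, 1]

def gcc_u_gamma0(x, y):
    """GCC-U at gamma = 0: average concordance sign over pairs i < j (no ties)."""
    i, j = np.triu_indices(len(x), k=1)
    return float(np.mean(np.sign((x[i] - x[j]) * (y[i] - y[j]))))

def gcc_ml_gamma0(x, y):
    """GCC-ML at gamma = 0: W(0, r_P) = (2 / pi) * arcsin(r_P)."""
    return float(2.0 / np.pi * np.arcsin(np.corrcoef(x, y)[0, 1]))

def rmse(estimates, target):
    return float(np.sqrt(np.mean((np.asarray(estimates) - target) ** 2)))

def rmse_gamma0(n, rho, reps=300, seed=0, mean=(0.0, 0.0), var=(1.0, 1.0)):
    """Monte Carlo RMSE of both estimators for one (n, rho) configuration."""
    rng = np.random.default_rng(seed)
    target = 2.0 / np.pi * np.arcsin(rho)   # true GCC at gamma = 0 under normality
    est_u, est_ml = [], []
    for _ in range(reps):
        x, y = sample_bvn(n, rho, rng, mean, var)
        est_u.append(gcc_u_gamma0(x, y))
        est_ml.append(gcc_ml_gamma0(x, y))
    return {"GCC-U": rmse(est_u, target), "GCC-ML": rmse(est_ml, target)}
```

For example, `rmse_gamma0(50, 0.3)` corresponds to the standard normal setting, while passing `mean=(-0.5, 0.5)` or `var=(4.0, 4.0)` mimics the shifted-mean and increased-variance settings. Extending the sketch to other values of γ requires evaluating W(γ, ρ) from (4) for the target.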

3.2. Simulation Results

The performance of the estimators was evaluated using the root mean square error (RMSE), calculated as

RMSE = {((1 / N) \sum_{k = 1}^{N} {(ρ_{γ}^{(k)} - ρ_{γ})}^{2})}^{1 / 2}

, where

ρ_{γ}^{(k)}

is the estimated value of the GCC in the k-th simulation,

ρ_{γ}

is the true value of the GCC, and N is the total number of simulations. We consider the following cases:

Case 1: Standard bivariate normal distribution without contamination. The RMSE values for each estimator are reported in Table 1 for different correlation values $ρ \in \{0, 0.3, 0.9\}$, flexibility parameters $γ \in \{0, 0.5, 1\}$, and sample sizes $n \in \{10, 50, 100, 250, 500\}$.
The results presented in Table 1 lead to the following key observations:
– Superior performance of ${\hat{ρ}}_{γ}$ (GCC-ML): Across all correlation levels and sample sizes, the ML estimator (${\hat{ρ}}_{γ}$) consistently achieves the lowest RMSE. This demonstrates its robustness and accuracy, particularly for small to moderate sample sizes. The GCC-ML estimator handles different correlation structures effectively, making it a reliable choice in both low- and high-correlation scenarios.
– Convergence with increasing sample size: As the sample size increases, all estimators show a reduction in RMSE, indicating convergence towards the true value of $ρ_{γ}$. For $n \geq 100$, the RMSE differences between estimators narrow, but GCC-ML continues to exhibit a slight advantage.
– Impact of correlation strength: In high-correlation settings ($ρ = 0.9$), all estimators show markedly lower RMSE values, reflecting better performance under strong linear relationships. This improvement is more pronounced for large sample sizes, where RMSE values decrease rapidly.
– Effect of the flexibility parameter $γ$: The parameter $γ$ influences the estimator sensitivity to different types of dependencies. When $γ = 0$ (similar to the Kendall tau), the RMSE is higher for small sample sizes, indicating greater sensitivity of rank-based measures to small samples. As $γ$ increases, the estimators capture more linear dependencies, leading to a decrease in RMSE. The intermediate value of $γ = 0.5$ offers a balance between rank-based and moment-based correlation properties.
– Relative performance of ${\tilde{ρ}}_{γ}$ (GCC-U) and ${\bar{ρ}}_{γ}$ (adjusted Spearman): While both ${\tilde{ρ}}_{γ}$ (GCC-U) and ${\bar{ρ}}_{γ}$ (adjusted Spearman) generally exhibit higher RMSE than ${\hat{ρ}}_{γ}$ (GCC-ML), their performance improves with large sample sizes. For small sample sizes ($n = 10$ or $n = 50$), GCC-U tends to slightly overestimate $ρ_{γ}$ for $γ = 0$, especially in low-correlation settings ($ρ = 0$). In addition, the adjusted Spearman estimator tends to underestimate $ρ_{γ}$, particularly for moderate correlations ($ρ = 0.3$).
These findings indicate that, while the three estimators exhibit consistency with increasing sample sizes, the ML estimator provides the most reliable and accurate estimates for a wide range of scenarios. The choice of $γ$ should be based on the underlying correlation structure and desired sensitivity to linear or non-linear dependencies.
Case 2: Bivariate normal distribution with shifted means. In this case, we evaluate the robustness of the estimators when data are drawn from a bivariate normal distribution with shifted means, reflecting deviations commonly encountered in real-world datasets, such as gene expression profiles. Specifically, samples were generated from a bivariate normal distribution $N_{2} (μ_{X} = - 0.5, μ_{Y} = 0.5; σ_{X}^{2} = 1, σ_{Y}^{2} = 1; ρ)$ while maintaining the same correlation levels as in Case 1, that is, $ρ \in \{0, 0.3, 0.9\}$. The shift in means introduces an additional layer of complexity, testing the ability of the estimators to adapt to changes in location. Although the variances remain constant, the altered central tendency requires the estimators to perform effectively under different distributional settings. RMSE values for each estimator are presented in Table 2.
Key observations from the results of Table 2 are the following:
– Similar to Case 1, ${\hat{ρ}}_{γ}$ (GCC-ML) consistently shows the lowest RMSE values across most correlation levels and sample sizes, demonstrating robustness to mean shifts. The estimator remains stable even under these non-standard conditions, with minimal sensitivity to the shifted means, especially for higher values of $γ$ (closer to the Pearson correlation), where RMSE is lowest across all sample sizes.
– Both ${\tilde{ρ}}_{γ}$ (GCC-U) and ${\bar{ρ}}_{γ}$ (adjusted Spearman) are more affected by the mean shift, particularly for small sample sizes ($n = 10$ and $n = 50$). RMSE values for ${\tilde{ρ}}_{γ}$ increase slightly compared to Case 1, reflecting reduced performance in adapting to location shifts. This effect is more noticeable for $γ = 0$, suggesting that rank-based estimators are more sensitive to shifts in location. The adjusted Spearman estimator tends to underestimate $ρ_{γ}$ but shows less sensitivity to the mean shift than the GCC-U estimator.
– In high-correlation scenarios ($ρ = 0.9$), estimators exhibit lower RMSEs, confirming their ability to capture strong relationships despite the mean shift. The ML estimator displays the least variability across different $γ$ values, maintaining its advantage. For moderate correlations ($ρ = 0.3$), the mean shift has a more pronounced effect on the adjusted Spearman estimator, which exhibits higher RMSE values than the ML and U-statistic-based estimators.
– As sample sizes increase, RMSE values for all estimators decrease, with differences between them becoming less pronounced. For $n = 250$ and $n = 500$, RMSE values converge across all values of $γ$; the ML estimator still performs slightly better, although its advantage is largest for small and moderate sample sizes.
– The parameter $γ$ continues to influence estimator performance. For $γ = 1$ (similar to the Pearson correlation), the estimators are unaffected by the mean shift. However, for $γ = 0$ (similar to the Kendall tau), the impact of the mean shift is evident, particularly for the GCC-U estimator. Lower values of $γ$ show high sensitivity to location shifts, reflecting the rank-based nature of the estimator in such settings.
The introduction of mean shifts provided valuable insights into the robustness of the estimators. While all estimators showed convergence as sample sizes increased, the estimator ${\hat{ρ}}_{γ}$ consistently outperformed the others across a wide range of conditions. The mean shift had a noticeable impact on the performance of the GCC-U and adjusted Spearman estimators, particularly for small sample sizes and low values of $γ$ . These findings highlight the importance of choosing an appropriate value for $γ$ based on the data structure and the expected behavior of the estimators under non-standard conditions such as location shifts.
Case 3—Bivariate normal distribution with increased variance. In this case, samples are drawn from a bivariate normal distribution $N_{2} (0, 0; 4, 4; ρ)$, where the variances are increased fourfold for both variables. This case simulates high-variability conditions, often observed in genomics and biological data, where variability can obscure underlying correlation patterns. Table 3 presents the RMSE results for this case, considering the same range of correlation coefficients $ρ \in \{0, 0.3, 0.9\}$; flexibility parameters $γ \in \{0, 0.5, 1\}$; and sample sizes $n \in \{10, 50, 100, 250, 500\}$.
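The Case 3 design can be sketched as a small Monte Carlo loop. The Python sketch below (the authors report using R; Python is used here only for illustration) draws pairs from $N_{2} (0, 0; 4, 4; ρ)$ and computes the RMSE as the square root of the mean squared deviation of an estimator from a target value. The sample Pearson coefficient stands in for the GCC estimators, whose formulas are not reproduced in this section. The function names, the 500 replications, and the seed are assumptions, not the settings of the study.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation; a stand-in for the GCC estimators."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_bvn(n, rho, var, rng):
    """Draw n pairs from N2(0, 0; var, var; rho)."""
    sd = math.sqrt(var)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(sd * z1)
        # Mixing z1 and z2 this way gives correlation rho with x.
        ys.append(sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    return xs, ys


def mc_rmse(estimator, n, rho, target, var=4.0, reps=500, seed=1):
    """Monte Carlo RMSE of `estimator` around `target` (reps is an assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x, y = sample_bvn(n, rho, var, rng)
        total += (estimator(x, y) - target) ** 2
    return math.sqrt(total / reps)


for n in (10, 50, 100):
    print(n, round(mc_rmse(pearson, n, rho=0.3, target=0.3), 3))
```

Any function that takes two equal-length sequences can be passed as `estimator`, so a GCC estimator could be plugged in together with the matching target $ρ_{γ}$.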
Key observations from the results in Table 3 include the following:
–
The increase in variance gives greater dispersion, making correlation estimation more challenging. This is reflected in the slightly higher RMSE values, particularly for small sample sizes ( $n \in \{10, 50\}$ ), when compared to the previous cases.
–
Despite the high variance, the GCC-ML estimator continues to exhibit the lowest RMSE across most scenarios, consistent with previous observations. However, in certain conditions, such as low correlation ( $ρ = 0$ ) and low flexibility ( $γ = 0$ ), the adjusted Spearman estimator may display slightly lower RMSE. This emphasizes the robustness of the ML estimator in varied data conditions, though the adjusted Spearman estimator remains a competitive alternative in some settings.
–
As in previous cases, RMSE values for all estimators decrease as the sample size grows, indicating their consistency and convergence toward the true $ρ_{γ}$ . For large sample sizes ( $n = 250$ and $n = 500$ ), differences between estimators become less pronounced, though the GCC-ML estimator maintains a slight advantage.
–
The parameter $γ$ continues to play a relevant role in the performance of the estimators. For $γ = 0$ (similar to the Kendall tau), the RMSE tends to be higher for small sample sizes, reflecting greater sensitivity to rank-based associations. As $γ$ increases to 1 (similar to the Pearson correlation), the estimators perform better in capturing linear relationships, resulting in lower RMSE values.
–
While ${\tilde{ρ}}_{γ}$ and ${\bar{ρ}}_{γ}$ show slightly higher RMSE values compared to ${\hat{ρ}}_{γ}$ , their performance improves as the sample size increases. For small sample sizes, the GCC-U estimator tends to overestimate $ρ_{γ}$ when $γ = 0$ , particularly in low-correlation scenarios. Additionally, the adjusted Spearman estimator tends to underestimate $ρ_{γ}$ , especially at moderate correlation levels ( $ρ = 0.3$ ).
Therefore, while the increased variance in the data leads to slightly higher RMSE values for all estimators, ${\hat{ρ}}_{γ}$ continues to demonstrate superior performance across all conditions. The choice of $γ$ remains crucial, influencing the estimator sensitivity to different types of dependencies. In particular, $γ = 0.5$ provides a balanced performance across linear and rank-based correlations.
Case 4—Contaminated bivariate normal distribution. In this case, we model contamination by introducing a mixture of bivariate normal distributions, where 60% of the data is drawn from a bivariate normal distribution with a correlation of $ρ = 0.9$ and 40% from a bivariate normal distribution with zero correlation ( $ρ = 0$ ). This setup simulates the presence of uncorrelated observations, effectively introducing outliers and reflecting scenarios commonly observed in real-world data. The results of this case are summarized in Table 4.
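A contaminated sample of this kind can be generated by choosing the component for each observation at random. The Python sketch below assumes zero means and unit variances in both components, which this section does not state; the function names and the seed are also illustrative. Under these assumptions the overall Pearson correlation of the mixture is $0.6 \times 0.9 = 0.54$, which the sample value should approach for large $n$.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_contaminated(n, rng, rho_main=0.9, p_main=0.6):
    """Mixture: with probability p_main use rho_main, otherwise zero correlation.

    Zero means and unit variances are assumptions made for this sketch.
    """
    xs, ys = [], []
    for _ in range(n):
        rho = rho_main if rng.random() < p_main else 0.0
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


rng = random.Random(7)
x, y = sample_contaminated(20000, rng)
print(round(pearson(x, y), 2))  # should be near 0.54 under the stated assumptions
```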
Key observations from Table 4 include the following:
–
For $ρ = 0$ , both estimators ${\hat{ρ}}_{γ}$ and ${\tilde{ρ}}_{γ}$ tend to overestimate the value of $ρ_{γ}$ when the sample size is small ( $n = 10$ ). However, as the sample size increases, these estimators converge toward the true value, with the GCC-ML estimator showing marginally lower RMSE values. The estimator ${\bar{ρ}}_{γ}$ consistently yields the smallest RMSE, demonstrating strong robustness to contamination in this scenario.
–
At a moderate correlation ( $ρ = 0.3$ ), the GCC-ML estimator underestimates the true value of $ρ_{γ}$ , particularly for small sample sizes. Conversely, the GCC-U estimator tends to overestimate the true correlation when $n = 10$ . However, as the sample size increases, the performance of the GCC-U estimator improves, and its RMSE decreases. The adjusted Spearman estimator continues to perform well, although it slightly underestimates the true value of $ρ_{γ}$ across all sample sizes.
–
For high correlation ( $ρ = 0.9$ ), the GCC-ML estimator shows consistent underestimation of $ρ_{γ}$ , though its variability is reduced compared to the moderate correlation case. The GCC-U estimator tends to overestimate $ρ_{γ}$ when the sample size is small, but this tendency diminishes with large sample sizes. The adjusted Spearman estimator exhibits a slight underestimation but demonstrates less variability than the other estimators for large sample sizes.
–
Contamination influences the estimators differently based on the value of $γ$ . For small values of $γ$ (closer to the Kendall tau), the estimators tend to be more robust, with the adjusted Spearman estimator showing the highest robustness. As $γ$ increases, moving closer to the Pearson correlation, the estimators become more sensitive to outliers, leading to higher RMSE values, particularly for the GCC-ML and GCC-U estimators in small sample settings.
–
As observed in previous cases, RMSE values decrease as the sample size grows, reflecting consistency and convergence toward the true correlation value. The GCC-ML estimator continues to hold an advantage for large sample sizes, while the adjusted Spearman estimator shows greater stability across different $γ$ values.
The results of Case 4 highlight the influence of contamination on estimator performance. Although the GCC-ML estimator generally performs well, its sensitivity to outliers is more pronounced for small sample sizes and high values of $γ$ . The adjusted Spearman estimator exhibits greater robustness under these conditions, particularly in moderate- and high-correlation settings.
Case 5—Mixture of bivariate normal distributions. In this case, we simulate heterogeneity in the data by generating samples from a mixture of two bivariate normal distributions with different means and/or covariances. The mixture proportions considered are 10%, 30%, and 50%, and the performance of the estimators is evaluated for a weak correlation ( $ρ = 0.1$ ). Additionally, the flexibility parameter $γ$ is assessed for values of $γ = 0$ , $γ = 0.5$ , and $γ = 1$ , covering a range from rank-based to moment-based correlation measures. The results of this case are summarized in Table 5.

Key observations from Table 5 include the following:

With 10% contamination, the estimator ${\hat{ρ}}_{γ}$ tends to slightly underestimate $ρ_{γ}$ for $γ = 1$ , particularly when the sample size is small. However, as the sample size increases, all estimators converge to the true value. The estimator ${\tilde{ρ}}_{γ}$ exhibits more variability, especially for small sample sizes and high values of $γ$ . The estimator ${\bar{ρ}}_{γ}$ shows a slight underestimation for $γ = 1$ , but it converges as the sample size increases.
With 30% contamination, the GCC-ML estimator tends to slightly overestimate $ρ_{γ}$ for $γ = 0.5$ and a small sample size ( $n = 10$ ). The GCC-U estimator also shows some overestimation for small sample sizes but improves with large sample sizes. The adjusted Spearman estimator underestimates $ρ_{γ}$ across all values of $γ$ , although its performance improves considerably with large sample sizes.
With 50% contamination, the GCC-ML and GCC-U estimators exhibit high variability for small sample sizes, particularly for $γ \in \{0.5, 1\}$, with both tending to overestimate $ρ_{γ}$. The adjusted Spearman estimator remains consistent, slightly underestimating the true value but showing much lower variability as the sample size increases.
As contamination levels increase (from 10% to 50%), all estimators show increased variability, particularly for small sample sizes. However, for large sample sizes ( $n = 250$ and $n = 500$ ), the RMSE values decrease, indicating convergence toward the true value. The estimators generally perform better with lower contamination levels, and the impact of contamination is pronounced for high values of $γ$ .
The parameter $γ$ affects the estimators’ performance. For $γ = 0$ (similar to the Kendall tau), the estimators tend to be more robust against contamination, especially for large sample sizes. For $γ = 1$ (similar to the Pearson correlation), the estimators become more sensitive to contamination, resulting in higher RMSE values, particularly for small sample sizes.

The results of Case 5 illustrate the impact of contamination on the performance of the estimators. The GCC-ML estimator shows better performance for large sample sizes, while the adjusted Spearman estimator provides a robust alternative, especially for moderate and high contamination levels.

The insights gained from the simulation study demonstrate the varied performance of the proposed estimators across different conditions of correlation, contamination, variance, and sample size. Overall, the ML-based estimator ${\hat{ρ}}_{γ}$ consistently outperformed the other estimators in terms of accuracy, particularly for small to moderate sample sizes. Its robustness to different correlation levels and contamination was evident, although it showed slight sensitivity to high levels of contamination and extreme values for large $γ$. As sample sizes increased, the differences in performance between GCC-ML and the other estimators diminished, but the ML-based estimator maintained its advantage in terms of lower RMSE values.

The U-statistics-based estimator ${\tilde{ρ}}_{γ}$, while showing more variability in certain cases, particularly in small sample sizes or under shifts in location (as seen in Case 2), improved as sample sizes increased. However, the GCC-U estimator had a tendency to overestimate $ρ_{γ}$ for small values of $γ$ and low correlation levels, and it was more sensitive to contamination, particularly when $γ = 0$. This sensitivity reflects the rank-based nature of the estimator, which is more impacted by outliers and distribution shifts.

The adjusted Spearman estimator ${\bar{ρ}}_{γ}$ exhibited strong robustness to contamination, consistently producing low RMSE values in cases with moderate contamination levels. However, it tended to underestimate $ρ_{γ}$, especially at moderate correlation levels. Despite this, the adjusted Spearman estimator performed stably across different contamination and correlation levels, making it a reliable option for highly contaminated datasets.

The flexibility parameter $γ$ played a crucial role in the behavior of all estimators. Low values of $γ$, particularly $γ = 0$ (similar to the Kendall tau), offered more robustness against contamination and outliers, while high values of $γ$ (closer to the Pearson correlation) performed better at capturing linear relationships in uncontaminated datasets. The intermediate value of $γ = 0.5$ balanced sensitivity to both linear and non-linear dependencies effectively, providing a flexible approach to varying data complexities.

Across all scenarios, the impact of sample size was clear: as the sample size increased, all estimators showed improved performance, with reduced RMSE values and better convergence toward the true value of $ρ_{γ}$. The performance gap between the estimators was more noticeable for small sample sizes ($n = 10$ and $n = 50$), but this gap narrowed as the sample sizes grew ($n = 250$ and $n = 500$). Notably, the ML-based estimator demonstrated the most rapid convergence, particularly in large sample sizes.

In summary, the GCC-ML estimator provided the most reliable and accurate performance across diverse scenarios, with the choice of the flexibility parameter $γ$ influencing the estimators’ behavior. Low values of $γ$ favored robustness, particularly in contaminated datasets, while high $γ$ values excelled at capturing linear dependencies. Therefore, this study highlights the importance of selecting an appropriate value of $γ$ based on the underlying data structure to optimize estimator performance.

The simulation results support the capacity of the GCC to handle non-linear dependencies, high-dimensional data, and contamination, all of which are common challenges in the analysis of gene expression studies. These strengths position the GCC as a valuable tool for constructing and analyzing RNs, crucial for identifying complex interactions and potential biomarkers in biological systems. With these findings, we now transition to the practical application of these methods in constructing RNs.

4. Relevance Networks and Advanced Statistical Applications

This section explores the practical application of the GCC and its adaptations, focusing on the construction of RNs using gene expression data. We detail the data collection process, including sample acquisition, processing, and interpretation, while demonstrating the enhanced performance of the GCC estimators in real biological data. Special attention is given to the role of these estimators in improving the robustness of the analysis, particularly when compared to traditional methods like Pearson and Spearman correlations.

4.1. Data Collection and Relevance Network Methodology

This study utilized gene expression data collected from high-throughput Agilent microarray platforms. Detailed specifications of the platforms can be found on the manufacturer website: http://www.genomics.agilent.com (accessed on 4 November 2024).

The dataset comprises complementary deoxyribonucleic acid (cDNA) samples from biopsies performed on approximately one thousand patients undergoing diagnostic procedures for oncological or precancerous conditions in the esophageal and gastric regions.

The dataset provides an ideal foundation for applying the enhanced GCC methodology validated in our simulation study. We utilized three estimators, ${\hat{ρ}}_{γ}$ (GCC-ML), ${\tilde{ρ}}_{γ}$ (GCC-U), and ${\bar{ρ}}_{γ}$ (adjusted Spearman), all of which showed robust performance under non-linear dependencies, high-dimensional data, and contamination scenarios. This robustness makes them particularly suited for exploring gene interactions.

Leveraging this advanced methodology, we constructed RNs to analyze complex gene interactions in pathological conditions. The RNs generated provided a more comprehensive view of gene associations compared to traditional methods, enabling the identification of intricate interactions that could serve as biomarkers or therapeutic targets in oncology.

These methodologies, supported by extensive simulations, offer a robust framework for analyzing high-dimensional biological datasets and demonstrate their practical utility in the construction and interpretation of RNs.

4.2. Integration of Advanced Statistical Methods in RN Analysis

The GCC estimators were integrated into the RN construction process to enhance the identification and interpretation of gene correlations.

As demonstrated in the simulation study, the GCC provides superior handling of non-linear dependencies and contamination, making it a valuable tool for real-world datasets.

Applying these methods to gene expression data reveals meaningful patterns that traditional measures may overlook, highlighting their importance in biomedical informatics.

In this study, we identified a cohort of 146 individuals for deeper analysis, including samples of normal, inflamed, and metaplastic mucosa. The RNs were constructed using data from 57 normal gastric tissue samples.

Gene expression associations were measured using the Kendall $ρ_{K}$, Pearson $ρ$, and Spearman $ρ_{S}$ correlation coefficients, as well as non-linear indices, such as mutual information [35,51,52]. For RN construction, we applied the squared sample Pearson correlation coefficient $r_{P}^{2}$ to calculate pairwise gene correlations, forming a fully connected graph. A threshold criterion $r_{P}^{2^{'}}$ was used to segment the graph into smaller interconnected subnetworks based on the condition $r_{P}^{2} > r_{P}^{2^{'}}$.
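The thresholding step described above can be sketched as follows: compute the squared Pearson correlation for every gene pair, keep the pairs that exceed the threshold, and read off the connected subnetworks. This is a minimal Python illustration on toy data; the gene labels `g1` to `g4`, the expression values, and the function names are hypothetical and do not come from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def relevance_edges(expr, threshold):
    """Keep gene pairs whose squared correlation exceeds the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r * r > threshold:
            edges[(g1, g2)] = r  # the sign is kept so edges can be colored later
    return edges


def subnetworks(edges):
    """Connected components of the thresholded graph (isolated genes are dropped)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps


# Toy expression profiles (hypothetical values, five samples per gene).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.1, 9.8],  # nearly linear in g1
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "g4": [1.1, 4.8, 2.2, 4.1, 2.9],  # moves opposite to g3
}
edges = relevance_edges(expr, threshold=0.5)
print(sorted(edges))
print(subnetworks(edges))
```

Replacing `pearson` with another association measure leaves the rest of the construction unchanged, which is how a GCC estimator could be substituted.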

Pearson, Spearman, and GCC metrics were used for comparative analysis of the resulting networks.

The empirical evaluation, supported by the histogram in Figure 1, suggests that a normal distribution fits the gene expression data well.

The histogram was generated through random sampling and includes a kernel density estimate with an overlay of the normal distribution curve, providing a visual comparison between the empirical data and the theoretical model.

In genomic research, Pearson correlation has traditionally been the default method for measuring associations; it corresponds to the GCC with $γ = 1$, which aligns with the Pearson linear measure. However, relying solely on Pearson correlation can obscure important molecular interactions, particularly those involving non-linear relationships that deviate from the assumptions of a linear model.

To address this limitation, we applied the GCC with varying values of $γ$, expanding the analysis to capture a broader spectrum of associations within the biological system. Each value of $γ$ yields its own network, and together these networks provide a more comprehensive view of gene interactions.

The diverse networks offer strong candidates for further biological validation and may uncover deeper insights into the underlying molecular mechanisms. The parameter $γ$ was set at multiple levels in constructing the RNs, that is, $γ \in \{1, 0.86, 0.71, 0.57, 0.43, 0.29, 0.14, 0\}$, following guidelines from previous studies [52]. A threshold criterion of $r_{P}^{2^{'}} > 0.5$ was used to identify subgraphs, defining the RNs.
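The eight values of $γ$ listed above coincide with an evenly spaced grid $k/7$, for $k = 7, 6, \ldots, 0$, rounded to two decimals. This is an observation about the listed numbers rather than something stated in the text, and the short Python check below only reproduces the list.

```python
# Evenly spaced grid on [0, 1] with step 1/7, rounded to two decimals.
gammas = [round(k / 7, 2) for k in range(7, -1, -1)]
print(gammas)  # [1.0, 0.86, 0.71, 0.57, 0.43, 0.29, 0.14, 0.0]
```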

Figures 2–11 present a detailed illustration of the RNs derived using the GCC for various values of $γ$, as well as networks constructed using the Spearman correlation coefficient. In these figures, green edges represent negative correlations, while red edges represent positive correlations.

As $γ$ decreases, the GCC becomes more selective, isolating the strongest and most robust correlations. The result is sparser networks with fewer, but potentially more biologically relevant, connections.

Interestingly, the network structure obtained using the Spearman correlation closely resembles that derived from the GCC at $γ = 0.57$. This resemblance arises because the GCC at $γ = 0.57$ captures both linear and monotonic relationships, similar to those measured by the Spearman correlation. These results underscore the flexibility of the GCC in adapting to different types of dependencies present in gene expression data [35].

The impact of varying the parameter $γ$ on network topology is illustrated in Figures 3–11. In these networks, nodes represent genes, and edges represent high correlations between gene expression levels, with the correlation coefficients indicated along the edges. Blue edges represent correlations that have weakened compared to the preceding value of $γ$, while violet edges indicate correlations that have remained strong or increased. Through the analysis of these RNs, we observe that lower values of $γ$ effectively filter out weaker correlations, allowing the GCC to emphasize the strongest and most biologically relevant associations. This behavior is consistent with the findings from our earlier simulation study and highlights the practical utility of the GCC in biological applications. The flexibility to adjust $γ$ provides a powerful tool for examining data from multiple perspectives, ensuring that important non-linear or complex dependencies are captured.

When the value of $γ$ is reduced from 1 to 0.86, some correlations slightly decrease in magnitude, while others increase. For example, the correlation between PRNP and LRP1 changes from $-0.5905$ to $-0.6012$, reflecting a subtle increase in absolute value. This occurs because higher $γ$ values emphasize linear relationships, making non-linear interactions more noticeable as $γ$ decreases [52]. As $γ$ decreases further from 0.86 to 0.71, approximately 23% of the correlations increase in strength, a reduction compared to the previous step.

For instance, the correlation between KBTBD4 and SMG7 decreases from 0.6861 to 0.6808, highlighting the increasing selectivity of the GCC at low $γ$ values, where it focuses on stronger correlations. At $γ = 0.57$, only about 12% of the correlations show an increase in strength, continuing the trend of filtering out weaker associations. This demonstrates how the GCC progressively isolates the most robust interactions, prioritizing biologically relevant connections as $γ$ decreases. Notably, as $γ$ is reduced from 0.29 to 0.14 and eventually to 0, the number of edges in the networks decreases, reflecting fewer correlations surpassing the threshold of 0.5. This emphasizes the role of the GCC in isolating the strongest interactions, providing clearer insights into strong gene associations.
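The step-by-step comparison described above amounts to checking, for each gene pair present at two consecutive values of $γ$, whether the absolute correlation went up or down. The Python sketch below uses only the two pairs quoted in the text; the function name and the dictionary layout are assumptions, and the color labels follow the blue/violet convention described for the figures.

```python
def compare_steps(prev, curr):
    """Compare correlations of shared gene pairs at two consecutive gamma values.

    prev, curr: dicts mapping a gene pair to its correlation coefficient.
    Returns a color per shared pair and the share of pairs that strengthened.
    """
    shared = sorted(set(prev) & set(curr))
    colors = {}
    for pair in shared:
        weakened = abs(curr[pair]) < abs(prev[pair])
        colors[pair] = "blue" if weakened else "violet"
    stronger = sum(1 for p in shared if abs(curr[p]) > abs(prev[p]))
    share = stronger / len(shared) if shared else float("nan")
    return colors, share


# Values quoted in the text: gamma 1 -> 0.86 and gamma 0.86 -> 0.71.
step_a = compare_steps({("PRNP", "LRP1"): -0.5905}, {("PRNP", "LRP1"): -0.6012})
step_b = compare_steps({("KBTBD4", "SMG7"): 0.6861}, {("KBTBD4", "SMG7"): 0.6808})
print(step_a)
print(step_b)
```

With the full edge lists for two consecutive networks, the returned share would correspond to percentages such as the 23% and 12% reported above.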

The network derived using the Spearman correlation coefficient, shown in Figure 11, closely resembles the GCC network at $γ = 0.57$. However, a unique negative correlation with magnitude exceeding 0.5 between HDDC3 and PRNP is captured by the Spearman coefficient and is not detected in any GCC configuration. This highlights the ability of the Spearman coefficient to capture monotonic relationships that may not align with linear models, revealing interactions that could be overlooked when using only the GCC. These observations underscore the sensitivity of the GCC to the parameter $γ$, while the comparison with the Spearman correlation demonstrates the importance of using multiple correlation measures to capture a broader spectrum of interactions. This comprehensive approach is essential for fully understanding the complexity of gene expression data and the underlying biological processes. Figure 12 provides a visual guide to the methodology employed in constructing RNs, illustrating the application of the GCC across different values of $γ$. This flowchart clarifies the analytical process from data collection to network construction, and serves as a reference for understanding how different parameter settings affect the results.
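As a toy illustration of why a rank-based coefficient can flag a relationship that a linear one understates, the Python sketch below compares the Pearson and Spearman coefficients on data that are perfectly monotone but strongly non-linear. The data are synthetic and unrelated to the HDDC3 and PRNP measurements; the rank function assumes no ties.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [0.5 * i for i in range(1, 21)]
y = [math.exp(v) for v in x]  # monotone in x, far from linear
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

The Spearman value equals 1 here because the ranks of `x` and `y` agree exactly, while the Pearson value is pulled well below 1 by the curvature.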

In summary, our methodology highlights the importance of adaptive analytical strategies in unraveling complex biological networks. By moving beyond conventional correlation measures, we invite researchers to explore a wider range of interactions within molecular systems. Deriving meaningful insights from quantitative data requires a delicate balance between statistical precision and biological interpretation.

5. Conclusions

Biomedical informatics plays a pivotal role in elucidating molecular interactions, which are essential for advancing medical diagnostics and therapeutic development. One of the ongoing challenges is the analysis of high-dimensional gene expression data, where traditional correlation coefficients, like Pearson and Spearman, often fall short, particularly when addressing non-linear relationships. This challenge emphasizes the need for more robust and flexible analytical tools. In this study, we applied the generalized correlation coefficient, utilizing its flexibility parameter, to analyze gene association networks. The generalized correlation coefficient provides a versatile tool for capturing complex dependencies in molecular biology data, adapting to different correlation structures. To our knowledge, this research represents one of the earliest applications of the generalized correlation coefficient in genomic studies, addressing the shortcomings of conventional correlation methods in dealing with outliers and deviations from normality.

We introduced computational refinements, including robust estimators based on U-statistics and Fisher-consistent estimators, supported by advanced techniques such as the delta method. These improvements enhance both the reliability and the practical utility of the generalized correlation coefficient when analyzing high-dimensional biological data, such as gene expression profiles. However, it is important to acknowledge that the generalized correlation coefficient increases computational demands compared to traditional methods, posing a challenge for large-scale genomic datasets, which are common in modern research. Key findings from our analysis include the following:

The adaptability of the generalized correlation coefficient to various data complexities, demonstrating robustness and sensitivity in gene association network analysis.
The influence of the flexibility parameter on network topology, where low values of this parameter lead to sparser networks, emphasizing the strongest correlations.
The detection of unique interactions using the Spearman correlation, not captured by any configuration of the generalized correlation coefficient, underscoring the importance of applying multiple correlation measures for comprehensive data analysis.

While the focus of this study has been on gene expression data and relevance networks, the flexibility of the generalized correlation coefficient allows it to be applied to other types of biological data that exhibit non-linear dependencies. For instance, protein–protein interaction networks and microbiome data, which often involve complex and high-dimensional relationships, can benefit from the robust correlation measures provided by the generalized correlation coefficient. This extends the method's applicability beyond genomics, making it a valuable tool for broader applications in systems biology and molecular interactions.

Despite the advantages of using the generalized correlation coefficient, its computational burden presents a practical challenge, particularly when applied to large datasets. Optimizing the algorithm’s computational efficiency, or integrating it into high-performance computing environments, would facilitate its broader use in genomic research. Additionally, while comprehensive, the dataset employed in this study was based on high-throughput microarray technology, which has inherent limitations, such as background noise and the inability to detect novel transcripts or non-coding ribonucleic acids. These limitations may affect the accuracy of the constructed gene association networks. Moreover, while the generalized correlation coefficient was applied here to gene expression data, using it to extend the usual correlation coefficients in other settings, such as those arising in single-cell analysis, represents a promising avenue for future research. This could help in capturing even more complex dependencies in high-dimensional biological data, further broadening the applicability of the generalized correlation coefficient to modern challenges in data analysis.

Our empirical analysis was conducted on a subset of 57 normal gastric tissue observations using the R statistical software, version 4.4.2 [53]. While this subset provided valuable insights, it may not capture the full spectrum of biological variability. Future work should validate the application of the generalized correlation coefficient using larger and more diverse datasets, including different tissue types and pathological conditions, to ensure the generalizability of the findings.

Another aspect for further study concerns the asymmetries identified in genomic data distributions, where quantile regression methods [54,55] could be explored. Researchers might also consider employing other types of asymmetric distributions for the tests under study. Moreover, utilizing machine learning techniques in genomic data analysis is a promising avenue for future research.

In conclusion, this study demonstrated the flexibility and strength of the generalized correlation coefficient for analyzing complex molecular interactions in biomedical informatics. By capturing both linear and non-linear relationships, this coefficient proved to be effective for researchers working with high-dimensional biological data. Our methodology expands the statistical toolkit for genomic and biomedical research, with applications in controlled simulations and real-world datasets. The adaptability of this coefficient to varying correlation structures and data complexities offers valuable insights into gene expression dynamics and their implications for precision medicine.

Author Contributions

Conceptualization, R.O., C.M.X., G.H.E., P.L.E., C.C. and V.L.; data curation, R.O., C.M.X., G.H.E., P.L.E. and C.C.; formal analysis, R.O., C.M.X., G.H.E., P.L.E., C.C. and V.L.; investigation, R.O., C.M.X., G.H.E., P.L.E., C.C. and V.L.; methodology, R.O., C.M.X., G.H.E., P.L.E., C.C. and V.L.; writing—original draft, R.O., C.M.X., G.H.E. and P.L.E.; writing—review and editing, V.L. and C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico—CNPq—, No. 303192/2022-4, and Fundação de Amparo a Ciência e Tecnologia do Estado da Bahia—FAPESB—, No. APP0021/2023 (R.O.); by the Vice-rectorate for Research, Creation, and Innovation—VINCI—of the Pontificia Universidad Católica de Valparaíso—PUCV—, Chile, under grants VINCI 039.470/2024—regular research—, VINCI 039.493/2024—interdisciplinary associative research—, VINCI 039.309/2024—PUCV centenary—, and FONDECYT 1200525 (V.L.) from the National Agency for Research and Development—ANID—of the Chilean government; and by Portuguese funds through the CMAT—Research Centre of Mathematics of University of Minho, Portugal, within projects UIDB/00013/2020—https://doi.org/10.54499/UIDB/00013/2020, accessed on 4 November 2024—and UIDP/00013/2020—https://doi.org/10.54499/UIDP/00013/2020, accessed on 4 November 2024—(C.C.).

Data Availability Statement

The dataset used for analysis in this study originates from the FAPESP research project No. 06/03227-2, titled “Gene Expression in Stomach and Esophagus Tumors: From Biology to Diagnosis”. The data were obtained through a collaboration between State University of Paraíba and the Sírio-Libanês Hospital in Brazil. The data and codes used in this study are available on GitHub at https://github.com/Raydonal/SCGeneNetworkGCC (accessed on 4 November 2024). Please contact the authors for any additional information.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable comments and suggestions, which helped us to improve the quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Histogram representing the data distribution, with a kernel density estimate and an overlay of the normal distribution curve.

Figure 2. Relevance network constructed using GCC for different values of $γ$, $ρ$, and $ρ_{S}$, where green edges represent negative correlations, while red edges represent positive correlations.

Figure 3. Gene interaction network using the GCC with $γ = 1$ and $|ρ| > 0.5$, where nodes represent genes and edges represent high correlations between gene expression levels, with correlation coefficients indicated.

Figure 4. Gene interaction network using the GCC with $γ = 0.86$ and $|ρ| > 0.5$, where blue edges represent correlations that have weakened compared to those with the previous value ($γ = 1$), while violet edges indicate correlations that have remained strong or increased.

Figure 5. Gene interaction network using the GCC with $γ = 0.71$ and $|ρ| > 0.5$, where blue edges represent correlations that have weakened compared to those with the previous value ($γ = 0.86$), while violet edges indicate correlations that have remained strong or increased.

Figure 6. Gene interaction network using the GCC with $γ = 0.57$ and $|ρ| > 0.5$, where blue edges represent correlations that have weakened compared to those with the previous value ($γ = 0.71$), while violet edges indicate correlations that have remained strong or increased.

Figure 7. Gene interaction network using the GCC with $γ = 0.43$ and $|ρ| > 0.5$, where blue edges represent correlations that have weakened compared to those with the previous value ($γ = 0.57$), while violet edges indicate correlations that have remained strong or increased.

Figure 8. Gene interaction network using the GCC with $γ = 0.29$ and $|ρ| > 0.5$, where blue edges represent correlations that have weakened compared to those with the previous case ($γ = 0.43$), while violet edges indicate correlations that have remained strong or increased.

Figure 9. Gene interaction network with correlation coefficients ($γ = 0.14$ and $|ρ| > 0.5$), where blue edges represent correlations that have weakened compared to those in the previous case (with $γ = 0.29$), while violet edges indicate correlations that have remained strong or increased.

Figure 10. Gene interaction network using the GCC with $γ = 0$ and $|ρ| > 0.5$, where blue edges represent correlations that have weakened compared to those with the previous value ($γ = 0.14$).

Figure 11. Gene interaction network using the Spearman correlation coefficient with $|ρ_{S}| > 0.5$.

Figure 12. Flowchart of data analysis process with steps from data collection to network construction.
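The thresholding and edge-colouring conventions described in the captions of Figures 2 to 11 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `relevance_edges` and `compare_edges`, the three-gene matrices, and the rule that an edge counts as "weakened" when its absolute correlation drops relative to the previous value of $γ$ are all assumptions made for this example.

```python
from itertools import combinations


def relevance_edges(corr, genes, threshold=0.5):
    """Return (gene_a, gene_b, r, sign) for every pair with |r| > threshold."""
    edges = []
    for i, j in combinations(range(len(genes)), 2):
        r = corr[i][j]
        if abs(r) > threshold:
            sign = "positive" if r > 0 else "negative"
            edges.append((genes[i], genes[j], r, sign))
    return edges


def compare_edges(prev_corr, curr_corr, genes, threshold=0.5):
    """Label each edge kept at the current setting.

    'weakened' means |r| dropped relative to the previous matrix;
    'maintained' means |r| stayed the same or increased.
    (Assumed rule, chosen to mirror the blue/violet description.)
    """
    labels = {}
    for i, j in combinations(range(len(genes)), 2):
        curr = abs(curr_corr[i][j])
        if curr > threshold:
            prev = abs(prev_corr[i][j])
            labels[(genes[i], genes[j])] = "weakened" if curr < prev else "maintained"
    return labels


# Hypothetical 3-gene example; the values are illustrative, not from the study.
genes = ["G1", "G2", "G3"]
corr_gamma_1 = [[1.0, 0.82, -0.64],
                [0.82, 1.0, 0.10],
                [-0.64, 0.10, 1.0]]
corr_gamma_086 = [[1.0, 0.75, -0.70],
                  [0.75, 1.0, 0.05],
                  [-0.70, 0.05, 1.0]]

edges = relevance_edges(corr_gamma_1, genes)
labels = compare_edges(corr_gamma_1, corr_gamma_086, genes)
```

Any symmetric matrix of pairwise coefficients (GCC at a given $γ$, Pearson, or Spearman) could be passed as `corr`; the sketch only covers the selection and labelling of edges, not the estimation of the coefficients.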

Table 1. RMSE of the estimators for Case 1, with the indicated values of $ρ$, $γ$, and $n$.

$ρ$	$γ$	Estimator	$n = 10$	$n = 50$	$n = 100$	$n = 250$	$n = 500$
0	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.1097	0.0892	0.0864	0.0849	0.0840
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.1528	0.0735	0.0649	0.0594	0.0574
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2246	0.0925	0.0642	0.0407	0.0284
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3252	0.1315	0.0905	0.0612	0.0465
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2969	0.1253	0.0869	0.0554	0.0386
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2993	0.1271	0.0884	0.0572	0.0407
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3450	0.1405	0.0967	0.0654	0.0497
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3097	0.1315	0.0912	0.0582	0.0406
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3171	0.1356	0.0943	0.0611	0.0434
$0.3$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2361	0.0931	0.0640	0.0432	0.0328
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2404	0.0944	0.0649	0.0411	0.0286
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2187	0.0905	0.0628	0.0406	0.0288
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3252	0.1315	0.0905	0.0612	0.0465
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2969	0.1253	0.0869	0.0554	0.0386
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2993	0.1271	0.0884	0.0572	0.0407
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3450	0.1405	0.0967	0.0654	0.0497
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3097	0.1315	0.0912	0.0582	0.0406
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3171	0.1356	0.0943	0.0611	0.0434
$0.9$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.0631	0.0343	0.0281	0.0243	0.0228
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.1380	0.0470	0.0314	0.0196	0.0135
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.1418	0.0542	0.0378	0.0258	0.0199
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3252	0.1315	0.0905	0.0612	0.0465
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2969	0.1253	0.0869	0.0554	0.0386
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2993	0.1271	0.0884	0.0572	0.0407
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3450	0.1405	0.0967	0.0654	0.0497
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3097	0.1315	0.0912	0.0582	0.0406
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3171	0.1356	0.0943	0.0611	0.0434

Table 2. RMSE of the estimators for Case 2, with the indicated values of $ρ$, $γ$, and $n$.

$ρ$	$γ$	Estimator	$n = 10$	$n = 50$	$n = 100$	$n = 250$	$n = 500$
0	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2017	0.1618	0.1404	0.1086	0.0955
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2725	0.0774	0.0554	0.0336	0.0241
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2246	0.0925	0.0642	0.0407	0.0284
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2104	0.1702	0.1487	0.1165	0.1023
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2829	0.0853	0.0601	0.0385	0.0289
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2301	0.0975	0.0682	0.0437	0.0302
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2186	0.1785	0.1558	0.1243	0.1088
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2934	0.0912	0.0643	0.0415	0.0321
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2378	0.1028	0.0723	0.0469	0.0326
$0.3$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2785	0.2371	0.1953	0.1456	0.1264
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3154	0.1352	0.0941	0.0600	0.0418
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2187	0.0905	0.0628	0.0406	0.0288
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2902	0.2501	0.2103	0.1557	0.1352
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3315	0.1439	0.1003	0.0640	0.0446
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2993	0.1271	0.0884	0.0572	0.0407
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3105	0.2802	0.2403	0.1804	0.1607
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3097	0.1315	0.0912	0.0582	0.0406
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3171	0.1356	0.0943	0.0611	0.0434
$0.9$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.1891	0.1563	0.1352	0.1014	0.0882
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.1380	0.0470	0.0314	0.0196	0.0135
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.1418	0.0542	0.0378	0.0258	0.0199
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2003	0.1655	0.1428	0.1087	0.0934
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.1487	0.0501	0.0334	0.0212	0.0147
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.1472	0.0585	0.0406	0.0276	0.0206
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2104	0.1745	0.1504	0.1156	0.0998
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.1578	0.0532	0.0356	0.0228	0.0158
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.1539	0.0621	0.0434	0.0298	0.0223

Table 3. RMSE of the estimators for Case 3, with the indicated values of $ρ$, $γ$, and $n$.

$ρ$	$γ$	Estimator	$n = 10$	$n = 50$	$n = 100$	$n = 250$	$n = 500$
0	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3258	0.1807	0.1404	0.1086	0.0955
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2725	0.1457	0.1278	0.1167	0.1126
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2518	0.1434	0.1275	0.1174	0.1137
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.4190	0.2331	0.1774	0.1308	0.1103
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3277	0.1773	0.1520	0.1359	0.1299
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3235	0.1771	0.1532	0.1375	0.1316
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.4539	0.2535	0.1914	0.1383	0.1142
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3492	0.1880	0.1589	0.1405	0.1337
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.1638	0.1405	0.1420	0.1435	0.1438
$0.3$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.4648	0.2602	0.1959	0.1405	0.1151
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2309	0.1127	0.0963	0.0836	0.0786
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2415	0.1054	0.0857	0.0690	0.0626
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2810	0.2546	0.2534	0.2521	0.2514
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2309	0.1127	0.0963	0.0836	0.0786
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2415	0.1054	0.0857	0.0690	0.0626
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2810	0.2546	0.2534	0.2521	0.2514
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2309	0.1127	0.0963	0.0836	0.0786
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2415	0.1054	0.0857	0.0690	0.0626
$0.9$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2902	0.2501	0.2103	0.1557	0.1352
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3315	0.1439	0.1003	0.0640	0.0446
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2993	0.1271	0.0884	0.0572	0.0407
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3105	0.2802	0.2403	0.1804	0.1607
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3097	0.1315	0.0912	0.0582	0.0406
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3171	0.1356	0.0943	0.0611	0.0434
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3233	0.1393	0.0976	0.0622	0.0437
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3428	0.1493	0.1048	0.0668	0.0470
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.3434	0.1497	0.1050	0.0670	0.0471

Table 4. RMSE of the estimators for Case 4, with the indicated values of $ρ$, $γ$, and $n$.

$ρ$	$γ$	Estimator	$n = 10$	$n = 50$	$n = 100$	$n = 250$	$n = 500$
0	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2489	0.1067	0.0754	0.0523	0.0417
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2480	0.0981	0.0676	0.0428	0.0298
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.2246	0.0925	0.0642	0.0407	0.0284
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3433	0.1537	0.1092	0.0760	0.0606
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3154	0.1352	0.0941	0.0600	0.0418
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3129	0.1336	0.0931	0.0592	0.0414
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3645	0.1652	0.1176	0.0820	0.0653
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3315	0.1439	0.1003	0.0640	0.0446
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3330	0.1437	0.1003	0.0639	0.0446
$0.3$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2361	0.0931	0.0640	0.0432	0.0328
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2404	0.0944	0.0649	0.0411	0.0286
		${\bar{ρ}}_{γ}$ (Adjusted Spearman)	0.2187	0.0905	0.0628	0.0406	0.0288
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3252	0.1315	0.0905	0.0612	0.0465
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2969	0.1253	0.0869	0.0554	0.0386
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.2993	0.1271	0.0884	0.0572	0.0407
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3450	0.1405	0.0967	0.0654	0.0497
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3097	0.1315	0.0912	0.0582	0.0406
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3171	0.1356	0.0943	0.0611	0.0434
$0.9$	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.0631	0.0343	0.0281	0.0243	0.0228
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.1380	0.0470	0.0314	0.0196	0.0135
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.1418	0.0542	0.0378	0.0258	0.0199
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.0525	0.0280	0.0228	0.0195	0.0181
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.0940	0.0331	0.0225	0.0142	0.0099
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.1357	0.0456	0.0310	0.0208	0.0158
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.0480	0.0254	0.0205	0.0175	0.0163
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.0839	0.0286	0.0193	0.0122	0.0085
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.1301	0.0417	0.0281	0.0187	0.0142

Table 5. RMSE of the estimators for Case 5 with contamination levels of 10%, 30%, and 50%, and $ρ = 0.1$, with the indicated values of $γ$ and $n$.

Contamination	$γ$	Estimator	$n = 10$	$n = 50$	$n = 100$	$n = 250$	$n = 500$
10%	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2238	0.0991	0.0768	0.0550	0.0456
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2495	0.1115	0.0864	0.0619	0.0514
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3117	0.1427	0.1107	0.0795	0.0660
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2685	0.1090	0.0771	0.0482	0.0341
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3224	0.1417	0.1025	0.0649	0.0468
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3444	0.1694	0.1339	0.0959	0.0762
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3128	0.1328	0.0943	0.0593	0.0421
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3321	0.1424	0.1013	0.0637	0.0453
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3327	0.1428	0.1015	0.0639	0.0454
30%	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2179	0.1085	0.0875	0.0663	0.0566
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2431	0.1220	0.0984	0.0746	0.0638
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3041	0.1559	0.1261	0.0958	0.0819
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2779	0.1146	0.0802	0.0498	0.0360
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3423	0.1594	0.1158	0.0733	0.0537
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3745	0.2098	0.1703	0.1253	0.1051
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3189	0.1371	0.0961	0.0603	0.0437
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3383	0.1471	0.1032	0.0647	0.0469
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3389	0.1474	0.1035	0.0649	0.0471
50%	0	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2188	0.1132	0.0898	0.0697	0.0584
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.2440	0.1272	0.1010	0.0785	0.0658
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3050	0.1626	0.1293	0.1008	0.0844
	$0.5$	${\hat{ρ}}_{γ}$ (GCC-ML)	0.2835	0.1177	0.0823	0.0518	0.0364
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3572	0.1709	0.1227	0.0788	0.0563
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3957	0.2349	0.1849	0.1395	0.1109
	1	${\hat{ρ}}_{γ}$ (GCC-ML)	0.3233	0.1393	0.0976	0.0622	0.0437
		${\tilde{ρ}}_{γ}$ (GCC-U)	0.3428	0.1493	0.1048	0.0668	0.0470
		${\bar{ρ}}_{γ}$ (adjusted Spearman)	0.3434	0.1497	0.1050	0.0670	0.0471
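The RMSE values in Tables 1 to 5 summarize how far each estimator falls from the true correlation over repeated simulated samples. The sketch below shows one way such a Monte Carlo RMSE could be computed. It is only an illustration under stated assumptions: it uses the ordinary Pearson and Spearman coefficients as stand-ins rather than the GCC estimators, draws uncontaminated bivariate normal data, and the function names, replication count, and seed are hypothetical choices for this example.

```python
import math
import random


def ranks(values):
    """Ranks 1..n; continuous data are assumed, so ties are not handled."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(estimates, truth):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def simulate_rmse(estimator, rho, n, replications=200, seed=1):
    """Monte Carlo RMSE of `estimator` for bivariate normal data with correlation rho."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        x, y = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            x.append(z1)
            # Linear combination that gives corr(x, y) = rho.
            y.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        estimates.append(estimator(x, y))
    return rmse(estimates, rho)


rmse_small = simulate_rmse(pearson, rho=0.3, n=10)
rmse_large = simulate_rmse(pearson, rho=0.3, n=100)
```

Passing `spearman` instead of `pearson` measures the rank-based coefficient against the same target $ρ$; because its population value differs slightly from $ρ$ under normality, part of that RMSE would reflect bias rather than sampling variability.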

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ospina, R.; Xavier, C.M.; Esteves, G.H.; Espinheira, P.L.; Castro, C.; Leiva, V. Symmetry and Complexity in Gene Association Networks Using the Generalized Correlation Coefficient. Symmetry 2024, 16, 1510. https://doi.org/10.3390/sym16111510
