1. Introduction
The partial correlation coefficient (Pcor) measures the correlation between two random variables, X and Y, after accounting for the effects of a set of controlling variables Z, and is denoted by ρ_{XY·Z}. The Pcor essentially quantifies the unique relationship between X and Y after removing the correlations between X and Z and between Y and Z [1]. This correlation coefficient provides a more thorough comprehension of the connection between variables, untainted by the influence of confounding factors. Unlike the Pearson correlation coefficient, which only captures the direct correlation between random variables, the Pcor enables the identification of whether correlations stem from intermediary variables. This distinction enhances the precision and validity of statistical analyses.
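For a single controlling variable Z, the Pcor reduces to the familiar first-order formula in terms of pairwise Pearson correlations,

\[
\rho_{XY\cdot Z}=\frac{\rho_{XY}-\rho_{XZ}\,\rho_{YZ}}{\sqrt{(1-\rho_{XZ}^{2})\,(1-\rho_{YZ}^{2})}},
\]

and, more generally, ρ_{XY·Z} equals the Pearson correlation of the residuals obtained from regressing X on Z and Y on Z, which is the viewpoint underlying the estimation algorithms discussed below.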
The Pcor is a fundamental statistical tool for investigating intricate relationships and gaining a more profound comprehension of the underlying mechanisms in a variety of scientific fields, such as psychology, biology, economics, and the social sciences. When examining genetic markers and illness outcomes, biologists have used the Pcor to identify correlations while accounting for potential confounding factors [2,3,4]. Marrelec et al. utilised the partial correlation matrix to explore large-scale functional brain networks through functional MRI [5]. In the field of economics, the Pcor assists in comprehending complex connections, including the interplay between interest rates and inflation, while considering the influence of other variables [6]. The financial industry also employs the Pcor to interpret the connections and relationships between stocks in the financial markets [7,8]. For example, Michis proposed a wavelet procedure for estimating the Pcor between stock market returns over different time scales and implemented it for portfolio diversification [9]. Using partial correlations within a complex network framework, Singh et al. examined the degree of globalisation and regionalisation of stock market linkages and how these linkages vary across different economic or market cycles [10]. Meanwhile, the Gaussian graphical model (GGM) technique has recently gained popularity in psychology for describing the relationships between observed variables; it employs Pcors to represent pairwise interdependencies while controlling for the influence of all other variables [11,12,13]. In the geosciences, a correlation analysis based on the Pcor of the fractal dimensions of the H, Z, and D component variations has been employed to study geomagnetic field component variations in Russia [14].
Several methodologies have been proposed over the years to estimate the Pcor in statistical analyses. For instance, Peng et al. introduced a Pcor estimation technique that relies on the sparsity of the partial correlation matrix and utilises sparse regression methods [3]. Khare et al. suggested a high-dimensional graphical model selection approach based on pseudolikelihood [15]. Kim provided the R package “ppcor” for the fast calculation of partial and semi-partial correlation coefficients [16]. Huang et al. introduced the kernel partial correlation coefficient as a measure of the conditional dependence between two random variables in various topological spaces [17]. Van Aert and Goos focused on calculating the sampling variance of the Pcor [18]. Hu and Qiu proposed a statistical inference procedure for the Pcor under the high-dimensional nonparanormal model [19]. However, these methods mainly centre on determining whether or not the Pcor is zero, with insufficient regard for the precision of the Pcor estimate and the efficiency of the algorithm. We analysed multiple high-dimensional algorithms and discovered notable Pcor estimation biases, particularly for positive Pcor values; these biases persisted even with larger sample sizes. Motivated by these findings, our primary goal is to put forward a Pcor estimation algorithm that increases estimation precision and diminishes the estimation bias for positive Pcor values.
This paper reviews current methods for estimating Pcor in high-dimensional data. We introduce a novel minimum residual sum of squares (MRSS) Pcor estimation method under high-dimensional conditions, aiming to mitigate the estimation bias for positive Pcor. The algorithm’s effectiveness is validated through simulation studies under sparse and non-sparse conditions and real data analysis on stock markets.
The sections are structured as follows: Section 2 outlines definitions and corresponding formulae for calculating the Pcor and examines common algorithms for estimating it. Section 3 presents our minimum residual sum of squares (MRSS) Pcor estimation algorithm, designed to mitigate the estimation bias for positive Pcor. In Section 4, we demonstrate the effectiveness of the proposed algorithm through simulation studies on high-dimensional data under both sparse and non-sparse conditions. Section 5 provides an analysis of real data from stock markets, and Section 6 concludes.
3. Minimum Residual Sum of Squares Pcor Estimation Algorithm
3.1. Motivation
From the comprehensive simulations in this paper, it is evident that the Pcor estimation methods discussed exhibit significant bias. This bias becomes more pronounced as the true Pcor increases, especially when the Pcor is positive. Therefore, further research is necessary to address this estimation bias in positive Pcor scenarios. While each algorithm has its merits, the Reg2 algorithm performs notably well when the Pcor is below approximately 0.5, whereas the Coef and Var algorithms stand out with minimal bias when the Pcor exceeds roughly 0.5. Our goal is to develop a method that synergises the strengths of both the Reg2 and Var algorithms.
The models introduced in the Reg2 algorithm, (11) and (12), can be represented in residual form. When they are compared with models (13) and (14) from the Coef and Var algorithms, it is evident that the residuals ε_X^(1) and ε_X^(2) share commonalities: both provide insight into the information in X after the exclusion of the Y and Z effects in some sense. Similarly, ε_Y^(1) and ε_Y^(2) capture the essence of Y after removing the X and Z influences. If we choose the residual pair with the smaller residual sum of squares, we obtain a better fit for the corresponding regression models. A reduced residual sum of squares signifies enhanced precision in eliminating the effects of the controlling variables, leading to a more accurate Pcor estimator. Guided by the objective of minimising the residual sum of squares, we introduce a novel algorithm for high-dimensional Pcor estimation in the subsequent subsection.
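For orientation, a plausible form of these models, consistent with the residual notation above (the exact specifications of (11)–(14) are given in Section 2, so the following is an assumption for illustration), is

\[
\text{(Reg2)}\qquad X=Z^{\top}\beta_{X}+\varepsilon_{X}^{(1)},\qquad Y=Z^{\top}\beta_{Y}+\varepsilon_{Y}^{(1)},
\]
\[
\text{(Coef/Var)}\qquad X=\alpha_{X}Y+Z^{\top}\gamma_{X}+\varepsilon_{X}^{(2)},\qquad Y=\alpha_{Y}X+Z^{\top}\gamma_{Y}+\varepsilon_{Y}^{(2)}.
\]

Under the Reg2 view, the Pcor is the Pearson correlation of the fitted residuals ε_X^(1) and ε_Y^(1); the Coef and Var constructions recover it from the fitted primary coefficients and the residual variances.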
3.2. MRSS Algorithm and Its Implementation
We propose a novel minimum residual sum of squares partial correlation coefficient estimation algorithm, denoted by MRSS. This algorithm aims to diminish the estimation bias for positive Pcor values in high-dimensional settings. Our MRSS algorithm amalgamates the strengths of the Reg2, Coef, and Var algorithms, effectively curtailing the bias in Pcor estimation.
Define RSS_X(k) and RSS_Y(k), k ∈ {1, 2}, as the residual sum of squares of X after removing the effects of Y and Z, and the residual sum of squares of Y after removing the effects of X and Z, respectively. The tuning parameter k is chosen by minimising the residual sum of squares, so as to remove more of the associated effects and ensure a more efficient Pcor estimator. For k = 1, the pair (ε_X^(1), ε_Y^(1)) represents the residuals from the Reg2 algorithm’s models (11) and (12). For k = 2, (ε_X^(2), ε_Y^(2)) corresponds to the residuals from the Coef and Var algorithms’ models (13) and (14). Then, the residuals selected by the MRSS algorithm satisfy the minimum residual sum of squares for both X and Y, yielding a more efficient Pcor estimator, as follows:

\[
\hat{k}_X=\underset{k\in\{1,2\}}{\arg\min}\,\mathrm{RSS}_X(k),\qquad \hat{k}_Y=\underset{k\in\{1,2\}}{\arg\min}\,\mathrm{RSS}_Y(k). \tag{17}
\]
The Pcor estimated by MRSS is then given by

\[
\hat{\rho}_{XY\cdot Z}^{\,\mathrm{MRSS}}=\hat{\rho}^{(1)}\,I(\hat{k}=1)+\operatorname{sign}(\hat{\beta})\,\hat{\rho}^{(2)}\,I(\hat{k}=2), \tag{18}
\]

where I is the indicator function, k̂ denotes the common value of k̂_X and k̂_Y, ρ^(1) and ρ^(2) denote the Reg2-based and the Coef- and Var-based estimates, respectively, and β̂ is the primary regression coefficient in model (13), which supplies the sign of the variance-based estimate. If k̂ = 1, then the Pcor is estimated following the idea of the Reg2 algorithm; if k̂ = 2, then it is estimated following the idea of the Coef and Var algorithms. If the two k estimates in (17) differ, the more stable Reg2 algorithm is preferred, setting k̂ = 1 in (18). Given that MRSS integrates two existing algorithms, its convergence rate should align with theirs.
During the implementation of the MRSS algorithm (Algorithm 1), the Coef and Var algorithms often misestimate the Pcor as 0 or ±1 when the true Pcor is close to 0 or ±1, which affects the precision of the algorithms. To address this, we incorporate a discriminative condition in the MRSS pseudo-code: if the estimated Pcor from the Coef or Var construction is exactly 0 or ±1, the Coef and Var estimate is deemed unreliable, and the Reg2 algorithm’s estimate is adopted.
The proposed MRSS algorithm selects the most suitable residuals by minimising the RSS, removing the impact of the controlling variables so as to optimise the estimation of the residuals in the regression models. As such, the Pcor estimate generated by the MRSS algorithm combines the advantages of both algorithms, resulting in a more accurate estimate. Notably, our MRSS algorithm effectively addresses the Pcor estimation bias across the range of positive Pcor values. For instance, when the Coef and Var algorithms estimate the Pcor as 0 for a true Pcor near 0, the MRSS algorithm utilises the minimum RSS principle to select the Reg2 algorithm, which performs better in the vicinity of 0, and thereby efficiently avoids such misestimations. For intermediate Pcor values around 0.5, the MRSS algorithm employs the minimum RSS principle to determine the more accurate method between Reg2 and Var. This selection conforms to the minimum RSS principle, whereby the regression model and accompanying residuals are chosen to provide optimal estimation accuracy, leading to a more precise Pcor estimate. When the Pcor lies close to 1, the Reg2 algorithm’s estimates are typically too low, with high RSS; the MRSS method then selects the Var algorithm, whose RSS is small and which performs better, in line with the minimum RSS principle. In essence, the MRSS method amalgamates the merits of the Reg2 and Var algorithms: by minimising the residual sum of squares, MRSS chooses the algorithm with the smaller estimation error, which allows for the proficient regulation of the estimation bias of the Pcor.
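To make the procedure concrete, the following R sketch implements the selection rule under the model forms assumed above. It is a minimal illustration, not the authors’ exact Algorithm 1: it uses cross-validated lasso fits from the glmnet package in place of the paper’s MCP-penalised fits, and the classical variance-based identity pcor² = 1 − RSS(X | Y, Z)/RSS(X | Z) as a stand-in for the Coef- and Var-based estimate.

```r
library(glmnet)

# Minimal MRSS sketch: choose between Reg2-style and Coef/Var-style residuals
# by residual sum of squares (RSS). Cross-validated lasso stands in for the
# paper's MCP-penalised fits; x, y are n-vectors and z is an n x p matrix.
mrss_pcor <- function(x, y, z) {
  resid_of <- function(target, design) {
    fit <- cv.glmnet(design, target)
    as.numeric(target - predict(fit, newx = design, s = "lambda.min"))
  }
  ex1 <- resid_of(x, z)               # k = 1: X on Z       (Reg2, model (11))
  ey1 <- resid_of(y, z)               #        Y on Z       (Reg2, model (12))
  ex2 <- resid_of(x, cbind(y, z))     # k = 2: X on (Y, Z)  (Coef/Var, model (13))
  ey2 <- resid_of(y, cbind(x, z))     #        Y on (X, Z)  (Coef/Var, model (14))

  rho1 <- cor(ex1, ey1)               # Reg2 estimate: correlation of residuals
  beta <- coef(cv.glmnet(cbind(y, z), x), s = "lambda.min")[2]  # coefficient of Y
  # Variance-based stand-in via the classical identity
  # pcor^2 = 1 - RSS(X | Y, Z) / RSS(X | Z), signed by beta.
  rho2 <- sign(beta) * sqrt(max(0, 1 - sum(ex2^2) / sum(ex1^2)))

  k_x <- which.min(c(sum(ex1^2), sum(ex2^2)))   # eq. (17) for X
  k_y <- which.min(c(sum(ey1^2), sum(ey2^2)))   # eq. (17) for Y
  k <- if (k_x == k_y) k_x else 1               # disagreement: prefer Reg2
  if (k == 2 && (rho2 == 0 || abs(rho2) >= 1)) k <- 1  # discriminative condition
  if (k == 1) rho1 else rho2
}
```

The two fallbacks at the end mirror the preference for Reg2 when the k estimates disagree and the discriminative condition for unreliable Coef and Var estimates.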
4. Simulation
4.1. Data Generation
To study the estimation efficiency of the Pcor estimation algorithms under high-dimensional conditions, we generate n centralised samples (X_i, Y_i, Z_i), i = 1, …, n, i.i.d., where Z_i = (Z_{i1}, …, Z_{ip}) is the p-dimensional vector of controlling variables. Initially, we produce the n controlling samples Z_i independently and identically, with components generated independently from normal distributions with variance parameter σ². The samples X_i and Y_i are then generated by the linear models

\[
X_i=Z_i^{\top}a+\varepsilon_i,\qquad Y_i=Z_i^{\top}b+\delta_i,
\]

where a = (a_1, …, a_p) and b = (b_1, …, b_p) are coefficient vectors, and the error pairs (ε_i, δ_i) are drawn i.i.d. from a centred bivariate normal distribution with correlation ρ. The Pearson correlation of ε and δ gives the partial correlation coefficient Pcor; notably, there is a one-to-one mapping between the true Pcor and the ρ parameter.
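In R, this generation scheme can be sketched as follows, under the reconstruction above (linear models in Z with bivariate normal errors whose correlation equals the target Pcor; the coefficient vectors a and b follow the example patterns below):

```r
# Sketch of the generation scheme: n centred samples with true partial
# correlation `pcor` between X and Y given p controlling variables Z.
# The coefficient vectors a, b follow the example patterns below; sigma2
# plays the role of the variance parameter in the reconstruction above.
gen_data <- function(n, p, pcor, a, b, sigma2 = 1) {
  z  <- matrix(rnorm(n * p, sd = sqrt(sigma2)), n, p)
  e1 <- rnorm(n)                                  # error of X
  e2 <- pcor * e1 + sqrt(1 - pcor^2) * rnorm(n)   # error of Y; cor(e1, e2) = pcor
  x  <- as.numeric(z %*% a) + e1
  y  <- as.numeric(z %*% b) + e2
  list(x = x - mean(x), y = y - mean(y),          # centralised samples
       z = scale(z, scale = FALSE))
}
```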
Since our MRSS algorithm and the Reg2 algorithm perform essentially the same for non-positive Pcor values, our simulation focuses on true Pcor values in the interval (0, 1), which is prone to significant biases with existing methods. We let the true partial correlation coefficient vary over a grid of values in this interval, together with the sample size n, the controlling-variable dimension p, and the normal-distribution variance σ². For each combination, we estimate the partial correlation coefficient over 200 replications using the aforementioned estimation algorithms. We use the software R (4.3.1) for our simulations.
Recognising that both sparse and non-sparse conditions are prevalent in real-world applications [3,28], we present examples under both conditions. To ensure comparability between the examples, the initial l coefficients of a and b are fixed under both conditions, where l denotes the selected number of highly correlated controlling variables. For the non-sparse examples, the coefficients of a and b converge asymptotically to 0 at varying rates, with the coefficients beyond the l-th starting at a value significantly smaller than the initial l coefficients.
Example 1: under sparse conditions
Let the coefficients a and b be non-zero for the initial l elements and zero for the rest, i.e., a_j = b_j = 0 for j > l.
Example 2: under non-sparse conditions
Let the coefficients a and b be the same as in Example 1 for the initial l elements, with the remaining elements converging to 0, where r is a tuning parameter that controls how quickly the tail elements approach 0.
Example 3: under non-sparse conditions
Let the coefficients a and b be the same as in Example 1 for the initial l elements, with the remaining elements converging to 0 at a slower rate than in Example 2, again governed by the tuning parameter r.
Example 4: under non-sparse conditions
Let the coefficients a and b be the same as in Example 1 for the initial l elements, with the remaining elements converging to 0 at the slowest rate of the three non-sparse examples, yielding the strongest non-sparsity.
4.2. Simulation Results
4.2.1. By MSE and RMSE
We assess the efficacy of the Pcor estimation algorithms using the mean square error (MSE) and root mean square error (RMSE), which indicate the performance of the algorithms from different perspectives:

\[
\mathrm{MSE}=\frac{1}{R}\sum_{r=1}^{R}(\hat{\rho}_r-\rho)^{2},\qquad \mathrm{RMSE}=\sqrt{\frac{1}{R}\sum_{r=1}^{R}(\hat{\rho}_r-\rho)^{2}},
\]

where ρ is the true Pcor and ρ̂_r is the estimated Pcor in the r-th of R replications.
Table 1 displays the mean MSE and RMSE of the estimated Pcors over the grid of true Pcor values, for the settings of n, p, and σ² described above, across Examples 1–4 and the various methods. Table A1 and Table A2, which report the mean MSE and RMSE of the estimated Pcors for different numbers l of highly correlated controlling variables, can be found in the Appendix.
For small sample sizes, all algorithms tend to underperform owing to the limited information in the data, with the mean MSE and RMSE approximately ten times higher than for large sample sizes. Even so, our MRSS algorithm remains competitive, with both MSE and RMSE of the same order of magnitude as the best performer, Lasso.Reg2. For large sample sizes, however, the MRSS algorithm becomes notably superior. Specifically, the MRSS reduces the MSE appreciably compared to the suboptimal MCP.Reg2 (see Table 2), and this reduction grows with increasing n. The MRSS thus represents a significant improvement in algorithmic performance. Additionally, the MSE of the MRSS algorithm increases more slowly with the controlling dimension p, implying improved stability to some extent.
To compare the performance of the different algorithms more intuitively, we calculated the percentage difference in MSE as

\[
\mathrm{PD}_{A}=\frac{\mathrm{MSE}_{A}-\mathrm{MSE}_{\mathrm{MRSS}}}{\mathrm{MSE}_{\mathrm{MRSS}}}\times 100\%,
\]

with A ranging over the algorithms listed above; the percentage difference in RMSE is calculated similarly. Table 2 shows the average percentage differences in MSE and RMSE relative to the MRSS algorithm for small and large sample sizes under the same settings as Table 1. For small sample sizes, we observe a 10–20% decrease in MSE and RMSE for the MRSS algorithm relative to the Res algorithm, a 10–20% increase relative to Lasso.Reg2, and only slight changes relative to the other algorithms. For large sample sizes, the MRSS algorithm reduces the MSE by about 30–70% and the RMSE by 20–60% relative to the other algorithms, achieving effective control of the Pcor estimation error. These results further illustrate the superiority of the MRSS algorithm. For optimal Pcor estimation performance, we suggest using the MRSS algorithm with a sufficiently large sample size.
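Under these definitions, the summaries are straightforward to compute in R; in the small sketch below, the form of the percentage difference follows the reconstruction given above and is an assumption:

```r
# MSE and RMSE over R replications (est: vector of estimates, rho: true Pcor),
# and the percentage difference of MSE relative to MRSS (assumed form).
mse  <- function(est, rho) mean((est - rho)^2)
rmse <- function(est, rho) sqrt(mse(est, rho))

pd_mse <- function(est_by_alg, rho) {   # est_by_alg: named list incl. "MRSS"
  m <- vapply(est_by_alg, mse, numeric(1), rho = rho)
  (m - m[["MRSS"]]) / m[["MRSS"]] * 100
}
```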
For Examples 1–4, shifting from sparse to non-sparse conditions with increasing non-sparsity, we observe that all algorithms exhibit a higher MSE and RMSE under non-sparse conditions compared to sparse conditions, and the MSE and RMSE increase with increasing non-sparsity. This could be attributed to the greater impact and more complicated correlations of the controlling variables, resulting in a less accurate estimate of the partial correlation. However, even in Example 4 with the strongest non-sparsity, the MRSS algorithm still performs well, possessing the smallest MSE and RMSE and outperforming conventional algorithms. Especially under non-sparse conditions, the MRSS algorithm provides a dependable and accurate estimation of Pcor despite the influence of complex controlling variables.
4.2.2. For Pcor Values on (0, 1)
To investigate the effectiveness of the Pcor estimation algorithms across various Pcor values, we set a constant ratio of the dimension of the controlling variables to the sample size (i.e., a fixed p/n). Figure 1 displays the average estimated Pcor over 200 repetitions against the true Pcor for a small and a large sample size in Example 1. The MRSS, MCP.Reg2, and MCP.Var results are denoted in red, green, and blue, respectively. When the Pcor is small (roughly Pcor < 0.5), the MRSS accurately recovers the true Pcor, performing similarly to MCP.Reg2. When the Pcor is large (roughly Pcor > 0.5), the MRSS performs near-optimally, comparable to MCP.Var and only slightly behind RSS2. Essentially, the MRSS effectively amalgamates the strengths of the MCP.Reg2 and MCP.Var algorithms, reducing the potential weaknesses of each for Pcor estimation. For the small sample size, the MRSS leads to a significant improvement in the estimation of large Pcor values, but a considerable estimation bias remains for small Pcor values owing to the limited sample size and information. For the large sample size, the MRSS effectively reduces the Pcor estimation bias over the whole range of Pcor values. Consequently, greatly enhancing the sample size substantially boosts the MRSS estimation accuracy, even if the ratio of the controlling-variable dimension to the sample size increases from 2 to 10.
4.3. Parameter Sensitivity
We investigate the sensitivity of the performance of the MRSS algorithm to different parameter settings, such as the variance and the sparsity. This allows us to explore the robustness of the algorithms under different parameter configurations.
4.3.1. For Variance
We set a variance parameter σ² in the data generation to test the stability of our algorithm under varying variance. Table 3 shows the mean MSE and RMSE of the estimated Pcors under different variances σ², for a large and a small sample size, in Examples 1–4. We find that, as the variance increases from 1 to 40, the MSE and RMSE remain consistent across the examples and sample sizes. This indicates that our MRSS algorithm is highly robust to the variance and retains good stability.
4.3.2. For Sparsity
To evaluate the effectiveness of the algorithms under different sparsity conditions, we set the data-generation conditions to develop from sparse to non-sparse, with an increasingly non-sparse convergence rate from Example 1 to Example 4. This implies a greater influence of the controlling variables as we progress through the examples. From Table 1, Table 2 and Table 3 above, we observe that the MRSS algorithm performs well in all the examples. For moderate non-sparse convergence rates, as witnessed in Examples 2–3, MRSS demonstrates both low MSE and RMSE, comparable to the sparse conditions of Example 1. As the non-sparsity and the impact of the controlling variables increase in Example 4, even the best-performing MRSS encounters difficulties in reducing the estimation bias. Nevertheless, the MRSS algorithm remains the most favoured choice for estimating the Pcor under both sparse and non-sparse conditions. If the degree of non-sparsity of the initial data can be analysed, a better understanding of the algorithm’s error margin can be obtained.
Another indication of the sparsity strength is the number l of highly correlated controlling variables. Figure 2 illustrates the performance of the featured algorithms for varying values of l. The figure contrasts the average estimated Pcor with the true Pcor in Example 2, with the first row corresponding to a smaller l and the second row to a larger l. As l increases, the interference from the controlling variables in the estimation process becomes more pronounced, leading to a heightened estimation bias. However, the MRSS algorithm consistently showcases an optimal performance throughout the entire Pcor interval. Remarkably, despite encountering a high interference level at the largest l, MRSS keeps its estimates in close alignment with the diagonal, in contrast to its counterparts. Table 4 shows the mean MSE and RMSE for the varying values of l. As l increases, both the MSE and RMSE of the MRSS algorithm increase, but they remain only slightly short of optimal in small samples and significantly better than those of the other algorithms in large samples. These results demonstrate the robustness, stability, and precision advantages of the MRSS algorithm.
4.4. Summaries
Based on numerous simulations, our study examines the practicality and effectiveness of the MRSS algorithm in a variety of scenarios, providing valuable insights into its accuracy. We offer empirical evidence that MRSS effectively incorporates the strengths of the MCP.Reg2 and MCP.Var algorithms and reduces the potential weaknesses of Pcor estimation, especially in challenging environments with high-dimensional sparse and non-sparse conditions. For large sample sizes, the MRSS algorithm reduces the MSE by approximately 30–70% and the RMSE by 20–60% compared to the other algorithms, effectively controlling the Pcor estimation errors. For small sample sizes, a 10–20% reduction in MSE and RMSE is observed for the MRSS algorithm compared to the Res algorithm, a 10–20% increase compared to Lasso.Reg2, and only slight changes compared to the other algorithms.
Conducting a sensitivity analysis with various variance and sparsity parameters, the outcomes demonstrate the benefits of the MRSS algorithm in terms of robustness, stability, and accuracy. As the variance increases from 1 to 40, the MSE and RMSE remain consistent across the examples and sample sizes, demonstrating that our MRSS algorithm is remarkably resilient to variability and maintains excellent stability. As the level of sparsity decreases (from Example 1 to Example 4, or as l increases to 14), the MSE and RMSE of the MRSS algorithm increase, but remain within the same order of magnitude. Even the optimal MRSS algorithm undergoes a significant rise in MSE and RMSE for Example 4 and the largest l, as the escalation of non-sparse and intricate controlling variables introduces certain systematic errors.
5. Real Data Analysis
A distinguishing feature of financial markets is the observed correlation among the price movements of various financial assets. A prevalent feature is the substantial cross-correlation in the simultaneous time evolution of stock returns [29]. In numerous instances, however, a strong correlation does not necessarily imply a significant direct relationship. For instance, two stocks in the same market may be subject to shared macroeconomic or investor-psychology influences. Therefore, to examine the direct correlation between two stocks, it is necessary to eliminate the common drivers represented by the market index. The Pcor meets this requirement by assessing the direct relationship between the two stocks after removing the market impacts of the controlling variables. With an accurately estimated Pcor, it is possible to evaluate the impact of diverse factors (e.g., economic sectors, other markets, or macroeconomic factors) on a specific stock. The resulting partial correlation data may be utilised in fields such as stock market risk management, stock portfolio optimisation, and financial control [7,8]. Moreover, the Pcor can also indicate the interdependence and influence of industries in the context of global integration. These techniques for analysing the Pcor can provide valuable information on the correlations between different assets and different sectors of the economy, as they are generalisable and can be applied to other asset types and cross-asset relationships in financial markets. This information is beneficial for practitioners and policymakers.
We chose 100 stocks with substantial market capitalisation and robust liquidity from the Shanghai Stock Exchange (SSE) market. These stocks can comprehensively represent the overall performance of listed stock prices in China’s A-share market. We then downloaded their daily adjusted closing prices from Yahoo Finance for January 2018 to August 2023 and removed the missing data, leaving a sample size large enough to ensure the effectiveness of the algorithms and to limit the bias in Pcor estimation. For each pair of the 100 stocks, we estimate their Pcor by setting the remaining stocks as the corresponding controlling variables and construct the estimated Pcor matrix. The Pcor matrix then reflects the intrinsic correlation between two stocks after removing the influence of the rest of the stock market.
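A sketch of this construction in R, assuming a centred n × 100 matrix ret of daily returns (columns named by ticker) and the mrss_pcor function sketched in Section 3; data download and cleaning are omitted:

```r
# Estimate the 100 x 100 partial correlation matrix: for each pair of stocks,
# the remaining 98 stocks serve as the controlling variables.
pcor_matrix <- function(ret, estimator = mrss_pcor) {
  m <- ncol(ret)
  P <- diag(1, m)
  for (i in seq_len(m - 1)) {
    for (j in (i + 1):m) {
      z <- ret[, -c(i, j), drop = FALSE]
      P[i, j] <- P[j, i] <- estimator(ret[, i], ret[, j], z)
    }
  }
  dimnames(P) <- list(colnames(ret), colnames(ret))
  P
}
```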
Figure 3 presents the estimated Pcor matrices for the 100 SSE stocks using the MCP.Reg2, MCP.Var and MRSS algorithms, with blue signifying negative estimates and red positive ones. Whilst the MCP.Coef, MCP.Var, and RSS2 algorithms all estimate the Pcor as 0 when the true Pcor approaches 0, our proposed MRSS algorithm resembles MCP.Reg2, which estimates the Pcor accurately for weak partial correlations; thus, the MRSS is capable of effectively estimating weak partial correlations. When dealing with high Pcor values and strong partial correlations, we find that the MCP.Var algorithm overestimates the Pcor as a result of the divergence in stock prices: for two stocks with higher prices, the Pcor estimated by the Var algorithm tends to be overestimated, often reaching 1. The MRSS effectively solves this problem. Notably, as a result of incorporating the MCP.Var algorithm, the MRSS algorithm amplifies certain partial correlations that are not significant under MCP.Reg2; these results can also be seen in Table 5. The MRSS estimates these to be stronger partial correlations, resulting in improved clarity of the partial correlation structure.
Figure 4 shows the stocks’ Pcor network for the top-100 and top-50 pairs of Pcor estimates obtained by the MRSS algorithm from the 100 SSE stocks. Each node represents a stock, coloured by its sector. The edge thickness represents the Pcor estimate between two nodes, with thicker edges indicating larger Pcor estimates and thinner edges smaller ones.
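Such a network can be assembled, for instance, with the igraph package, assuming the estimated Pcor matrix P from the sketch above with tickers as row and column names; the edge-width scaling is illustrative only:

```r
library(igraph)

# Build the top-k Pcor network from the estimated matrix P (tickers as names).
top_pcor_graph <- function(P, k = 100) {
  idx <- which(upper.tri(P), arr.ind = TRUE)
  edges <- data.frame(from = rownames(P)[idx[, 1]],
                      to   = rownames(P)[idx[, 2]],
                      pcor = P[idx])
  edges <- edges[order(-edges$pcor), ][seq_len(min(k, nrow(edges))), ]
  g <- graph_from_data_frame(edges, directed = FALSE)
  E(g)$width <- 5 * E(g)$pcor          # edge thickness proportional to Pcor
  g
}
```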
Table 5 shows the stock pairs, with their sectors and Pcor estimates, for all pairs whose MRSS-estimated Pcor exceeds the selected threshold among the 100 SSE stocks, and Table 6 shows the corresponding stock pairs with their company names, businesses, and sectors. Here, we use industry classifications from the Global Industry Classification Standard (GICS): Communication Services, Consumer Discretionary (C.D.), Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology (I.T.), Materials, Real Estate and Utilities. We find that two stocks connected in the partial correlation network with a high Pcor are almost always in the same sector and operate in the same business. In addition, high Pcor values may indicate shareholding relationships between companies. For instance, the highly correlated group 601398–601939–601288–601988–601328 (Financials) are all state-controlled banks, which do not have a direct high-Pcor link with the city banks 601009–601166 (Financials). Stocks that do not belong to the same industry yet share a high Pcor may have certain other links behind them; for example, 601519 (I.T.) and 601700 (Industrials) have a common major shareholder. After stripping out the other factors influencing the market, the Pcor represents the inherent, intrinsic correlation between two stocks, typically because they are in the same sector.
As societies become increasingly integrated, the productive activities of different industries become interdependent and interact with each other. Categorising a company into only one industry does not reflect its overall performance and associated risks. Many listed companies in the stock market belong to conglomerates and operate in different industry sectors, so it is natural for the performance of these companies to be affected by multiple industries. Therefore, we also find that the Pcor, apart from showing the correlation within industries, reveals the correlation between two industries that are linked by two stocks in different industries. For example, the notable partial correlation between the Bank of Communications (601328) and PetroChina (601857) links the Energy (600028–601857, in orange) and Financials (601398–601939–601288–601988–601328, in dark blue) sectors of state-owned assets.
Overall, the MRSS algorithm amalgamates the characteristics of MCP.Reg2 and MCP.Var, enhancing the estimation of strong partial correlations while effectively estimating weak ones, ultimately revealing the underlying stock correlations.