1. Introduction
Canonical correlation analysis (CCA) is a multivariate statistical method that analyzes the correlation structure between two random vectors $\mathbf{x} \in \mathbb{R}^p$ and $\mathbf{y} \in \mathbb{R}^q$. It obtains the linear transformations $\mathbf{u} = A\mathbf{x}$ and $\mathbf{v} = B\mathbf{y}$ where the only nonnull correlations are those between components of $\mathbf{u}$ and $\mathbf{v}$ with the same indices, that is, $\mathrm{corr}(u_i, v_j) = 0$ for $i \neq j$. The random vector $(u_1, v_1)$ is the first canonical pair and the correlation between its components, that is, the first canonical correlation, is the highest among all correlations between a projection of $\mathbf{x}$ and a projection of $\mathbf{y}$. Similarly, the random vector $(u_i, v_i)$ is the i-th canonical pair and the correlation between its components, that is, the i-th canonical correlation, is the highest among all correlations between a projection of $\mathbf{x}$ and a projection of $\mathbf{y}$ which are orthogonal to the previous canonical pairs, for $i = 2, \ldots, \min(p, q)$.
Canonical correlation analysis is particularly appropriate when the joint distribution of the vectors $\mathbf{x}$ and $\mathbf{y}$ is multivariate normal, but it often performs poorly when the data are nonnormal [1]. The problem has been addressed nonparametrically [2], semiparametrically [1] and parametrically [3]. In this paper we introduce a semiparametric model to investigate the nonlinear dependence structure by means of canonical correlations. Kernel canonical correlation analysis (KCCA) and distance canonical correlation analysis (DCCA) play a prominent role among nonparametric generalizations of CCA aimed at addressing nonlinear dependencies (see, e.g., [4,5]).
The main contributions of the paper are as follows. Firstly, it defines the perturbed independence distribution as a statistical model for the joint distribution of two random vectors. The proposed model is somewhat reminiscent of copula models, in that the parameters addressing the dependence structure between two random vectors do not appear in the marginal distributions of the vectors themselves; however, the generating mechanism of perturbed independence distributions is very different from those of ordinary copulas.
Secondly, the perturbed independence model allows for flexible and tractable modeling of the nonlinear dependence structure between two random vectors, since the conditional distribution of a random vector with respect to the other is skew-symmetric. The proposed model provides a parametric interpretation of KCCA and DCCA, which are commonly regarded as nonparametric multivariate methods.
Thirdly, some appealing properties of canonical correlation analysis that hold true in the normal case still hold true in the perturbed independence case. For example, the first (second) component of a canonical pair is independent of the second (first) component of any other canonical pair. Further, if the marginal distributions of the two given vectors are normal, any canonical pair is independent of any other canonical pair.
Fourthly, the paper investigates the bivariate perturbed independence models within the framework of positive and negative association. In particular, it shows that the canonical pairs obtained from a perturbed independence distribution have the desirable property of being positive quadrant dependent, under mild assumptions on the perturbing function.
The rest of the paper is structured as follows.
Section 2 defines perturbed independence distributions and states some of their probabilistic and inferential properties.
Section 3 connects perturbed independence distributions, canonical correlation analysis, positive dependence orderings and ordinal measures of association.
Section 4 uses both theoretical and empirical results to find nonlinear transformations that increase correlations.
Appendix A contains all proofs.
2. Model
This section defines the perturbed independence model, states its invariance properties and the independence properties of its canonical pairs. The theoretical results are illustrated with the bivariate distribution
$$f(x, y) = 2\,\phi(x)\,\phi(y)\,\Phi(\lambda x y), \quad x, y \in \mathbb{R},$$
introduced by [6,7], where $\phi$ and $\Phi$ denote the probability density and the cumulative distribution functions of a standard normal distribution, while $\lambda$ is a real value. Ref. [8] thoroughly investigated its properties and proposed some generalizations.
A p-dimensional random vector $\mathbf{x}$ is centrally symmetric (simply symmetric, henceforth) if there is a p-dimensional real vector $\boldsymbol{\mu}$ such that $\mathbf{x} - \boldsymbol{\mu}$ and $\boldsymbol{\mu} - \mathbf{x}$ are identically distributed ([9]). A real-valued function $\pi$ is a skewing function (also known as a perturbing function) if it satisfies the equality $\pi(\mathbf{z}) + \pi(-\mathbf{z}) = 1$ and the inequalities $0 \le \pi(\mathbf{z}) \le 1$ for any real vector $\mathbf{z}$ [10]. The probability density function of a perturbed independence model is twice the product of two symmetric probability density functions and a skewing function evaluated at a bilinear function of the outcomes. A more formal definition follows.
Definition 1. Let the joint distribution of the random vectors $\mathbf{x}$ and $\mathbf{y}$ be
$$f(\mathbf{x}, \mathbf{y}) = 2\, f_{\mathbf{x}}(\mathbf{x} - \boldsymbol{\mu}_x)\, f_{\mathbf{y}}(\mathbf{y} - \boldsymbol{\mu}_y)\, \pi\!\left[(\mathbf{x} - \boldsymbol{\mu}_x)^\top \Psi\, (\mathbf{y} - \boldsymbol{\mu}_y)\right],$$
where $f_{\mathbf{x}}$ is the pdf of a p-dimensional, centrally symmetric distribution, $f_{\mathbf{y}}$ is the pdf of a q-dimensional, centrally symmetric distribution, Ψ is a $p \times q$ matrix and $\pi$ is a function satisfying $\pi(a) + \pi(-a) = 1$ and $0 \le \pi(a) \le 1$ for any real value a. We refer to this distribution as a perturbed independence model, with components $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$, location vectors $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$, perturbing function $\pi$ and association matrix Ψ. In the bivariate distribution $2\phi(x)\phi(y)\Phi(\lambda x y)$, both components coincide with the standard normal pdf, both location vectors coincide with the origin, the perturbing function is the standard normal cdf and the association matrix is the scalar parameter $\lambda$.
Random numbers having a perturbed independence distribution can be generated in a very simple way. For the sake of simplicity, we illustrate it in the simplified case where $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$ are null vectors and $\pi$ is the cumulative distribution function of a distribution symmetric at the origin. First, generate the vectors $\mathbf{u}$ and $\mathbf{w}$ from the densities $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$. Second, generate the scalar r from the distribution whose cumulative distribution function is $\pi$. Third, let the vector $(\mathbf{x}, \mathbf{y})$ be $(\mathbf{u}, \mathbf{w})$ if the bilinear form $\mathbf{u}^\top \Psi \mathbf{w}$ is greater than r, and either $(-\mathbf{u}, \mathbf{w})$ or $(\mathbf{u}, -\mathbf{w})$ in the opposite case. Then, the distribution of $(\mathbf{x}, \mathbf{y})$ is perturbed independence with components $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$, null location vectors, perturbing function $\pi$ and association matrix Ψ.
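As an illustration, the three steps translate directly into Python. The sketch below is ours, not part of the original formulation: the function name and its sampler arguments are illustrative conveniences.

```python
import numpy as np

def rperturbed(n, sample_x, sample_y, sample_r, Psi, seed=None):
    """Draw n outcomes from a perturbed independence model with null
    location vectors. sample_x and sample_y draw from the centrally
    symmetric components, sample_r draws from the distribution whose
    cdf is the perturbing function, and Psi is the association matrix."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for _ in range(n):
        u, w = sample_x(rng), sample_y(rng)  # step 1: u ~ f_x, w ~ f_y
        r = sample_r(rng)                    # step 2: r has cdf pi
        if u @ Psi @ w > r:                  # step 3: keep (u, w) ...
            xs.append(u)
        else:                                # ... or reflect one component
            xs.append(-u)
        ys.append(w)
    return np.array(xs), np.array(ys)
```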
The bivariate distribution $2\phi(x)\phi(y)\Phi(\lambda x y)$ might be generated as follows. First, generate three mutually independent, standard normal random numbers U, W and Z. Second, set X equal to U and Y equal to W if the product $\lambda U W$ is greater than Z. Otherwise, set X equal to $-U$ and Y equal to W. Then the joint distribution of X and Y is $2\phi(x)\phi(y)\Phi(\lambda x y)$.
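For the bivariate model, the recipe reduces to a few vectorized lines; a minimal sketch (ours) with a quick sanity check on the marginals:

```python
import numpy as np

def rbiv(n, lam, seed=None):
    """Sample n pairs from 2 phi(x) phi(y) Phi(lam * x * y)."""
    rng = np.random.default_rng(seed)
    u, w, z = rng.standard_normal((3, n))
    x = np.where(lam * u * w > z, u, -u)  # flip the sign of U on rejection
    return x, w

x, y = rbiv(10_000, lam=2.0, seed=1)
print(x.mean(), x.std(), y.mean(), y.std())  # close to 0, 1, 0, 1
```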
A p-dimensional probability density function of the form $2 f_0(\mathbf{z} - \boldsymbol{\mu})\,\pi(\mathbf{z} - \boldsymbol{\mu})$ is skew-symmetric with kernel $f_0$ (i.e., a probability density function symmetric at the origin), location vector $\boldsymbol{\mu}$ and skewing function $\pi$. The function $\pi$ would be more precisely denoted by $\pi_p$, since it depends on the dimension of the corresponding random vector. However, we use $\pi$ instead of $\pi_p$ to relieve the notational burden. Ref. [11] discusses hypothesis testing on $\boldsymbol{\mu}$ for any choice of the function $\pi$. The most widely studied skew-symmetric distributions are the linearly skewed distributions, where the skewing function depends on $\mathbf{z}$ only through the linear function $\boldsymbol{\alpha}^\top \mathbf{z}$, as happens in the multivariate skew-normal case. Refs. [12,13] investigated their inferential properties. Ref. [14] used them to motivate kurtosis-based projection pursuit.
In the notation of the above definition, the first part of the following theorem states that the marginal distributions of $\mathbf{x}$ and $\mathbf{y}$ are $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$. Thus, perturbed independence distributions separately model the marginal distributions and the association between two random vectors, and constitute an alternative to copulas. The second part of the following theorem states that the conditional distribution of one vector given the other is linearly skewed. Hence, the association between the two components has an analytical form, which has been thoroughly investigated.
Theorem 1.
Let the random vectors $\mathbf{x}$ and $\mathbf{y}$ have a perturbed independence distribution with components $f_{\mathbf{x}}$, $f_{\mathbf{y}}$ and location vectors $\boldsymbol{\mu}_x$, $\boldsymbol{\mu}_y$. Then the following statements hold true.
The marginal probability density functions of $\mathbf{x}$ and $\mathbf{y}$ are $f_{\mathbf{x}}(\mathbf{x} - \boldsymbol{\mu}_x)$ and $f_{\mathbf{y}}(\mathbf{y} - \boldsymbol{\mu}_y)$.
The conditional probability density functions of $\mathbf{x}$ given $\mathbf{y} = \mathbf{y}_0$ and of $\mathbf{y}$ given $\mathbf{x} = \mathbf{x}_0$ are skew-symmetric with kernels $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$, while the associated location vectors are $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$.
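Why the perturbation leaves the marginals untouched can be seen in one line (a sketch with null location vectors, using the symmetry of $f_{\mathbf{y}}$ and the identity $\pi(a) + \pi(-a) = 1$): averaging the integral below with its image under the substitution $\mathbf{y} \mapsto -\mathbf{y}$ gives
$$\int_{\mathbb{R}^q} 2\, f_{\mathbf{x}}(\mathbf{x})\, f_{\mathbf{y}}(\mathbf{y})\, \pi(\mathbf{x}^\top \Psi \mathbf{y})\, d\mathbf{y} = f_{\mathbf{x}}(\mathbf{x}) \int_{\mathbb{R}^q} f_{\mathbf{y}}(\mathbf{y}) \left[ \pi(\mathbf{x}^\top \Psi \mathbf{y}) + \pi(-\mathbf{x}^\top \Psi \mathbf{y}) \right] d\mathbf{y} = f_{\mathbf{x}}(\mathbf{x}).$$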
The marginal distributions of $(X, Y) \sim 2\phi(x)\phi(y)\Phi(\lambda x y)$ are standard normal: $X \sim N(0, 1)$ and $Y \sim N(0, 1)$. The conditional distributions are skew-normal: the probability density functions of $X \mid Y = y$ and of $Y \mid X = x$ are $2\phi(x)\Phi(\lambda x y)$ and $2\phi(y)\Phi(\lambda x y)$. The sign of the correlation between X and Y is the same as the sign of $\lambda$, but the two random variables are nonlinearly dependent [7]:
$$\mathrm{corr}(X, Y) = \sqrt{\frac{2}{\pi}}\; \mathbb{E}\!\left[\frac{\lambda X^2}{\sqrt{1 + \lambda^2 X^2}}\right].$$
There is a close connection between order statistics and skew-normal distributions and their generalizations. For example, any linear combination of the minimum and the maximum of a bivariate, exchangeable and elliptical random vector is skew-elliptical [15]. In particular, any skew-normal distribution might be represented as the maximum or the minimum of a bivariate, normal and exchangeable random vector. At present, it is not clear whether there exists a meaningful connection between order statistics and perturbed independence distributions, which would ease both the interpretation and the application of these distributions.
The mean vector and the covariance matrix of the data matrix Z are statistically independent, if the rows of Z are a random sample from a multivariate normal distribution. As a direct consequence, the components of the pairs $(\bar{\mathbf{x}}, S_x)$ and $(\bar{\mathbf{y}}, S_y)$ are statistically independent, too, where $\bar{\mathbf{x}}$ and $S_x$ ($\bar{\mathbf{y}}$ and $S_y$) are the mean vector and the covariance matrix of X (Y), that is the data matrix whose columns coincide with the first p (the last q) columns of Z. The same property holds true for perturbed independence models, as a corollary of the following theorem.
Theorem 2. Let the random vectors $\mathbf{x}$ and $\mathbf{y}$ have the perturbed independence distribution with location vectors $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$. Then any even function of $\mathbf{x} - \boldsymbol{\mu}_x$ is independent of $\mathbf{y}$. Similarly, any even function of $\mathbf{y} - \boldsymbol{\mu}_y$ is independent of $\mathbf{x}$.
Let the joint distribution of the random variables X and Y be $2\phi(x)\phi(y)\Phi(\lambda x y)$. Then Y and $|X|$ are mutually independent. Similarly, X and $|Y|$ are mutually independent.
The components of the canonical covariates $\mathbf{u}$ and $\mathbf{v}$ are uncorrelated when their indices differ:
$$\mathrm{corr}(u_i, v_j) = 0, \quad i \neq j.$$
A p-dimensional random vector $\mathbf{x}$ is said to be sign-symmetric if there is a p-dimensional real vector $\boldsymbol{\mu}$ such that $\mathbf{x} - \boldsymbol{\mu}$ and $D(\mathbf{x} - \boldsymbol{\mu})$ are identically distributed, where D is any $p \times p$ diagonal matrix whose diagonal elements are either 1 or $-1$ [9]. For example, spherical random vectors are sign-symmetric. The following theorem shows that the canonical covariates belonging to different canonical vectors and with different indices are independent, if the joint distribution of the original variables is perturbed independence with sign-symmetric components.
Theorem 3. Let the random vectors $\mathbf{x}$ and $\mathbf{y}$ have a perturbed independence distribution with sign-symmetric components. Further, let $\mathbf{u} = (u_1, \ldots, u_p)^\top$ and $\mathbf{v} = (v_1, \ldots, v_q)^\top$ be the canonical covariates of $\mathbf{x}$ and $\mathbf{y}$. Then $u_i$ and $v_j$ are independent when $i \neq j$.
Under normal sampling, the components of different canonical pairs are statistically independent. The following corollary of the above theorem shows that the same property still holds true when the original variables have a perturbed independence distribution with normal components.
Corollary 1. Let the random vectors $\mathbf{x}$ and $\mathbf{y}$ have a perturbed independence distribution with normal components. Further, let $\mathbf{u}$ and $\mathbf{v}$ be the canonical covariates of $\mathbf{x}$ and $\mathbf{y}$. Then each of the variables $u_i$ and $v_i$ is independent of each of the variables $u_j$ and $v_j$ when $i \neq j$.
As remarked by [16], the default measures of multivariate skewness and kurtosis are those introduced by [17]. Mardia’s skewness is the sum of all squared, third-order, standardized moments, while Mardia’s kurtosis is the fourth moment of the Mahalanobis distance of the random vector from its mean.
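Both indices have simple sample counterparts; the following numpy helper (ours, for illustration) computes Mardia’s kurtosis as the average fourth power of the sample Mahalanobis distances, which is close to p(p + 2) under p-variate normality, e.g., 8 when p = 2.

```python
import numpy as np

def mardia_kurtosis(X):
    """Average fourth power of the Mahalanobis distances of the rows
    of the data matrix X from their sample mean."""
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)  # squared distances
    return (d2 ** 2).mean()
```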
Mardia’s kurtosis of $2\phi(x)\phi(y)\Phi(\lambda x y)$ increases with the squared correlation between X and Y ([18]).
It is tempting to generalize $2\phi(x)\phi(y)\Phi(\lambda x y)$ by letting $f(x, y, z) = 2\phi(x)\phi(y)\phi(z)\Phi(\lambda x y z)$, as performed in [6,7]. Unfortunately, this model does not preserve the nonlinear associations between pairs of its components. For example, the joint bivariate marginals of the trivariate distribution $2\phi(x)\phi(y)\phi(z)\Phi(\lambda x y z)$ are bivariate, standard normal random vectors [19]. Other generalizations of $2\phi(x)\phi(y)\Phi(\lambda x y)$ have been proposed by [8].
Let $f_{\mathbf{x}\mathbf{y}}$ be the joint probability density function of the p-dimensional random vector $\mathbf{x}$ and of the q-dimensional random vector $\mathbf{y}$. Further, let $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$ be the marginal probability density functions of $\mathbf{x}$ and $\mathbf{y}$. The distance covariance between $\mathbf{x}$ and $\mathbf{y}$ with respect to the weight function w is
$$\mathcal{V}^2(\mathbf{x}, \mathbf{y}; w) = \int_{\mathbb{R}^{p+q}} \left| \varphi_{\mathbf{x}\mathbf{y}}(\mathbf{t}, \mathbf{s}) - \varphi_{\mathbf{x}}(\mathbf{t})\,\varphi_{\mathbf{y}}(\mathbf{s}) \right|^2 w(\mathbf{t}, \mathbf{s})\, d\mathbf{t}\, d\mathbf{s},$$
where $\varphi_{\mathbf{x}\mathbf{y}}$, $\varphi_{\mathbf{x}}$ and $\varphi_{\mathbf{y}}$ are the characteristic functions of $(\mathbf{x}, \mathbf{y})$, $\mathbf{x}$ and $\mathbf{y}$ [20]. If the joint distribution of $\mathbf{x}$ and $\mathbf{y}$ is a perturbed independence model with components $f_{\mathbf{x}}$ and $f_{\mathbf{y}}$, location vectors $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$ (taken to be null without loss of generality), perturbing function $\pi$ and association matrix Ψ, we have
$$f_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y}) - f_{\mathbf{x}}(\mathbf{x})\, f_{\mathbf{y}}(\mathbf{y}) = f_{\mathbf{x}}(\mathbf{x})\, f_{\mathbf{y}}(\mathbf{y}) \left[ 2\pi\!\left(\mathbf{x}^\top \Psi \mathbf{y}\right) - 1 \right].$$
A little algebra leads to the identities
$$f_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y}) - f_{\mathbf{x}\mathbf{y}}(-\mathbf{x}, \mathbf{y}) = 2\left[ f_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y}) - f_{\mathbf{x}}(\mathbf{x})\, f_{\mathbf{y}}(\mathbf{y}) \right] = f_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y}) - f_{\mathbf{x}\mathbf{y}}(\mathbf{x}, -\mathbf{y}).$$
Hence, for perturbed independence models, the deviation of the joint density from independence is just half the difference between $f_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$ and $f_{\mathbf{x}\mathbf{y}}(-\mathbf{x}, \mathbf{y})$, which are the probability density functions of $(\mathbf{x}, \mathbf{y})$ and $(-\mathbf{x}, \mathbf{y})$. In particular, if the joint distribution of the random variables X and Y is $2\phi(x)\phi(y)\Phi(\lambda x y)$, we have
$$f_{XY}(x, y) - \phi(x)\,\phi(y) = \phi(x)\,\phi(y)\left[ 2\Phi(\lambda x y) - 1 \right].$$
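For data, the distance covariance associated with the weight function adopted by [20] has a well-known sample form based on double-centered Euclidean distance matrices. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def _dcenter(x):
    """Double-centered Euclidean distance matrix of a sample."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    return d - d.mean(0) - d.mean(1)[:, None] + d.mean()

def distance_corr(x, y):
    """Sample distance correlation of two samples of equal length."""
    A, B = _dcenter(x), _dcenter(y)
    dcov2 = max((A * B).mean(), 0.0)  # sample squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0
```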
3. Concordance
This section investigates the bivariate perturbed independence models within the framework of positive and negative association. In particular, it shows that the canonical pairs obtained from a perturbed independence distribution have the desirable property of being positive quadrant dependent, under mild assumptions on the perturbing function. The seminal paper by [21] started a vast literature on dependence orderings and their connections with ordinal measures of association. For the sake of brevity, here we mention only some thorough reviews of the concepts in this section: [22,23,24,25,26,27].
Two random variables are said to be either concordant, positively associated or positively dependent if larger (smaller) outcomes of one of them often occur together with larger (smaller) outcomes of the other random variable. Conversely, two random variables are said to be either discordant, negatively associated or negatively dependent if larger (smaller) outcomes of one of them often occur together with smaller (larger) outcomes of the other random variable. For example, financial returns from different markets are known to be positively dependent (see, e.g., [28,29,30]). The degree of concordance or discordance is assessed with ordinal measures of association, of which the most commonly used are Pearson’s correlation (simply correlation, for short), Spearman’s rho and Kendall’s tau.
The correlation is the best known measure of ordinal association. The correlation between two random variables X and Y is
$$\rho(X, Y) = \frac{\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sqrt{\mathbb{E}\left[(X - \mu_X)^2\right]\, \mathbb{E}\left[(Y - \mu_Y)^2\right]}},$$
where $\mu_X$ and $\mu_Y$ are the expectations of X and Y. The ordinal association between two random variables might be decomposed into a linear component and a nonlinear component. The linear component refers to the tendency of the random variables to deviate from their means in a proportional way. The correlation only detects and measures the linear component of the ordinal association. When the nonlinear component is not negligible, the information conveyed by the correlation needs to be integrated with information from other measures of ordinal association.
Spearman’s rho, also known as Spearman’s correlation, between the random variables X and Y is the correlation between the two variables after being transformed according to their marginal cumulative distribution functions:
$$\rho_S(X, Y) = \rho\left[F_X(X), F_Y(Y)\right],$$
where $F_X$ and $F_Y$ are the marginal cumulative distribution functions of X and Y. Its sample counterpart is the correlation between the observed ranks. Spearman’s rho is a measure of ordinal association detecting both linear and nonlinear dependence. It is also more robust to outliers than Pearson’s correlation.
Kendall’s tau, also known as Kendall’s correlation, between two random variables is the difference between their probability of concordance and their probability of discordance. The former (latter) is the probability that the difference between the first components of two independent outcomes from a bivariate distribution has the same sign as (a different sign than) the difference between the second components of the same pairs. More formally, Kendall’s tau between the random variables X and Y is
$$\tau(X, Y) = \mathbb{P}\left[(X_1 - X_2)(Y_1 - Y_2) > 0\right] - \mathbb{P}\left[(X_1 - X_2)(Y_1 - Y_2) < 0\right],$$
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are two independent outcomes from the bivariate random vector $(X, Y)$. Just like Spearman’s rho, Kendall’s tau is an ordinal measure of association detecting linear as well as nonlinear dependence and is more robust to outliers than Pearson’s correlation.
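On data simulated from the bivariate model of Section 2, the three measures are immediate to compare with scipy (a usage sketch; the sampler is the one described in Section 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
u, w, z = rng.standard_normal((3, 10_000))
lam = 2.0
x, y = np.where(lam * u * w > z, u, -u), w  # sampler from Section 2

print("Pearson :", stats.pearsonr(x, y)[0])    # linear component only
print("Spearman:", stats.spearmanr(x, y)[0])   # correlation of the ranks
print("Kendall :", stats.kendalltau(x, y)[0])  # concordance - discordance
```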
Unfortunately, Pearson’s correlation, Spearman’s rho and Kendall’s tau might take different signs, thus making it difficult to measure ordinal association. In order to prevent this from happening, it is convenient to impose some constraints on the bivariate distribution. The distribution of a bivariate random vector $(X, Y)$ is said to be positively quadrant dependent (PQD) if its joint cdf is greater than or equal to the product of the marginal cdfs:
$$\mathbb{P}(X \le x, Y \le y) \ge \mathbb{P}(X \le x)\, \mathbb{P}(Y \le y)$$
for any two real values x and y. Similarly, the distribution of a bivariate random vector $(X, Y)$ is said to be negatively quadrant dependent (NQD) if its joint cdf is smaller than or equal to the product of the marginal cdfs:
$$\mathbb{P}(X \le x, Y \le y) \le \mathbb{P}(X \le x)\, \mathbb{P}(Y \le y)$$
for any two real values x and y. Pearson’s correlation, Spearman’s rho and Kendall’s tau of PQD (NQD) distributions are either null or have positive (negative) signs.
Independent random variables are special cases of PQD and NQD random variables. In order to rule this case out, the PQD and NQD conditions can be made more restrictive by requiring that the above inequalities be strict for measurable sets of x and y values. For example, a strictly positive quadrant dependent pair of random variables satisfies the inequality
$$\mathbb{P}(X \le x, Y \le y) > \mathbb{P}(X \le x)\, \mathbb{P}(Y \le y)$$
for any two real values x and y belonging to a given interval of positive length. Pearson’s correlation, Spearman’s rho and Kendall’s tau of strictly positive (negative) quadrant dependent distributions have positive (negative) signs. As shown in the following theorem, a bivariate perturbed independence model is strictly positive (negative) quadrant dependent if the perturbing function is a cumulative distribution function and the association parameter is a positive (negative) scalar.
Theorem 4. Let the joint distribution of the random variables X and Y be perturbed independence with components $f_X$ and $f_Y$, perturbing function $\pi$ and association parameter λ: $f(x, y) = 2 f_X(x)\, f_Y(y)\, \pi(\lambda x y)$. Further, let $\pi$ be the cumulative distribution function of a symmetric distribution. Then the random variables X and Y are strictly positive (negative) quadrant dependent when λ is positive (negative).
The joint distribution $2\phi(x)\phi(y)\Phi(\lambda x y)$ of the bivariate random vector $(X, Y)$ introduced in the previous section fulfills the assumptions of Theorem 4. In particular, if the association parameter λ is positive, the random variables X and Y are strictly positive quadrant dependent:
$$\mathbb{P}(X \le a, Y \le b) > \mathbb{P}(X \le a)\, \mathbb{P}(Y \le b)$$
for any two real values a and b. As a direct consequence, their Pearson’s correlation $\rho$, their Spearman’s rho $\rho_S$ and their Kendall’s tau $\tau$ are positive.
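The strict PQD inequality is easy to probe empirically by comparing the empirical joint cdf with the product of the empirical marginal cdfs on a grid; a sketch (ours) for a positive association parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
u, w, z = rng.standard_normal((3, 50_000))
x, y = np.where(1.5 * u * w > z, u, -u), w     # lambda = 1.5 > 0

for a in np.linspace(-2, 2, 9):
    for b in np.linspace(-2, 2, 9):
        joint = ((x <= a) & (y <= b)).mean()   # empirical P(X<=a, Y<=b)
        prod = (x <= a).mean() * (y <= b).mean()
        assert joint >= prod - 0.01            # PQD up to sampling noise
```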
Pearson’s correlation between the components of a canonical pair is nonnegative. However, within a nonparametric framework, their Spearman’s rho and their Kendall’s tau can take any sign. When Pearson’s correlation between the components of a canonical pair is positive but their Spearman’s rho and their Kendall’s tau are negative, the former ordinal association measure becomes quite unreliable and canonical correlation analysis provides little insight into the dependence structure. This problem does not occur under a perturbed independence model satisfying the assumptions stated in the following theorem.
Theorem 5. Let $(u_1, v_1)$, …, $(u_k, v_k)$, with $k = \min(p, q)$, be the canonical pairs obtained from a perturbed independence distribution, and let their density be
$$f(\mathbf{u}, \mathbf{v}) = 2\, f_{\mathbf{u}}(\mathbf{u})\, f_{\mathbf{v}}(\mathbf{v})\, \pi(\lambda_1 u_1 v_1 + \cdots + \lambda_k u_k v_k),$$
where $\pi$ is a strictly increasing perturbing function. Then the joint distribution of the i-th canonical pair is a bivariate perturbed independence model:
$$f(u_i, v_i) = 2\, f_{u_i}(u_i)\, f_{v_i}(v_i)\, \pi_i(\lambda_i u_i v_i),$$
where $\pi_i$ is a strictly increasing perturbing function. We illustrate the above theorem with the perturbed independence distribution
$$f(\mathbf{x}, \mathbf{y}) = 2\, \phi_q(\mathbf{x}; \Sigma)\, \phi_q(\mathbf{y}; \Sigma)\, \pi(\mathbf{x}^\top \Psi \mathbf{y}),$$
where $\phi_q(\cdot\,; \Sigma)$ is the q-dimensional normal density with null mean vector and covariance matrix Σ, $\pi$ is the cdf of a continuous distribution symmetric at the origin and Ψ is a symmetric $q \times q$ matrix. The distribution of the canonical variates $\mathbf{u}$ and $\mathbf{v}$ is
$$f(\mathbf{u}, \mathbf{v}) = 2\, \phi_q(\mathbf{u}; I_q)\, \phi_q(\mathbf{v}; I_q)\, \pi(\lambda_1 u_1 v_1 + \cdots + \lambda_q u_q v_q),$$
which fulfills the assumptions in Theorem 5. Then the joint distribution of the i-th canonical pair $(u_i, v_i)$ is
$$f(u_i, v_i) = 2\, \phi(u_i)\, \phi(v_i)\, \pi_i(\lambda_i u_i v_i),$$
where $\phi$ is the pdf of a univariate, standard normal distribution and $\pi_i$ is the cdf of a continuous distribution symmetric at the origin. By Theorems 4 and 5, and since the i-th canonical correlation is nonnegative, the association parameter $\lambda_i$, Kendall’s tau $\tau_i$ and Spearman’s rho $\rho_{S_i}$ are nonnegative, too. Moreover, if the i-th canonical correlation is positive, the association parameter, Kendall’s tau and Spearman’s rho are positive, too.
4. Nonlinearity
As a desirable property, CCA decomposes the covariance matrix between the p-dimensional random vector $\mathbf{x}$ and the q-dimensional random vector $\mathbf{y}$ into linear combinations of the covariances between uncorrelated linear functions of $\mathbf{x}$ and $\mathbf{y}$. Ref. [31] thoroughly investigates the interpretation of CCA within the framework of linear dependence. The first output of CCA is the pair of linear combinations of $\mathbf{x}$ and $\mathbf{y}$ which are maximally correlated:
$$\left(\mathbf{a}_1, \mathbf{b}_1\right) = \arg\max_{\mathbf{a} \in A,\ \mathbf{b} \in B} \mathrm{corr}\!\left(\mathbf{a}^\top \mathbf{x},\ \mathbf{b}^\top \mathbf{y}\right),$$
where A and B are the sets of p-dimensional and q-dimensional nonnull, real vectors.
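In sample terms, all canonical correlations and directions follow from whitening each block and taking a singular value decomposition of the whitened cross-covariance; a minimal numpy sketch (ours, not tied to any particular package):

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and direction vectors from the data
    matrices X (n x p) and Y (n x q)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, Sxy.T).T)  # whitened Sxy
    U, rho, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U)     # columns: canonical vectors for x
    B = np.linalg.solve(Ly.T, Vt.T)  # columns: canonical vectors for y
    return rho, A, B
```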
As mentioned in the Introduction and in the previous section, both the interpretability and the usefulness of CCA are severely diminished by nonlinear dependencies between $\mathbf{x}$ and $\mathbf{y}$. A solution would be to look for the linear and nonlinear transformations of $\mathbf{x}$ and $\mathbf{y}$ which are maximally correlated:
$$\left(g_1, h_1, \mathbf{a}_1, \mathbf{b}_1\right) = \arg\max_{g, h \in M,\ \mathbf{a} \in A,\ \mathbf{b} \in B} \mathrm{corr}\!\left[g\!\left(\mathbf{a}^\top \mathbf{x}\right),\ h\!\left(\mathbf{b}^\top \mathbf{y}\right)\right],$$
where M is the set of all real-valued monotonic functions. In the general case, the maximization needs to be performed simultaneously with respect to the nonlinear functions g, h and the real vectors $\mathbf{a}$, $\mathbf{b}$, thus being difficult to compute and difficult to interpret. Ref. [1] addressed the problem by proposing the Gaussian copula model, where the components of $\mathbf{x}$ and $\mathbf{y}$ have a joint distribution that is multivariate normal, after being transformed according to monotonic and nonlinear functions. However, these monotonic transformations do not have a clear interpretation and they are not guaranteed to increase the correlations.
Perturbed independence models do not suffer from these limitations. Firstly, the monotonic transformations have a simple interpretation, being the expectations of one variable conditioned on the other. Secondly, the same transformations are guaranteed to increase the correlations, under mild assumptions. These statements are made more precise in the following theorem.
Theorem 6. Let the joint distribution of the random variables X and Y be perturbed independence with null location parameters, nonnull association parameter and increasing perturbing function. Finally, let X and Y have finite second moments. Then the conditional expectation $\mathbb{E}(Y \mid X = x)$ is a monotone, odd and nonlinear function, while the correlation between Y and X is smaller than the correlation between $\mathbb{E}(Y \mid X)$ and $\mathbb{E}(X \mid Y)$.
We illustrate the above theorem with the distribution $2\phi(x)\phi(y)\Phi(\lambda x y)$ of the bivariate random vector $(X, Y)$ introduced in Section 2. The conditional expectations of Y and X with respect to the outcomes x of X and y of Y are
$$\mathbb{E}(Y \mid X = x) = \sqrt{\frac{2}{\pi}}\, \frac{\lambda x}{\sqrt{1 + \lambda^2 x^2}} \quad \text{and} \quad \mathbb{E}(X \mid Y = y) = \sqrt{\frac{2}{\pi}}\, \frac{\lambda y}{\sqrt{1 + \lambda^2 y^2}},$$
so that the nonlinear functions of X and Y maximally correlated with Y and X are proportional to
$$g(x) = \frac{\lambda x}{\sqrt{1 + \lambda^2 x^2}} \quad \text{and} \quad h(y) = \frac{\lambda y}{\sqrt{1 + \lambda^2 y^2}}.$$
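The correlation increase guaranteed by Theorem 6 is easy to verify by simulation, using g and h as above (a sketch of ours, under the illustrative choice λ = 2):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0
u, w, z = rng.standard_normal((3, 10_000))
x, y = np.where(lam * u * w > z, u, -u), w

g = lam * x / np.sqrt(1 + (lam * x) ** 2)  # proportional to E(Y | X = x)
h = lam * y / np.sqrt(1 + (lam * y) ** 2)  # proportional to E(X | Y = y)

print(np.corrcoef(x, y)[0, 1])  # correlation of the raw variables
print(np.corrcoef(g, h)[0, 1])  # larger after the transformations
```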
The above theorem does not guarantee that $g(X) = \mathbb{E}(Y \mid X)$ is the nonlinear transformation of one component that is maximally correlated with Y, nor that such correlation is smaller than the correlation between $g(X)$ and $h(Y)$. We empirically address this point by simulating n = 10,000 bivariate data from $2\phi(x)\phi(y)\Phi(\lambda x y)$, where λ takes increasing positive values. The left-hand scatterplots in Figure 1 clearly hint at positive dependence: more points lie in the first and in the third quadrants as the association parameter increases, despite the absence of the ellipsoidal shapes associated with bivariate normality. For each simulated sample, we computed Kendall’s tau, Spearman’s rho and Pearson’s correlation and report their values in Table 1. The three measures of ordinal association are positive and they increase with the association parameter, consistently with the theoretical results in Section 3. More surprisingly, Spearman’s rho is always greater than Kendall’s tau and Pearson’s correlation, unlike in the bivariate normal distribution, where Pearson’s correlation is always greater than Kendall’s tau and Spearman’s rho.
Finally, for each simulated sample $(x_1, y_1)$, …, $(x_n, y_n)$, we computed Pearson’s correlation between $g(x_1)$, …, $g(x_n)$ and $h(y_1)$, …, $h(y_n)$, where g and h are proportional to the sample counterparts of the expectation of Y given $X = x$ and of X given $Y = y$ under the model $2\phi(x)\phi(y)\Phi(\lambda x y)$. For each simulated sample, these correlations are always greater than the correlations between the original data, consistently with Theorem 6. Moreover, Pearson’s correlations between $g(x_1)$, …, $g(x_n)$ and $h(y_1)$, …, $h(y_n)$ are always greater than their Spearman’s correlations. As shown in the right-hand scatterplots of Figure 1 and Figure 2, the transformed data lie at the lower left corner and at the upper right corner of a square. This pattern becomes more evident as the association parameter increases. The histograms of $g(x_1)$, …, $g(x_n)$ in Figure 2 are symmetric and bimodal, with both modes at the ends of the observed range. Bimodality becomes more evident as the association parameter increases. The behavior of the transformed data $h(y_1)$, …, $h(y_n)$ is virtually identical and therefore is not reported.
We conclude that perturbed independence distributions, by modeling the nonlinear association between random variables, might help in finding the nonlinear transformations that are maximally correlated with each other. A positive Pearson’s correlation much lower than Spearman’s rho and Kendall’s tau hints at the presence of nonlinear association, whose analytical form might be estimated by looking for the maximally correlated nonlinear transformations of the random variables. This approach is particularly appropriate for the single index regression model $Y = g(X) + \varepsilon$, where the response variable Y is the sum of a smooth function of the predictor X and the error term $\varepsilon$. When g is monotone, its analytical form might be estimated by looking for the transformation of X that is maximally correlated with Y.
As remarked in the Introduction, kernel canonical correlation analysis (KCCA) and distance canonical correlation analysis (DCCA) are the two most popular generalizations of CCA aimed at dealing with nonlinear dependencies. A formal description of KCCA, based on Hilbert spaces and their inner products, might be found in the seminal papers by [32,33]. For most practical purposes, KCCA might be defined as the statistical method searching for linear projections of nonlinear functions of a random vector that are maximally correlated with linear projections of nonlinear functions of another random vector. Let G be a class of p-dimensional random vectors whose i-th components are nonlinear functions of the p-dimensional random vector $\mathbf{x}$. Similarly, let H be a class of q-dimensional random vectors whose i-th components are nonlinear functions of the q-dimensional random vector $\mathbf{y}$. Then KCCA looks for the random vectors $\mathbf{g} \in G$, $\mathbf{h} \in H$ and for the real vectors $\mathbf{a} \in \mathbb{R}^p$, $\mathbf{b} \in \mathbb{R}^q$ such that $\mathbf{a}^\top \mathbf{g}$ and $\mathbf{b}^\top \mathbf{h}$ are maximally correlated with each other.
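In practice, KCCA is computed from centered kernel matrices. The following numpy sketch uses a Gaussian kernel and one common ridge-regularized formulation; the kernel width gamma and the regularization kappa are our own illustrative choices, not prescriptions from [32,33]:

```python
import numpy as np

def rbf(X, gamma=0.5):
    """Gaussian kernel matrix of the rows of X."""
    sq = (X ** 2).sum(1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kcca_first(X, Y, gamma=0.5, kappa=0.1):
    """First kernel canonical correlation and the scores of both views."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Kx, Ky = H @ rbf(X, gamma) @ H, H @ rbf(Y, gamma) @ H
    A, B = Kx + kappa * np.eye(n), Ky + kappa * np.eye(n)  # ridge terms
    C = np.linalg.solve(A, Kx) @ np.linalg.solve(B, Ky).T
    U, s, Vt = np.linalg.svd(C)          # top singular triplet
    alpha, beta = np.linalg.solve(A, U[:, 0]), np.linalg.solve(B, Vt[0])
    return s[0], Kx @ alpha, Ky @ beta
```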
In a nonparametric framework, the choice of the nonlinear functions may not be straightforward. On the other hand, in the perturbed independence framework, the theoretical and empirical results in this section suggest setting them equal to the conditional expectations $\mathbb{E}(\mathbf{y} \mid \mathbf{x})$ and $\mathbb{E}(\mathbf{x} \mid \mathbf{y})$. In particular, for the perturbed independence model $2\phi(x)\phi(y)\Phi(\lambda x y)$, the suggested nonlinear functions of X and Y are proportional to
$$g(x) = \frac{\lambda x}{\sqrt{1 + \lambda^2 x^2}} \quad \text{and} \quad h(y) = \frac{\lambda y}{\sqrt{1 + \lambda^2 y^2}}.$$
DCCA looks for two projections whose joint distribution differs the most from the product of their marginal distributions, where the difference is measured by distance correlation. The distance correlation between the random variables X and Y with respect to the weight function w is
$$\mathcal{R}(X, Y; w) = \frac{\mathcal{V}(X, Y; w)}{\sqrt{\mathcal{V}(X, X; w)\, \mathcal{V}(Y, Y; w)}},$$
where $\mathcal{V}(X, Y; w)$ is the distance covariance between X and Y with respect to w, as defined in the previous section. Hence the first distance canonical correlation between the p-dimensional random vector $\mathbf{x}$ and the q-dimensional random vector $\mathbf{y}$ is
$$\max_{\mathbf{a} \in A,\ \mathbf{b} \in B} \mathcal{R}\!\left(\mathbf{a}^\top \mathbf{x}, \mathbf{b}^\top \mathbf{y}; w\right).$$
For other distance canonical correlations, the distance canonical pairs and the distance canonical transformations are defined similarly to their CCA analogues.
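Lacking a closed form, the first distance canonical pair can be approximated by searching over projection directions. A crude random-search sketch (ours, illustrative only), reusing the distance_corr helper sketched in Section 2:

```python
import numpy as np

def dcca_first(X, Y, n_trials=500, seed=0):
    """Approximate the first distance canonical pair by random search
    over unit projection vectors."""
    rng = np.random.default_rng(seed)
    best = (-1.0, None, None)
    for _ in range(n_trials):
        a = rng.standard_normal(X.shape[1]); a /= np.linalg.norm(a)
        b = rng.standard_normal(Y.shape[1]); b /= np.linalg.norm(b)
        r = distance_corr(X @ a, Y @ b)  # helper defined in Section 2
        if r > best[0]:
            best = (r, a, b)
    return best
```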
A natural question to ask is whether CCA and DCCA lead to identical projections, under the assumption of perturbed independence. At present, we are unable to either prove or disprove this statement, which we conjecture to be true, under the assumptions of Theorem 6: increasing perturbing functions that increase more steeply are more likely to imply both higher Pearson and distance correlations. We plan to investigate this conjecture by means of both theoretical arguments and simulation studies.