Next Article in Journal
Observation and Analysis of Ejector Hysteresis Phenomena in the Hydrogen Recirculation Subsystem of PEMFCs
Previous Article in Journal
An Entropy-Based Condition Monitoring Strategy for the Detection and Classification of Wear Levels in Gearboxes
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Extending Hilbert–Schmidt Independence Criterion for Testing Conditional Independence

Graduate School of Engineer Science, Osaka University, Toyonaka 560-0043, Japan
*
Author to whom correspondence should be addressed.
Entropy 2023, 25(3), 425; https://doi.org/10.3390/e25030425
Submission received: 26 January 2023 / Revised: 22 February 2023 / Accepted: 25 February 2023 / Published: 26 February 2023
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

:
The Conditional Independence (CI) test is a fundamental problem in statistics. Many nonparametric CI tests have been developed, but a common challenge exists: the current methods perform poorly with a high-dimensional conditioning set. In this paper, we considered a nonparametric CI test using a kernel-based test statistic, which can be viewed as an extension of the Hilbert–Schmidt Independence Criterion (HSIC). We propose a local bootstrap method to generate samples from the null distribution H 0 : X Y Z . The experimental results showed that our proposed method led to a significant performance improvement compared with previous methods. In particular, our method performed well against the growth of the dimension of the conditioning set. Meanwhile, our method can be computed efficiently against the growth of the sample size and the dimension of the conditioning set.

1. Introduction

The Conditional Independence (CI) test is a statistical hypothesis test that examines whether variables X and Y are conditionally independent given another variable Z, denoted as X Y Z , when we observe the actual values of the three variables. The CI test plays a critical role in Bayesian network structure learning [1,2] and causal discovery [3].
The task is relatively easy when the sample size n is large and the variable Z is discrete, because then, we can test the independence of X , Y for each value of Z [4]. On the other hand, if X , Y , Z have a joint Gaussian distribution, then the CI reduces to a zero partial correlation between X and Y given Z [5], which can also be easily tested. In this paper, we considered X , Y , Z without making any assumption on the joint distribution. X , Y , Z can be either continuous or discrete variables. The problem becomes challenging with a growing dimension d Z due to the curse of dimensionality [6], when Z may be a set of d Z variables or any d Z -dimensional random vector.
Another major challenge in CI tests is the need to sample from the null distribution H 0 : X Y Z . In general, statistical hypothesis tests require us to obtain the distribution of the test statistic under the null hypothesis H 0 . However, when we are only given the observations, the exact distribution for any test statistic under the CI case ( H 0 : X Y Z ) is unknown. The two approaches below are the most-popular ways to obtain an approximated null distribution:
Permutation method:
One approach is by permuting the observed samples. In the independence test, where H 0 : X Y , though X and Y in each pair ( x 1 , y 1 ) , , ( x n , y n ) are not independent, we may regard X and Y of shifted pairs of ( x 1 , y 2 ) , , ( x n 1 , y n ) , ( x n , y 1 ) to be independent. Thus, we can compute the test statistic values on the shifted pairs, which mimic H 0 , and obtain a histogram as an approximated null distribution. However, in the CI test, as the conditioning set Z exists, we cannot shift { x i } , { y i } , { z i } in order to make them conditionally independent [7,8].
Asymptotic method:
The other approach utilizes the asymptotic distributions of the test statistics [9,10,11]. For some test statistics, their asymptotic distributions are derived. In that case, the asymptotic distribution of a test statistic can be used to approximate the null distribution. Though these asymptotic distributions can be generated efficiently, they are less accurate when the sample size n is small or with a high-dimensional Z [8,12].
Our contributions: In this paper, we propose a new CI test including a novel test statistic and a local bootstrap method to sample from H 0 : X Y Z . In many CI tests, test statistics directly evaluate the conditioning set Z, which becomes difficult when Z is high-dimensional or has a complex density. Our proposed test statistic does not directly access conditioning set Z, which alleviates the curse of dimensionality. Such a test statistic is expected to be more robust for a high-dimensional conditional set. The experiment result showed that our proposed test had a comparable performance when Z is low-dimensional and notably outperformed others when Z is high-dimensional. Moreover, our proposed method can be computed efficiently regarding the growing sample size n and growing dimension of Z. We summarize our main contributions as follows:
We designed a novel test statistic in the following procedure: we first subdivided Z into several local clusters, then measured the unconditional independence in each cluster, and finally, combined the unconditional independence measures into a single number as the measure of conditional independence. In particular, we used k-means to find clusters of Z and the Hilbert–Schmidt Independence Criterion (HSIC) [13] as the measure of unconditional independence in each cluster. We took the sum of the local HSIC values as our test statistic for conditional dependence.
We propose to use a local bootstrap method to sample from the CI case H 0 : X Y Z . We extended the local bootstrap strategy in [14] and showed the theoretical consistency of the bootstrap distribution. The local bootstrap method worked well combined with the proposed test statistic, but can also be applied to other CI tests.
The paper is organized as follows. In Section 2, we discuss some related works on the CI test. In Section 3, we introduce the notations and provide an overview of the HSIC, a kernel-based measure of unconditional independence. In Section 4, we show the details about the test procedure and explain both the test statistic and the local bootstrap method. In Section 5, we compare with other representative CI tests based on the synthetic data. Finally, we summarize our results in Section 6.

2. Related Work

Recently, numerous nonparametric methods have been proposed for CI testing. Many test statistics have been constructed by embedding distributions in Reproducing Kernel Hilbert Spaces (RKHSs). Fukumizu et al. [7] proposed a measure of CI based on cross-covariance operators. However, its asymptotic distribution under the null hypothesis is unknown, and the bin-based permutation degrades as the dimension of conditioning variable Z grows. Later, Zhang et al. [10] proposed the KCIT, based on the partial association of functions in some universal RKHS. A major advantage of the KCIT is a known asymptotic distribution that can be efficiently approximated using Monte Carlo simulations. For the CI test on a large-scale dataset, Strobl et al. [11] proposed the RCIT and RCoT to use random Fourier features to approximate the KCIT efficiently. Huang et al. [15] proposed a Kernel Partial Correlation (KPC), a generalization of a partial correlation to measure conditional dependence. Beyond kernel-based methods, Runge [12] used a Conditional Mutual Information (CMI) estimator as the test statistic and proposed a k-nearest-neighbor-based permutation to generate samples from the null distribution. Shah and Peters [16] proposed a Generalized Covariance Measure (GCM) as the test statistic based on regression method. Doran et al. [8] turned the CI test into a two-sample test by finding a permutation matrix and measuring the Maximum Mean Discrepancy (MMD) [17] between the two distributions. Sen et al. [18] proposed a method called the CCIT, which turns the CI test into a classification problem. In [8,18], they both gave an additional sampling step involving data splitting, potentially reducing the power when the dataset is small. Some other model-powered methods also make use of the GAN [19,20] and Double-GAN [21].
While nonparametric CI tests make no assumption about the joint distribution of X , Y , Z , imposing additional assumptions helps to simplify the problem. Some milder assumptions are considered. In particular, X and Y are assumed to be in function forms of variable Z plus an additive independent noise term, which has a zero mean:
X = f ( Z ) + ε x , Y = g ( Z ) + ε y .
If the estimated noise terms are independent ε x ε y , we conclude that X Y Z [22,23,24,25,26]. The methods in this category need to find a regression function and then test for the unconditional independence of the residuals.
For further details about the different characterizations of CI, see [27]. From a theoretical perspective, Shah and Peters [16] proved there exists no universally valid CI testing for all CI cases. Precisely, no CI test can control Type-I error for all the CI cases while having a higher power against any alternative. However, a desirable CI test is supposed to be computationally efficient.

3. Background on Kernel Methods

This section introduces the notations and gives the basic definitions related to the kernel methods. For further details, see [13,28,29]. We use X , Y , Z and x X , y Y , z Z to represent random variables and their observed samples and use X , Y , Z to denote the associated domains. We considered a positive definite kernel k : X × X R that corresponds to a Hilbert space H and a feature map Ψ : X H such that
k ( x 1 , x 2 ) = Ψ ( x 1 ) , Ψ ( x 2 ) H
for x 1 , x 2 X , where · , · H is the inner product of the Hilbert space H . Such an H is a Reproducing Kernel Hilbert Space (RKHS) with respect to the kernel k, denoted as H k . For example, the Gaussian kernel k ( x 1 , x 2 ) = exp ( x 1 x 2 2 / σ ) is a positive definite kernel, and we considered it a default choice in the paper. Let k be a kernel defined on X and its corresponding RKHS be H k . We fixed a set P of measures.
Definition 1
(Kernel embedding). The kernel embedding of the measure μ into the RKHS H k is the map m k : P H k defined by
P μ m k ( μ ) := k ( · , x ) d μ ( x ) H k .
From the above definition, a direct consequence is
f ( x ) d μ ( x ) = f , m k ( μ ) H k , f H k .
Definition 2
(MMD). The Maximum Mean Discrepancy (MMD) between P , Q P is
M M D ( P , Q ) := m k ( P ) m k ( Q ) H k 2 .
It is easy to see that the MMD takes non-negative values. In particular, for characteristic kernels (e.g., the Gaussian kernel), the MMD ( P , Q ) becomes zero if and only if the measures P , Q coincide [17].
Finally, we considered an unconditional dependence measure for variables X and Y. Let k X and k Y be kernels on X and Y and H k X and H k Y be the corresponding RKHSs. Gretton et al. [13] defined the Hilbert–Schmidt Independence Criterion (HSIC), which can be viewed as the MMD between a measure P X Y of X , Y and the product P X P Y of the marginalized measures P X , P Y . The HSIC is a state-of-the-art dependence measure, which suits both continuous and discrete variables. The HSIC has been well studied as a test statistic in independence testing [10,13,17]. For a characteristic kernel, the HSIC ( X , Y ) is zero if and only if P X Y = P X P Y , which indicates X Y .
More precisely, we may express the HSIC as follows:
Definition 3
(HSIC).
H S I C ( X , Y ) : = m k m k X m k Y H 2 = E X Y [ k X ( X , · ) k Y ( Y , · ) ] E X [ k X ( X , · ) ] E Y [ k Y ( Y , · ) ] H 2 ,
where H is the corresponding RKHS of the kernel k := k X k Y defined by
k ( ( x , y ) , ( x , y ) ) = k X ( x , x ) k Y ( y , y )
for ( x , y ) , ( x , y ) X × Y .
The HSIC ( X , Y ) is known to have an alternative expression:
HSIC ( X , Y ) = E X Y X Y [ C ( X , Y , X , Y ) ]
where C ( X , Y , X , Y ) is
k X ( X , X ) E X k X ( X , X ) k Y ( Y , Y ) E Y k Y ( Y , Y ) ,
and ( X , Y ) are independent copies of ( X , Y ) . Given data points ( x 1 , y 1 ) , , ( x n , y n ) , we considered the following estimator [13]:
HSIC ^ ( X , Y ) = 1 n 2 tr ( K X H K Y H )
where ( K X ) i j = k ( x i , x j ) , ( K Y ) i j = k ( y i , y j ) , H = I 1 n 𝟙 𝟙 T , and 𝟙 is an n vector of ones. Intuitively, we expect an estimator of the HSIC to be a small value when X Y .

4. Proposed Method

In this section, we introduce our proposed method. First, we present a novel test statistic. We considered using characteristic kernels as a default choice, i.e., the Gaussian kernel. Next, we explain the local bootstrap algorithm to generate samples from H 0 : X Y Z . The test is summarized in Algorithm 1. Finally, we discuss the effect of the parameters and provide a time complexity analysis of the overall procedure.
We start by looking at the CI definition: X Y Z means X and Y are independent for any fixed value of Z. Here, we used HSIC ( X , Y Z = z ) : = E X Y X Y [ C ( X , Y , X , Y ) | Z = z ] to represent the HSIC on ( X , Y ) with a fixed Z value, where ( X , Y ) are copies of ( X , Y ) .
X Y Z X Y Z = z , z Z . HSIC ( X , Y Z = z ) = 0 , z Z .
As a direct result, we have the following proposition.
Proposition 1
(Characterization of CI).
X Y Z HSIC ( X , Y Z ) d μ ( Z ) = 0
where μ ( Z ) is the probability measure on Z.
Proof sketch: By definition, HSIC ( X , Y Z = z ) = 0 always takes non-negative values. Thus, for a characteristic kernel, the integral becomes zero if and only if HSIC ( X , Y Z = z ) = 0 , z Z , which indicates X Y Z .
Based on the above fact, conditional dependence can be measured by the marginal unconditional dependence measure. Here, we considered the following procedure to calculate our test statistic:
  • Perform the clustering algorithm to subdivide Z into M clusters, and let its index set be C M .
  • Measure the unconditional dependence HSIC ^ C m ( X , Y ) for each cluster C m .
  • Combine the sum of the values as the test statistic:
    T = m = 1 M HSIC ^ C m ( X , Y ) .
We used the sum of the local unconditional dependence measure as the conditional dependence measure, which is similar in spirit to [4]. Margaritis [4] considered dividing a univariate Z R 1 into local bins and using the product of the local measure as a single number. Our method applies to a high-dimensional Z and takes the sum of kernel-based measures. Given the data ( x i , y i , z i ) , i = 1 , , n , we divided them into M clusters based on the value of Z, and the estimator is
HSIC ^ C m ( X , Y ) = 1 | C m | 2 tr ( K X ( m ) H K Y ( m ) H )
where | C m | is the size of C m and K X ( m ) and K Y ( m ) are the corresponding kernel matrices for samples ( x i , y i ) , i C m . It is easy to see that the conditioning set Z is only used in deciding the local clusters. By doing that, we alleviate the influence of the dimension of Z.

4.1. Local Bootstrap

In this subsection, we introduce the local bootstrap method to sample from H 0 : X Y Z , which completes the CI test. The key is to break the dependence between X and Y while keeping the dependence between ( X , Z ) and ( Y , Z ) . An example of an ideal CI permutation is explained in Figure 1.
In practice, it is impossible to perform the ideal permutation because we do not have enough samples that have the same z. As an alternative method, we used a local bootstrap.
First, given with different z * , we generated ( x * , y * ) independently from the following discrete distribution:
x * G ^ x z * : x 1 x 2 , x n w 1 w 2 , w n , y * G ^ y z * : y 1 y 2 , y n w 1 w 2 , w n ,
where w j = K ( z j z * / γ ) j = 1 n K ( z j z * / γ ) are the probabilities to sample the index j. Under the mild assumptions [30], we show the consistency of the bootstrap distribution at each z * . See the proof in Appendix A.
Proposition 2.
The empirical bootstrap distribution converges to a conditional distribution with each fixed point of z * :
| G ^ x z * G ^ y z * P ( X Z = z * ) P ( Y Z = z * ) | 0 , a s n .
The local bootstrap strategy is an extension of [14], which was original designed for sampling ( x , y ) according to a regression model. We extended it using a Nadaraya–Watson kernel estimator [31] to assign the weights for indexes to be sampled. If z j is close to z * , w j is large; thus, the index j has a larger possibility of being sampled. Moreover, x * and y * are sampled independently. Thus, it is less possible for both x j and y j to be sampled simultaneously, which breaks the dependence between X and Y. Shi [14] suggested that the bandwidth γ should be varied for different z * . Here, we narrowed the candidates from 1 , , n to the 10-nearest neighbors of each z * and let the local bandwidth γ of z * be the squared Euclidean distance between z * and its 10-th nearest neighbor.
The local bootstrap is summarized in Algorithm 1. Each time, we generated n samples as if they come from H 0 : X Y Z and calculated the test statistic value T. We repeated this K-times on the generated samples and calculated the p-value based on the histogram. We reject H 0 if the p value is smaller than a predefined significance level. Otherwise, we accept H 0 . We summarize the procedure in Algorithm 2.
Algorithm 1:Local bootstrap
Input: Data ( x i , y i , z i ) , i = 1 , , n .
Output: New samples: ( x i * , y i * , z i * ) , i = 1 , , n
1
for i 1 tondo
2
     Let z i * = z i
3
     Sample x i * G ^ x z i * , y i * G ^ y z i * .
4
end
Algorithm 2:Test
Input: Data ( x i , y i , z i ) , i = 1 , , n .
            Cluster number M.
            Times to repeat K.
Output: p-value.
1
Find M clusters.
2
Estimate the T.
3
for k 1 toKdo
4
     Generate samples with Algorithm 1
5
     Estimate T i on the generated samples.
6
end
7
Compute the p-value:
p = 1 K t = 1 T { T k T } .

4.2. Effect of M

The choice of the cluster number M affects the test performance. M needs to be chosen properly so that we can focus on the pairs ( x , y ) who have similar z values while having enough pairs ( x , y ) in each local cluster to make a good estimation of the local HSIC. Besides, the number of M has an effect on the computational complexity: a smaller M makes bigger clusters on average and takes more time to find local HSICs. In practice, we fixed the average cluster size and used k-means to decide the clusters. We let M be n / C ¯ , where x takes the least integer that is not smaller than x, and C ¯ is the average cluster size.

4.3. Complexity Analysis

We discuss the time complexity of the test procedure. In the beginning, our method found M clusters and weights w in the bootstrap. Both were calculated once and took little time. The major computational cost was in repeatedly finding the test statistic T k . Estimating T scales less than O ( M | C ˜ | 2 ) , where M is the number of clusters and the | C ˜ | is the maximum set size among all clusters and is smaller than n. This was repeated T-times over the generated samples to construct the histogram of the null distribution. The test took O ( M n 2 K ) . The bootstrap part can be easily parallelized to promote the speed further, but this was beyond the scope of the paper.

5. Experiments

In this section, we compare the proposed methods with other nonparametric CI tests. We denote our proposed method as a Bundle of HSICs (BHSIC). The evaluation was focused on the Type-I error rate, Type-II error rate, and runtime. A lower Type-II error rate and computational efficiency are essential for a good CI test. In particular, we compared with some representative methods: KCIT [10], RCIT, RCoT [11], CCIT [18], and CMIknn [12]. For details about these methods, see Section 2. All methods have source codes that are available online. Different methods are implemented in different programming languages, and we focused on how these methods scale with the sample size and the dimension of Z instead of a direct comparison of the runtimes.
We were interested in the performance of the methods with different settings. In our simulations, we considered the following two models. The first model was a simple linear regression model. The second model was a post-nonlinear noise model, which is a commonly used setting in evaluating CI tests [10,11,12]. The functional forms of X and Y on Z are as follows:
Model 1 : X = i = 1 d Z α i Z i + c ε b + ε 1 , Y = i = 1 d Z β i Z i + c ε b + ε 2 , Model 2 : X = g 1 ( i = 1 d Z Z i + c ε b + ε 1 ) , Y = g 2 ( i = 1 d Z Z i + c ε b + ε 2 ) ,
where Z = ( Z 1 , , Z d Z ) , ε 1 , ε 2 , and ε b are independent standard Gaussian. The coefficients α i , β i U n i f o r m ( 0.5 / d z , 0.5 / d z ) and the functions g 1 ( · ) and g 2 ( · ) were uniformly chosen from { ( · ) , ( · ) 2 , ( · ) 3 , tan h ( · ) , exp ( · 2 ) } . We considered (a) H 0 : X Y Z with c = 0 and (b) H 1 : X Y Z with c = 1 .
In the following simulations, we studied the test performance on different sample sizes and dimensions of Z. The sample sizes n varied from { 100 , 200 , 400 , 600 , 800 } with fixed dimensions of d Z = 1 and d Z = 10 . The dimensions d Z varied from { 1 , 2 , 5 , 10 , 20 } with a fixed sample size of n = 400 . We also studied the effect of the cluster number M in our proposed method. The significance levels were set to be α = 0.05 in all the simulations. The evaluations of the Type-I error rate, Type-II error rate, and mean runtimes are reported over 100 replications. The Type-I error rate is the false rejection percentage when the underlying truth is H 0 : X Y Z with c = 0 , and the Type-II error rate is the false acceptance percentage when the underlying truth is H 1 : X Y Z with c = 1 . Runtime is the average time to perform one test.

5.1. Hyperparameters

The choice of the hyperparameters affects the results. For the KCIT, RCIT, and RCoT, the bandwidths in the Gaussian kernels were set to be the squared median Euclidean distance between ( X , Y ) using all the pairs (or the first 500 pairs if n > 500 ) double the conditioning set size, which was recommended in [11]. Thew CMIknn has two hyperparameters: the neighbor size k CMI = 0.1 n in finding the estimator of the CMI, and k p e r m = 5 in the permutation, respectively. The permutation in CMIknn was repeated 1000-times as the default [12].
In our proposed methods, the bandwidths were set to be the squared median Euclidean distance between ( X , Y ) in each local cluster. The number of clusters M was set to be n / 50 when n < = 200 and n / 80 when n > 200 , where x takes the least integer that is bigger than or equal to x. On average, each cluster had 50 samples when n < = 200 and 80 samples otherwise. The local bootstrap was repeated 1000-times.

5.2. When Z Is Low-Dimensional

We first examined the test performance when Z is generated independently of a standard Gaussian distribution. The sample size n changed from 100 to 800. The simulation results on Linear Model 1 and Nonlinear Model 2 are reported in Figure 2 and Figure 3, respectively. Both the Type-I error rate and Type-II error rate are reported. Because runtime is independent of the model, we only report it in Figure 3.
The Linear Model 1 setting with a single conditioning variable Z is very simple. All methods had controlled Type-I error rates around α = 0.05 and almost zero Type-II error rates, except for the CCIT. In our experiments, the performance of the CCIT was constantly among the worst. The data splitting procedure in the CCIT seems to reduce the power of the test when the sample size is small. In the Nonlinear Model 2 setting, all methods had controlled Type-I error rates around α = 0.05 . However, it was shown that the proposed method and the CMIknn had better powers against the others when the sample size n was smaller. This matched the result in [12] that CMIknn performed well with a low-dimensional conditioning set Z. When the sample size n was larger than 400, most methods had relatively low Type-II error rates. From the runtime plot, the proposed method was less efficient than the KCIT, RCIT, and RCoT, which are based on an asymptotic distribution. Though the BHSIC and CMIknn were slower, the sampling procedure can readily be parallelized.

5.3. When Z Is High-Dimensional

We next examined the test performance when Z was a set of 10 variables, and each variable in conditioning set Z was generated independently from a standard Gaussian distribution. The sample size n changed from 100 to 800. The simulation results on Linear Model 1 and Nonlinear Model 2 are reported in Figure 4 and Figure 5, respectively.
In both linear and nonlinear settings, the RCIT and RCoT failed and had high Type-I error rates. The RCIT and RCoT approximated the KCIT by using random Fourier features and were designed for large-scale datasets. Though they are more scalable than the KCIT, their performances were poor when the sample size was relatively small. The KCIT, CMIknn, and BHSIC performed well in the linear model setting. In the nonlinear model setting, the KCIT showed greater Type I error rates and Type II error rates because the high-dimensional Z led to a less accurate estimation of the asymptotic distribution. We noticed that the BHSIC showed a higher power than the other methods. As we expected, it was beneficial to avoid evaluating the high-dimensional Z directly, which made the method more robust.

5.4. When d Z Changes

Next, we examined the performance when the dimension of Z changes. We fixed the sample size n = 400 and changed d Z from 1 to 20 in Nonlinear Model 2. The results are shown in Figure 6. Our proposed method performed well against the growth of the dimension of Z and showed a higher power than other methods. Moreover, the dimension of Z did not affect the runtimes since Z was used in k-means only once, which coincided with our complexity analysis.

5.5. Effect on M

Now, we study the effect on the cluster number M. We fixed the sample size n = 400 in Nonlinear Model 2 and changed M from 2 to 20. We examined both low-dimensional ( d Z = 1 ) and high-dimensional ( d Z = 10 ) cases. The results are shown in Figure 7.
We noticed that the Type I error rates were controlled when d Z = 1 and d Z = 10 . As the cluster number M grew to more than 10, the Type-II error rate increased, but the runtimes reduced. The reason is that when we divided the sample points into more clusters, each cluster had fewer points. Thus, the estimation of each local HSIC value became less accurate. On the other hand, the computational cost of the proposed test reduced as the clusters became smaller when the cluster number M grew. The number of samples in each cluster depends on the choice of the clustering algorithms as well. In our experiment, we simply used naive k-means.

6. Conclusions

In this paper, we proposed a novel CI test including a new test statistic and a local bootstrap method to generate samples from the null hypothesis. We first performed clustering to avoid directly evaluating the high-dimensional conditioning set Z. Then, we used the clustering result and combined several local dependence measures as a measure of conditional dependence. Consequently, the problems caused by a high-dimensional Z can be suppressed. The experimental results showed that our method is robust and performs well against the growth of the dimension of the conditioning set.

Author Contributions

Methodology, B.Z. and J.S.; validation and writing, B.Z.; supervision, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Japan Science and Technology Agency, Grant Number JPMJSP2138 and the Grant-in-Aid for Scientific Research (C) 18K11192.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The evaluation in the paper is based on synthetic data described above.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Proposition 2

We assumed that
X = g x ( Z ) + ϵ , Y = g y ( Z ) + ϵ ,
and the local bootstrap method generates samples from X Y Z .
For each z i * = z i , the empirical conditional probability density functions are
G ^ x z * ( x < t ) = j = 1 n w j 1 x j < t ,
G ^ y z * ( y < t ) = j = 1 n w j 1 y j < t .
where w j = K z j z * γ j = 1 n K z j z * γ . We used G x z * and G y z * to denote the underlying truth:
G x z * ( x < t ) = P ( X < t Z = z * ) , G y z * ( y < t ) = P ( Y < t Z = z * ) .
Consider pairs ( x , z ) R × R d . Devroye and Wagner (1980) proved the L p convergence of g ^ x ( z ) : = j = 1 n w j x j to g x ( z ) :
A1
E [ | Z | p ] , p 1 .
A2
γ 0 as n .
A3
n γ d 0 as n .
A4
The kernel K satisfies the following:
K is a nonnegative function on R d bounded by k * < .
K has a compact support A.
k β 1 B for some β > 0 and some closed sphere B centered at the origin with positive radius.
Lemma A1
(Devroye and Wagner, 1980). Under the above assumption, we have
E X [ | g ^ x ( z ) g x ( z ) | p d z ] 0 , as n .
Let x i = 1 x i < t , and notice that
1 x i < t = P ( X i < t ) + e i
where E ( e i ) = 0 and sup i E ( e i k ) 1 for all k 2 . This can be viewed as another non-parametric regression model as in (1). We have
E X [ | G ^ x z G x z | p d z ] 0 , as n .
Using Lemma 1, we proved the L p convergence of g ^ x ( z ) g ^ y ( z ) :
Lemma A2.
E X Y [ | g ^ x ( z ) g ^ y ( z ) g x ( z ) g y ( z ) | p d z ] 0 , as n .
Proof:
 
E X Y [ | g ^ x ( z ) g ^ y ( z ) g x ( z ) g y ( z ) | p d z ] = E X Y [ | g ^ x ( z ) { g ^ y ( z ) g y ( z ) } + g y ( z ) { g ^ x ( z ) g x ( z ) } | p d z ] E X Y [ | g ^ x ( z ) | p | g ^ y ( z ) g y ( z ) | p d z ] + E X Y [ | g y ( z ) | p | g ^ x ( z ) g x ( z ) | p d z ] ( k * ) p { E Y [ | g ^ y ( z ) g y ( z ) | p d z ] + E X [ | g ^ x ( z ) g x ( z ) | p d z ] }
and the right side becomes 0 as n .
Similarly, we have
E X Y [ | G ^ x z G ^ y z G x z G y z | p d z ] 0 , as n ,
and
| G ^ x z * G ^ y z * G x z * G y z * | 0 , a s n .

References

  1. Pearl, J. Causality: Models, Reasoning and Inference; Cambridge University Press: Cambridge, UK, 2000; Volume 19. [Google Scholar]
  2. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  3. Spirtes, P.; Glymour, C.; Acheines, R. Causation, Prediction, and Search; The MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
  4. Margaritis, D. Distribution-Free Learning of Bayesian Network Structure in Continuous Domains. In AAAI’05: Proceedings of the 20th National Conference on Artificial Intelligence—Volume 2; AAAI Press: Palo Alto, CA, USA, 2005; pp. 825–830. [Google Scholar]
  5. Lawrance, A.J. On Conditional and Partial Correlation. Am. Stat. 1976, 30, 146–149. [Google Scholar] [CrossRef]
  6. Bergsma, W.P. Testing Conditional Independence for Continuous Random Variables; Eurandom: Eindhoven, The Netherlands, 2004. [Google Scholar]
  7. Fukumizu, K.; Gretton, A.; Sun, X.; Schölkopf, B. Kernel Measures of Conditional Dependence. In Proceedings of the Advances in Neural Information Processing Systems; Platt, J., Koller, D., Singer, Y., Roweis, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2008; Volume 20. [Google Scholar]
  8. Doran, G.; Muandet, K.; Zhang, K.; Schölkopf, B. A Permutation-Based Kernel Conditional Independence Test. In Proceedings of the UAI, Quebec City, QC, Canada, 23–27 July 2014; pp. 132–141. [Google Scholar]
  9. Huang, T.M. Testing conditional independence using maximal nonlinear conditional correlation. Ann. Stat. 2010, 38, 2047–2091. [Google Scholar] [CrossRef] [Green Version]
  10. Zhang, K.; Peters, J.; Janzing, D.; Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. arXiv 2012, arXiv:1202.3775. [Google Scholar]
  11. Strobl, E.V.; Zhang, K.; Visweswaran, S. Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery. J. Causal Inference 2019, 7, 20180017. [Google Scholar] [CrossRef] [Green Version]
  12. Runge, J. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Lanzarote, Spain, 9 –11 April 2018; pp. 938–947. [Google Scholar]
  13. Gretton, A.; Fukumizu, K.; Teo, C.H.; Song, L.; Schölkopf, B.; Smola, A.J. A kernel statistical test of independence. Adv. Neural Inf. Process. Syst. 2007, 20, 585–592. [Google Scholar]
  14. Shi, S.G. Local bootstrap. Ann. Inst. Stat. Math. 1991, 43, 667–676. [Google Scholar] [CrossRef]
  15. Huang, Z.; Deb, N.; Sen, B. Kernel Partial Correlation Coefficient—A Measure of Conditional Dependence. arXiv 2020, arXiv:2012.14804. [Google Scholar]
  16. Shah, R.D.; Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Ann. Stat. 2020, 48, 1514–1538. [Google Scholar] [CrossRef]
  17. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  18. Sen, R.; Suresh, A.T.; Shanmugam, K.; Dimakis, A.G.; Shakkottai, S. Model-powered conditional independence test. arXiv 2017, arXiv:1709.06138. [Google Scholar]
  19. Bellot, A.; van der Schaar, M. Conditional independence testing using generative adversarial networks. Adv. Neural Inf. Process. Syst. 2019, 2202–2211. [Google Scholar]
  20. Ahuja, K.; Sattigeri, P.; Shanmugam, K.; Wei, D.; Ramamurthy, K.N.; Kocaoglu, M. Conditionally independent data generation. In Proceedings of the Uncertainty in Artificial Intelligence, Online, 27–30 July 2021; pp. 2050–2060. [Google Scholar]
  21. Shi, C.; Xu, T.; Bergsma, W.; Li, L. Double Generative Adversarial Networks for Conditional Independence Testing. J. Mach. Learn. Res. 2021, 22, 285–1. [Google Scholar]
  22. Shimizu, S.; Hoyer, P.O.; Hyvärinen, A.; Kerminen, A.; Jordan, M. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 2006, 7, 2003–2030. [Google Scholar]
  23. Zhang, K.; Hyvarinen, A. On the identifiability of the post-nonlinear causal model. arXiv 2012, arXiv:1205.2599. [Google Scholar]
  24. Zhang, H.; Zhou, S.; Zhang, K.; Guan, J. Causal Discovery Using Regression-Based Conditional Independence Tests. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 1250–1256. [Google Scholar]
  25. Zhang, Q.; Filippi, S.; Flaxman, S.; Sejdinovic, D. Feature-to-Feature Regression for a Two-Step Conditional Independence Test. In Proceedings of the Uncertainty in Artificial Intelligence, Sydney, Australia, 11–15 August 2017. [Google Scholar]
  26. Zhang, H.; Zhang, K.; Zhou, S.; Guan, J.; Zhang, J. Testing Independence Between Linear Combinations for Causal Discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 6538–6546. [Google Scholar]
  27. Li, C.; Fan, X. On nonparametric conditional independence tests for continuous variables. In Wiley Interdisciplinary Reviews: Computational Statistics; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2019; Volume 12. [Google Scholar] [CrossRef] [Green Version]
  28. Suzuki, J. Kernel Methods for Machine Learning with R; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  29. Zhang, Q.; Filippi, S.; Gretton, A.; Sejdinovic, D. Large-scale kernel methods for independence testing. Stat. Comput. 2018, 28, 113–130. [Google Scholar] [CrossRef] [Green Version]
  30. Devroye, L.P.; Wagner, T.J. Distribution-Free Consistency Results in Nonparametric Discrimination and Regression Function Estimation. Ann. Stat. 1980, 8, 231–239. [Google Scholar] [CrossRef]
  31. Nadaraya, E.A. On estimating regression. Theory Probab. Appl. 1964, 9, 141–142. [Google Scholar] [CrossRef]
Figure 1. An ideal permutation in the CI. Given the data, we first divided different bins (green, red, and blue), and each bin x ˜ includes samples that have the same z. From that, we fixed y and z and shuffled x within each bin with some permutations ( π 1 , π 2 , π 3 ) to generate new data ( x ˜ , y , z ) . An ideal permutation successfully generates samples that keep the dependence between ( X , Z ) and ( Y , Z ) while satisfying X Y .
Figure 1. An ideal permutation in the CI. Given the data, we first divided different bins (green, red, and blue), and each bin x ˜ includes samples that have the same z. From that, we fixed y and z and shuffled x within each bin with some permutations ( π 1 , π 2 , π 3 ) to generate new data ( x ˜ , y , z ) . An ideal permutation successfully generates samples that keep the dependence between ( X , Z ) and ( Y , Z ) while satisfying X Y .
Entropy 25 00425 g001
Figure 2. Simulation results on Linear Model 1 ( d Z = 1 ). The significance level is α = 0.05 . Type-I error rates and Type-II error rates are reported.
Figure 2. Simulation results on Linear Model 1 ( d Z = 1 ). The significance level is α = 0.05 . Type-I error rates and Type-II error rates are reported.
Entropy 25 00425 g002
Figure 3. Simulation results on Nonlinear Model 2 ( d Z = 1 ). The significance level is α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Figure 3. Simulation results on Nonlinear Model 2 ( d Z = 1 ). The significance level is α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Entropy 25 00425 g003
Figure 4. Simulation results on Linear Model 1 ( d Z = 10 ) . The significance level α = 0.05 . Type-I error rates and Type-II error rates are reported.
Figure 4. Simulation results on Linear Model 1 ( d Z = 10 ) . The significance level α = 0.05 . Type-I error rates and Type-II error rates are reported.
Entropy 25 00425 g004
Figure 5. Simulation results on Nonlinear Model 2 ( d Z = 10 ) . The significance level α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Figure 5. Simulation results on Nonlinear Model 2 ( d Z = 10 ) . The significance level α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Entropy 25 00425 g005
Figure 6. Simulation results on different dimensions of Z ( d Z = 1 , 5 , 10 , 15 , 20 ). The sample size n = 400 was fixed. The significant level α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Figure 6. Simulation results on different dimensions of Z ( d Z = 1 , 5 , 10 , 15 , 20 ). The sample size n = 400 was fixed. The significant level α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Entropy 25 00425 g006
Figure 7. Simulation results on a different cluster number M. The sample size n = 400 was fixed. Results on different dimensionality of Z are reported ( d Z = 1 , red line; d Z = 10 , blue line). The significant level α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Figure 7. Simulation results on a different cluster number M. The sample size n = 400 was fixed. Results on different dimensionality of Z are reported ( d Z = 1 , red line; d Z = 10 , blue line). The significant level α = 0.05 . Type-I error rates, Type-II error rates, and mean runtimes are reported.
Entropy 25 00425 g007
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, B.; Suzuki, J. Extending Hilbert–Schmidt Independence Criterion for Testing Conditional Independence. Entropy 2023, 25, 425. https://doi.org/10.3390/e25030425

AMA Style

Zhang B, Suzuki J. Extending Hilbert–Schmidt Independence Criterion for Testing Conditional Independence. Entropy. 2023; 25(3):425. https://doi.org/10.3390/e25030425

Chicago/Turabian Style

Zhang, Bingyuan, and Joe Suzuki. 2023. "Extending Hilbert–Schmidt Independence Criterion for Testing Conditional Independence" Entropy 25, no. 3: 425. https://doi.org/10.3390/e25030425

APA Style

Zhang, B., & Suzuki, J. (2023). Extending Hilbert–Schmidt Independence Criterion for Testing Conditional Independence. Entropy, 25(3), 425. https://doi.org/10.3390/e25030425

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop