1. Introduction
The Conditional Independence (CI) test is a statistical hypothesis test that examines whether variables X and Y are conditionally independent given another variable Z, denoted X ⫫ Y | Z, when we observe the actual values of the three variables. The CI test plays a critical role in Bayesian network structure learning [1,2] and causal discovery [3].
The task is relatively easy when the sample size n is large and the variable Z is discrete, because then we can test the independence of X and Y for each value of Z [4]. On the other hand, if (X, Y, Z) have a joint Gaussian distribution, then CI reduces to a zero partial correlation between X and Y given Z [5], which can also be easily tested. In this paper, we considered (X, Y, Z) without making any assumption on the joint distribution; X, Y, and Z can be either continuous or discrete variables. The problem becomes challenging with a growing dimension of Z due to the curse of dimensionality [6], since Z may be a set of variables or a multidimensional random vector.
Another major challenge in CI tests is the need to sample from the null distribution. In general, statistical hypothesis tests require us to obtain the distribution of the test statistic under the null hypothesis H0. However, when we are only given the observations, the exact distribution of any test statistic under the CI case (X ⫫ Y | Z) is unknown. The two approaches below are the most popular ways to obtain an approximate null distribution:
- • Permutation method: One approach is to permute the observed samples. In the unconditional independence test, where H0 is X ⫫ Y, although X and Y within each observed pair (x_i, y_i) are not independent, we may regard X and Y of shifted pairs (x_i, y_{i+1}) as independent. Thus, we can compute the test statistic values on the shifted pairs, which mimic X ⫫ Y, and obtain a histogram as an approximate null distribution. However, in the CI test, because the conditioning set Z exists, we cannot simply shift the pairs to make them conditionally independent [7,8].
- • Asymptotic method: The other approach utilizes the asymptotic distributions of the test statistics [9,10,11]. For some test statistics, asymptotic distributions have been derived; in that case, the asymptotic distribution of a test statistic can be used to approximate the null distribution. Though these asymptotic distributions can be generated efficiently, they are less accurate when the sample size n is small or the conditioning variable Z is high-dimensional [8,12].
Our contributions: In this paper, we propose a new CI test consisting of a novel test statistic and a local bootstrap method to sample from the null case X ⫫ Y | Z. In many CI tests, the test statistic directly evaluates the conditioning set Z, which becomes difficult when Z is high-dimensional or has a complex density. Our proposed test statistic does not directly access the conditioning set Z, which alleviates the curse of dimensionality; such a test statistic is expected to be more robust for a high-dimensional conditioning set. The experimental results showed that our proposed test had a comparable performance when Z is low-dimensional and notably outperformed others when Z is high-dimensional. Moreover, our proposed method can be computed efficiently with respect to a growing sample size n and a growing dimension of Z. We summarize our main contributions as follows:
- • We designed a novel test statistic by the following procedure: we first subdivided Z into several local clusters, then measured the unconditional independence in each cluster, and finally combined the unconditional independence measures into a single number as the measure of conditional independence. In particular, we used k-means to find clusters of Z and the Hilbert–Schmidt Independence Criterion (HSIC) [13] as the measure of unconditional independence in each cluster. We took the sum of the local HSIC values as our test statistic for conditional dependence.
- • We propose a local bootstrap method to sample from the CI case X ⫫ Y | Z. We extended the local bootstrap strategy in [14] and showed the theoretical consistency of the bootstrap distribution. The local bootstrap method works well combined with the proposed test statistic, but it can also be applied to other CI tests.
The paper is organized as follows. In Section 2, we discuss related work on the CI test. In Section 3, we introduce the notation and provide an overview of the HSIC, a kernel-based measure of unconditional independence. In Section 4, we detail the test procedure and explain both the test statistic and the local bootstrap method. In Section 5, we compare our method with other representative CI tests on synthetic data. Finally, we summarize our results in Section 6.
2. Related Work
Recently, numerous nonparametric methods have been proposed for CI testing. Many test statistics have been constructed by embedding distributions in Reproducing Kernel Hilbert Spaces (RKHSs). Fukumizu et al. [7] proposed a measure of CI based on cross-covariance operators. However, its asymptotic distribution under the null hypothesis is unknown, and its bin-based permutation degrades as the dimension of the conditioning variable Z grows. Later, Zhang et al. [10] proposed the KCIT, based on the partial association of functions in a universal RKHS. A major advantage of the KCIT is a known asymptotic distribution that can be efficiently approximated using Monte Carlo simulations. For the CI test on large-scale datasets, Strobl et al. [11] proposed the RCIT and RCoT, which use random Fourier features to approximate the KCIT efficiently. Huang et al. [15] proposed the Kernel Partial Correlation (KPC), a generalization of the partial correlation to measure conditional dependence. Beyond kernel-based methods, Runge [12] used a Conditional Mutual Information (CMI) estimator as the test statistic and proposed a k-nearest-neighbor-based permutation to generate samples from the null distribution. Shah and Peters [16] proposed the Generalized Covariance Measure (GCM) as a regression-based test statistic. Doran et al. [8] turned the CI test into a two-sample test by finding a permutation matrix and measuring the Maximum Mean Discrepancy (MMD) [17] between the two distributions. Sen et al. [18] proposed a method called the CCIT, which turns the CI test into a classification problem. The methods in [8,18] both involve an additional sampling step with data splitting, potentially reducing the power when the dataset is small. Some other model-powered methods make use of the GAN [19,20] and the Double-GAN [21].
While nonparametric CI tests make no assumption about the joint distribution of (X, Y, Z), imposing additional assumptions helps to simplify the problem. In particular, under one popular milder assumption, X and Y are assumed to be functions of the variable Z plus additive independent noise terms with zero means: X = f(Z) + E_X and Y = g(Z) + E_Y. If the estimated noise terms are independent, E_X ⫫ E_Y, we conclude that X ⫫ Y | Z [22,23,24,25,26]. The methods in this category need to find a regression function and then test the unconditional independence of the residuals.
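As a minimal illustration of this regression-based family, the sketch below uses ordinary linear least squares for the regression step and the absolute residual correlation as a crude stand-in for a full independence test such as the HSIC; both simplifications are our own choices, not those of the cited methods.

```python
import numpy as np

def residual_ci_statistic(x, y, z):
    """Regress x and y on z (linear least squares with intercept),
    then return the absolute correlation of the residuals as a
    crude dependence measure between the estimated noise terms."""
    Z1 = np.column_stack([z, np.ones(len(z))])       # design matrix with intercept
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    return abs(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
z = rng.normal(size=500)
# CI case: X and Y depend on each other only through Z
x0 = z + rng.normal(size=500)
y0 = z + rng.normal(size=500)
# non-CI case: Y depends on X directly
x1 = z + rng.normal(size=500)
y1 = z + x1 + rng.normal(size=500)
```

Under the CI case, the residual correlation is close to zero, while a direct X–Y link leaves a clearly non-zero residual correlation.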
For further details about the different characterizations of CI, see [27]. From a theoretical perspective, Shah and Peters [16] proved that there exists no universally valid CI test for all CI cases. Precisely, no CI test can control the Type-I error for all CI cases while having a higher power against every alternative. In addition to validity, a desirable CI test should be computationally efficient.
3. Background on Kernel Methods
This section introduces the notation and gives the basic definitions related to kernel methods. For further details, see [13,28,29]. We use upper-case letters such as X, Y, and Z to represent random variables, lower-case letters for their observed samples, and 𝒳, 𝒴, and 𝒵 to denote the associated domains. We considered a positive definite kernel k that corresponds to a Hilbert space ℋ and a feature map φ such that k(x, x′) = ⟨φ(x), φ(x′)⟩_ℋ for x, x′ ∈ 𝒳, where ⟨·, ·⟩_ℋ is the inner product of the Hilbert space ℋ. Such an ℋ is a Reproducing Kernel Hilbert Space (RKHS) with respect to the kernel k, denoted as ℋ_k. For example, the Gaussian kernel k(x, x′) = exp(−‖x − x′‖²/(2σ²)) is a positive definite kernel, and we considered it the default choice in this paper. Let k be a kernel defined on 𝒳 and ℋ_k be its corresponding RKHS. We fixed a set 𝒫 of probability measures.
Definition 1 (Kernel embedding). The kernel embedding of a measure μ ∈ 𝒫 into the RKHS ℋ_k is the map μ ↦ m_μ := ∫ k(·, x) dμ(x). From the above definition, a direct consequence is ⟨m_μ, f⟩_{ℋ_k} = ∫ f(x) dμ(x) for all f ∈ ℋ_k.
Definition 2 (MMD). The Maximum Mean Discrepancy (MMD) between μ, ν ∈ 𝒫 is MMD(μ, ν) := ‖m_μ − m_ν‖_{ℋ_k}. It is easy to see that the MMD takes non-negative values. In particular, for characteristic kernels (e.g., the Gaussian kernel), the MMD becomes zero if and only if the measures μ and ν coincide [17].
Finally, we considered an unconditional dependence measure for the variables X and Y. Let k_𝒳 and k_𝒴 be kernels on 𝒳 and 𝒴, and let ℋ_{k_𝒳} and ℋ_{k_𝒴} be the corresponding RKHSs. Gretton et al. [13] defined the Hilbert–Schmidt Independence Criterion (HSIC), which can be viewed as the MMD between the joint measure μ_{XY} of (X, Y) and the product μ_X ⊗ μ_Y of the marginal measures. The HSIC is a state-of-the-art dependence measure that suits both continuous and discrete variables and has been well studied as a test statistic in independence testing [10,13,17]. For a characteristic kernel, HSIC(X, Y) is zero if and only if μ_{XY} = μ_X ⊗ μ_Y, which indicates X ⫫ Y.
More precisely, we may express the HSIC as follows:
Definition 3 (HSIC). HSIC(X, Y) := MMD²(μ_{XY}, μ_X ⊗ μ_Y) = ‖m_{μ_{XY}} − m_{μ_X ⊗ μ_Y}‖²_{ℋ_k}, where ℋ_k is the corresponding RKHS of the product kernel k defined by k((x, y), (x′, y′)) = k_𝒳(x, x′) k_𝒴(y, y′) for (x, y), (x′, y′) ∈ 𝒳 × 𝒴.
The HSIC is known to have an alternative expression:
HSIC(X, Y) = E[k_𝒳(X, X′) k_𝒴(Y, Y′)] + E[k_𝒳(X, X′)] E[k_𝒴(Y, Y′)] − 2 E[ E[k_𝒳(X, X′) | X] E[k_𝒴(Y, Y′) | Y] ],
where (X′, Y′) is an independent copy of (X, Y). Given data points {(x_i, y_i)}_{i=1}^{n}, we considered the following estimator [13]:
HSIC_n(X, Y) = (1/n²) tr(K H L H),
where K_{ij} = k_𝒳(x_i, x_j), L_{ij} = k_𝒴(y_i, y_j), H = I − (1/n) 1 1ᵀ, and 1 is an n-vector of ones. Intuitively, we expect an estimator of the HSIC to take a small value when X ⫫ Y.
4. Proposed Method
In this section, we introduce our proposed method. First, we present a novel test statistic. We considered characteristic kernels, i.e., the Gaussian kernel, as the default choice. Next, we explain the local bootstrap algorithm to generate samples from the null case X ⫫ Y | Z. The overall test is summarized in Algorithm 2. Finally, we discuss the effect of the parameters and provide a time complexity analysis of the overall procedure.
We start by looking at the definition of CI: X ⫫ Y | Z means that X and Y are independent for any fixed value of Z. Here, we used HSIC(X_z, Y_z) to represent the HSIC on (X, Y) with a fixed Z value z, where (X_z, Y_z) are copies of (X, Y) given Z = z.
As a direct result, we have the following proposition.
Proposition 1 (Characterization of CI). X ⫫ Y | Z if and only if ∫ HSIC(X_z, Y_z) dP_Z(z) = 0, where P_Z is the probability measure of Z.
Proof sketch: By definition, HSIC(X_z, Y_z) always takes non-negative values. Thus, for a characteristic kernel, the integral becomes zero if and only if HSIC(X_z, Y_z) = 0 for P_Z-almost every z, which indicates X ⫫ Y | Z.
Based on the above fact, conditional dependence can be measured through marginal unconditional dependence measures. Here, we considered the following procedure to calculate our test statistic:
1. Perform a clustering algorithm to subdivide the samples into M clusters according to their Z values, and let the index set of cluster m be C_m.
2. Measure the unconditional dependence HSIC_m for each cluster C_m.
3. Combine the values by taking their sum as the test statistic: T = Σ_{m=1}^{M} HSIC_m.
We used the sum of the local unconditional dependence measures as the conditional dependence measure, which is similar in spirit to [4]. Margaritis [4] considered dividing a univariate Z into local bins and combining the local measures into a single number by their product. Our method applies to a high-dimensional Z and takes the sum of kernel-based measures. Given the data {(x_i, y_i, z_i)}_{i=1}^{n}, we divided them into M clusters based on the value of Z, and the estimator is
T = Σ_{m=1}^{M} (1/n_m²) tr(K_m H_m L_m H_m),
where n_m is the size of C_m, H_m is the n_m × n_m centering matrix, and K_m and L_m are the corresponding kernel matrices for the samples in C_m. It is easy to see that the conditioning set Z is only used in deciding the local clusters. By doing so, we alleviate the influence of the dimension of Z.
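Putting the three steps together, the following sketch clusters Z with a minimal Lloyd's-style k-means and sums Gaussian-kernel HSIC values over the clusters; the fixed bandwidths and the tiny k-means are our illustrative stand-ins, not the exact implementation used in the experiments.

```python
import numpy as np

def gaussian_gram(a, bw=1.0):
    d2 = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bw))

def hsic(x, y):
    # Biased empirical HSIC, (1/n^2) tr(K H L H)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(gaussian_gram(x) @ H @ gaussian_gram(y) @ H)) / n ** 2

def kmeans_labels(z, M, iters=20, seed=0):
    # Minimal Lloyd's algorithm; any off-the-shelf k-means would do.
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), M, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for m in range(M):
            if np.any(labels == m):
                centers[m] = z[labels == m].mean(axis=0)
    return labels

def bhsic_statistic(x, y, z, M=5):
    """Sum of local HSIC values over M clusters of Z."""
    labels = kmeans_labels(z, M)
    return sum(hsic(x[labels == m], y[labels == m])
               for m in range(M) if np.sum(labels == m) > 1)

rng = np.random.default_rng(3)
z = rng.normal(size=(400, 3))
x = z.sum(axis=1, keepdims=True) + rng.normal(size=(400, 1))
y_ci = z.sum(axis=1, keepdims=True) + rng.normal(size=(400, 1))  # X ind. of Y given Z
y_dep = x + 0.1 * rng.normal(size=(400, 1))                      # direct X-Y link
t_ci = bhsic_statistic(x, y_ci, z)
t_dep = bhsic_statistic(x, y_dep, z)
```

Note that Z only enters through the cluster labels, so the kernel computations never touch the (possibly high-dimensional) conditioning variable itself.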
4.1. Local Bootstrap
In this subsection, we introduce the local bootstrap method to sample from the null case X ⫫ Y | Z, which completes the CI test. The key is to break the dependence between X and Y while keeping the dependence between X and Z and between Y and Z. An example of an ideal CI permutation is explained in Figure 1.
In practice, it is impossible to perform the ideal permutation because we do not have enough samples that share the same z. As an alternative, we used a local bootstrap.
First, given the samples {(x_i, y_i, z_i)}_{i=1}^{n} with different z_i, we generated new pairs (x̃_i, ỹ_i) independently from the following discrete distribution: P(x̃_i = x_j) = w_j(z_i) and P(ỹ_i = y_j) = w_j(z_i), where w_j(z_i) are the probabilities to sample the index j. Under mild assumptions [30], we show the consistency of the bootstrap distribution at each fixed z. See the proof in Appendix A.
Proposition 2 (Consistency). At each fixed point z of Z, the empirical bootstrap distribution converges to the corresponding conditional distribution, in which X and Y are conditionally independent given Z = z.
The local bootstrap strategy is an extension of [14], which was originally designed for sampling according to a regression model. We extended it using a Nadaraya–Watson kernel estimator [31] to assign the weights to the indexes to be sampled. If z_j is close to z_i, then w_j(z_i) is large; thus, the index j has a larger probability of being sampled. Moreover, x̃_i and ỹ_i are sampled independently. Thus, it is less likely for both x_j and y_j to be sampled simultaneously, which breaks the dependence between X and Y. Shi [14] suggested that the bandwidth should vary for different z_i. Here, we narrowed the candidates from all n indexes to the 10 nearest neighbors of each z_i and let the local bandwidth of z_i be the squared Euclidean distance between z_i and its 10th-nearest neighbor.
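The weight construction and resampling step can be sketched as follows; the helper names and tie-breaking details are ours, while the 10-nearest-neighbor bandwidth rule follows the description above.

```python
import numpy as np

def bootstrap_weights(z, k=10):
    """For each z_i, Gaussian (Nadaraya-Watson-style) weights over its
    k nearest neighbours, with the local bandwidth set to the squared
    Euclidean distance to the k-th nearest neighbour."""
    n = len(z)
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]      # k nearest points (z_i itself included)
        h = max(d2[i, nbrs[-1]], 1e-12)   # squared distance to the k-th neighbour
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * h))
        W[i] /= W[i].sum()
    return W

def local_bootstrap(x, y, z, rng):
    """Resample the x-index and the y-index independently for each i,
    breaking the X-Y dependence while preserving dependence on Z."""
    W = bootstrap_weights(z)
    n = len(z)
    ix = np.array([rng.choice(n, p=W[i]) for i in range(n)])
    iy = np.array([rng.choice(n, p=W[i]) for i in range(n)])
    return x[ix], y[iy]

rng = np.random.default_rng(4)
z = rng.normal(size=(60, 2))
x, y = rng.normal(size=60), rng.normal(size=60)
xb, yb = local_bootstrap(x, y, z, rng)
```

Because each row of W is supported on only 10 neighbors, the resampled x̃_i and ỹ_i stay close to the conditional distributions at z_i.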
The local bootstrap is summarized in Algorithm 1. Each time, we generated n samples as if they came from the null case and calculated the test statistic value T on them. We repeated this K times and calculated the p-value based on the histogram of the resulting values. We reject H0 if the p-value is smaller than a predefined significance level; otherwise, we accept H0. We summarize the overall procedure in Algorithm 2.
Algorithm 1: Local bootstrap
Input: Data {(x_i, y_i, z_i)}_{i=1}^{n}, weights w. Output: New samples {(x̃_i, ỹ_i)}_{i=1}^{n}.
- 1 for i = 1 to n do
- 2 Let w_j = w_j(z_i) for each candidate index j.
- 3 Sample x̃_i and ỹ_i independently according to the weights w_j.
- 4 end
Algorithm 2: Test
Input: Data {(x_i, y_i, z_i)}_{i=1}^{n}. Cluster number M. Times to repeat K. Output: p-value.
- 1 Find M clusters.
- 2 Estimate the test statistic T.
- 3 for k = 1 to K do
- 4 Generate samples with Algorithm 1.
- 5 Estimate T_k on the generated samples.
- 6 end
- 7 Return the p-value, the fraction of {T_k}_{k=1}^{K} that is at least T.
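The last step of Algorithm 2 reduces to comparing the observed statistic with the bootstrap replicates; a common sketch (the add-one smoothing is our choice) is:

```python
import numpy as np

def bootstrap_p_value(t_obs, t_boot):
    """p-value as the (smoothed) fraction of bootstrap statistics
    at least as large as the observed one."""
    t_boot = np.asarray(t_boot, dtype=float)
    return (1 + int(np.sum(t_boot >= t_obs))) / (1 + len(t_boot))

# observed statistic relative to four bootstrap draws
p_reject = bootstrap_p_value(0.9, [0.10, 0.20, 0.15, 0.30])   # right tail: small p-value
p_accept = bootstrap_p_value(0.05, [0.10, 0.20, 0.15, 0.30])  # left tail: large p-value
```

The smoothing keeps the p-value strictly positive, which avoids degenerate rejections when K is small.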
4.2. Effect of M
The choice of the cluster number M affects the test performance. M needs to be chosen properly so that we can focus on pairs that have similar z values while keeping enough pairs in each local cluster for a good estimate of the local HSIC. Besides, M affects the computational complexity: a smaller M makes bigger clusters on average and takes more time for computing the local HSICs. In practice, we fixed the average cluster size n_0 and used k-means to decide the clusters. We let M be ⌈n/n_0⌉, where ⌈x⌉ is the least integer that is not smaller than x and n_0 is the average cluster size.
4.3. Complexity Analysis
We now discuss the time complexity of the test procedure. At the beginning, our method finds the M clusters and the weights w used in the bootstrap; both are calculated once and take little time. The major computational cost lies in repeatedly computing the test statistic T. Estimating T scales no worse than O(M n_max²), where M is the number of clusters and n_max is the maximum cluster size, which is smaller than n. This is repeated K times over the generated samples to construct the histogram of the null distribution, so the test takes O(K M n_max²) in total. The bootstrap part can easily be parallelized to further improve the speed, but this is beyond the scope of this paper.
5. Experiments
In this section, we compare the proposed method with other nonparametric CI tests. We denote our proposed method as a Bundle of HSICs (BHSIC). The evaluation focused on the Type-I error rate, the Type-II error rate, and the runtime; a low Type-II error rate and computational efficiency are essential for a good CI test. In particular, we compared with several representative methods: the KCIT [10], the RCIT and RCoT [11], the CCIT [18], and the CMIknn [12]. For details about these methods, see Section 2. All methods have source code available online. Because the different methods are implemented in different programming languages, we focused on how they scale with the sample size and the dimension of Z rather than on a direct comparison of the runtimes.
We were interested in the performance of the methods under different settings. In our simulations, we considered the following two models. The first was a simple linear regression model. The second was a post-nonlinear noise model, a commonly used setting in evaluating CI tests [10,11,12]. In both models, X and Y are functions of Z plus independent standard Gaussian noise terms; in the nonlinear model, the outputs are further transformed by functions g_1 and g_2. The coefficients and the functions g_1 and g_2 were uniformly chosen from fixed candidate sets. We considered (a) the CI case, where the underlying truth is X ⫫ Y | Z, and (b) the non-CI case, where X and Y remain dependent given Z.
In the following simulations, we studied the test performance for different sample sizes and dimensions of Z. The sample size n varied from 100 to 800 with the dimension of Z fixed, and the dimension of Z varied from 1 to 20 with a fixed sample size. We also studied the effect of the cluster number M in our proposed method. The significance level was set to the same value in all the simulations. The Type-I error rate, Type-II error rate, and mean runtime are reported over 100 replications. The Type-I error rate is the false rejection percentage when the underlying truth is X ⫫ Y | Z, and the Type-II error rate is the false acceptance percentage when the underlying truth is the non-CI case. The runtime is the average time to perform one test.
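For concreteness, here is a sketch of a post-nonlinear generator in the spirit of Model 2; the specific link functions (cos, tanh) and the coefficient range are our illustrative assumptions, since the exact candidate sets are not reproduced here.

```python
import numpy as np

def simulate(n, dz, dependent, rng):
    """Draw (x, y, z) from a post-nonlinear model: under the null,
    X and Y are (nonlinear) functions of Z plus independent standard
    Gaussian noise; under the alternative, Y depends on X directly."""
    z = rng.normal(size=(n, dz))
    a = rng.uniform(0.5, 1.5, size=dz)   # assumed coefficient range
    b = rng.uniform(0.5, 1.5, size=dz)
    ex, ey = rng.normal(size=n), rng.normal(size=n)
    x = np.cos(z @ a + ex)               # assumed post-nonlinear link g1
    y = np.tanh(z @ b + (x if dependent else 0.0) + ey)  # assumed link g2
    return x, y, z

rng = np.random.default_rng(5)
x, y, z = simulate(500, 10, dependent=False, rng=rng)
```

Setting `dependent=True` injects X directly into Y, giving the non-CI case against which the Type-II error rate is measured.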
5.1. Hyperparameters
The choice of the hyperparameters affects the results. For the KCIT, RCIT, and RCoT, the bandwidths of the Gaussian kernels were set to the squared median Euclidean distance between samples, computed using all the pairs (or the first 500 pairs for larger datasets) and scaled by double the conditioning set size, as recommended in [11]. The CMIknn has two hyperparameters: the neighbor size used in the CMI estimator and the neighbor size used in the permutation. The permutation in the CMIknn was repeated 1000 times as the default [12].
In our proposed method, the bandwidths were set to the squared median Euclidean distance between samples within each local cluster. The number of clusters M was set to ⌈n/n_0⌉, where ⌈x⌉ takes the least integer not smaller than x, with the average cluster size n_0 set to 50 or 80 depending on the setting. The local bootstrap was repeated 1000 times.
5.2. When Z Is Low-Dimensional
We first examined the test performance when Z is a single variable generated independently from a standard Gaussian distribution. The sample size n varied from 100 to 800. The simulation results on Linear Model 1 and Nonlinear Model 2 are reported in Figure 2 and Figure 3, respectively. Both the Type-I error rate and the Type-II error rate are reported. Because the runtime is independent of the model, we only report it in Figure 3.
The Linear Model 1 setting with a single conditioning variable Z is very simple. All methods had controlled Type-I error rates around the significance level and almost zero Type-II error rates, except for the CCIT. In our experiments, the performance of the CCIT was consistently among the worst; its data splitting procedure seems to reduce the power of the test when the sample size is small. In the Nonlinear Model 2 setting, all methods had controlled Type-I error rates around the significance level. However, the proposed method and the CMIknn had better power than the others when the sample size n was smaller. This matches the result in [12] that the CMIknn performs well with a low-dimensional conditioning set Z. When the sample size n was larger than 400, most methods had relatively low Type-II error rates. From the runtime plot, the proposed method was less efficient than the KCIT, RCIT, and RCoT, which are based on an asymptotic distribution. Though the BHSIC and CMIknn were slower, their sampling procedures can readily be parallelized.
5.3. When Z Is High-Dimensional
We next examined the test performance when Z was a set of 10 variables, each generated independently from a standard Gaussian distribution. The sample size n varied from 100 to 800. The simulation results on Linear Model 1 and Nonlinear Model 2 are reported in Figure 4 and Figure 5, respectively.
In both the linear and nonlinear settings, the RCIT and RCoT failed and had high Type-I error rates. The RCIT and RCoT approximate the KCIT using random Fourier features and were designed for large-scale datasets; though they are more scalable than the KCIT, their performance was poor when the sample size was relatively small. The KCIT, CMIknn, and BHSIC performed well in the linear model setting. In the nonlinear model setting, the KCIT showed greater Type-I and Type-II error rates because the high-dimensional Z led to a less accurate estimation of the asymptotic distribution. We noticed that the BHSIC showed a higher power than the other methods. As we expected, it was beneficial to avoid evaluating the high-dimensional Z directly, which made the method more robust.
5.4. When the Dimension of Z Changes
Next, we examined the performance when the dimension of Z changes. We fixed the sample size and changed the dimension of Z from 1 to 20 in Nonlinear Model 2. The results are shown in Figure 6. Our proposed method performed well against the growth of the dimension of Z and showed a higher power than the other methods. Moreover, the dimension of Z did not affect the runtime, since Z is used only once, in the k-means step, which coincides with our complexity analysis.
5.5. Effect of M
Now, we study the effect of the cluster number M. We fixed the sample size in Nonlinear Model 2 and changed M from 2 to 20. We examined both the low-dimensional and the high-dimensional cases. The results are shown in Figure 7.
We noticed that the Type-I error rates were controlled in both cases. As the cluster number M grew beyond 10, the Type-II error rate increased, but the runtime decreased. The reason is that when we divide the sample points into more clusters, each cluster has fewer points, so the estimation of each local HSIC value becomes less accurate. On the other hand, the computational cost of the proposed test decreases as the clusters become smaller with a growing cluster number M. The number of samples in each cluster also depends on the choice of the clustering algorithm; in our experiments, we simply used naive k-means.
6. Conclusions
In this paper, we proposed a novel CI test consisting of a new test statistic and a local bootstrap method to generate samples from the null hypothesis. We first performed clustering to avoid directly evaluating the high-dimensional conditioning set Z. Then, we used the clustering result and combined several local dependence measures into a measure of conditional dependence. Consequently, the problems caused by a high-dimensional Z can be suppressed. The experimental results showed that our method is robust and performs well against the growth of the dimension of the conditioning set.