1. Introduction
Representation learning has recently become a core problem in machine learning, especially with the development of deep learning methods. Different from other statistical representation learning approaches, the information bottleneck principle formalizes the extraction of relevant features about Y from X as an information-theoretic optimization problem, $\min_{p(t|x)} f - \beta g$, where $p(t|x)$ amounts to the encoder of the input signal X, f stands for the compression of the representation T with respect to the input X, g stands for the information preserved by T about the output Y, $Y \leftrightarrow X \leftrightarrow T$ forms a Markov chain, and $\beta$ is the tradeoff parameter. The basic idea of the information bottleneck principle is to obtain the information that X provides about Y through a ‘bottleneck’ representation T. The Markov constraint requires that T is a (possibly stochastic) function of X and can only obtain information about Y through X.
Under this principle, various methods have been proposed, such as the information bottleneck (IB) [1], the conditional entropy bottleneck (CEB) [2], the Gaussian IB [3], the multivariate IB [4], the distributed IB [5], the squared IB [6], and the deterministic information bottleneck (DIB) [7]. Almost all previous methods use the mutual information $I(T;Y)$ as the information-preserving function g. As for the compression function f, there are two typical choices, which divide these methods into two groups. The first group uses the mutual information $I(X;T)$, a common measure of representation cost in channel coding, as the compression function. Typical examples include IB, CEB, the Gaussian IB, the multivariate IB, and the squared IB. Please note that CEB uses the conditional mutual information $I(X;T\mid Y)$ as the compression function; however, it has been proven to be equivalent to the mutual information. Similarly, the squared IB uses the square of the mutual information as the compression function; still, we put it into the same category. Instead of the mutual information, the second group, which includes DIB, uses the entropy $H(T)$ as the compression function, another common measure of representation cost, as in source coding. The reasoning is that the entropy is directly related to the quantity to be constrained, such as the number of clusters, and is therefore expected to achieve better compression results.
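To make the two choices of compression function concrete, the following sketch computes $I(X;T)$, $H(T)$, and $I(T;Y)$ for a small discrete joint distribution and a stochastic encoder and then evaluates the IB and DIB objectives; the joint distribution, the encoder, and $\beta$ are arbitrary values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy joint distribution p(x, y) with |X| = 4 and |Y| = 2.
p_xy = rng.random((4, 2))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)                       # p(x)
p_y = p_xy.sum(axis=0)                       # p(y)

# A random stochastic encoder p(t|x) with |T| = 3 bottleneck states.
p_t_given_x = rng.random((4, 3))
p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

p_t = p_x @ p_t_given_x                      # p(t)   = sum_x p(x) p(t|x)
p_ty = p_t_given_x.T @ p_xy                  # p(t,y) = sum_x p(t|x) p(x,y)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_T = entropy(p_t)                                   # compression cost used by DIB
H_T_given_X = np.sum(p_x * np.array([entropy(row) for row in p_t_given_x]))
I_XT = H_T - H_T_given_X                             # compression cost used by IB
I_TY = H_T + entropy(p_y) - entropy(p_ty.ravel())    # preserved information g

beta = 2.0
L_IB = I_XT - beta * I_TY     # information bottleneck objective
L_DIB = H_T - beta * I_TY     # deterministic information bottleneck objective
print(f"I(X;T)={I_XT:.3f}  H(T)={H_T:.3f}  I(T;Y)={I_TY:.3f}")
print(f"L_IB={L_IB:.3f}  L_DIB={L_DIB:.3f}")
```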
IB has been extensively studied over recent years, both in theory and in application. In theory, IB has been proven to enhance generalization [8] and adversarial robustness [9] by providing a generalization bound and an adversarial robustness bound. In application, IB has been successfully applied to evaluating the representations learned by deep neural networks (DNNs) [10,11,12] and to various tasks: geometric clustering, by iteratively solving a set of self-consistent equations to obtain the optimal solution of the IB optimization problem [13,14]; and classification [2,15] and generation [16], by serving as the loss function of a DNN via variational methods. Recent research also shows that merging the IB loss function with other models, such as BERT and graph neural networks, yields better generalization and adversarial robustness [9,17]. Moreover, IB has inspired new methods that improve generalization [18,19,20]. Likewise, DIB has been applied to geometric clustering [21] and has the potential to be used in applications similar to those of IB. Wider-ranging work on IB in deep learning and communication is comprehensively summarized in the surveys [22,23,24].
Still, these theoretical results are only valid when the training and test data are drawn from the same distribution, which is rarely the case in practice. Therefore, it is unclear, especially theoretically, whether IB and DIB are able to learn a representation from a source domain that performs well on a target domain, and it is worth studying which objective function is better. This is exactly the motivation of this paper. To this end, we formulate the problem as transfer learning; based on transfer learning theory [25], the target domain test error can be decomposed into three parts: the source domain training error, the source domain generalization gap (SG), and the representation discrepancy (RD). Without loss of generality, we assume that both IB and DIB are able to achieve small source domain training errors, so our goal becomes calculating SG and RD and comparing the two methods on these terms.
For SG, Shamir et al. [8] have provided an upper bound related to $I(X;T)$, indicating that minimizing $I(X;T)$ leads to better generalization. However, this theory is not applicable for comparing IB and DIB, because DIB minimizes $H(T)$ rather than $I(X;T)$, and it is unclear from that bound whether the additional minimization of $H(T\mid X)$ brings any advantage. Therefore, we need to derive a new bound. The difficulty lies in that the new bound must not only involve both $I(X;T)$ and $H(T)$ for convenient comparison but also be tighter than the previous one. Since the previous bound in Shamir et al. [8] was expressed through a particular function of the variance, we tackle this problem by introducing a different analysis of these two factors. Specifically, for the variance, instead of relating the variance to the $L_1$ distance, then to the KL-divergence, and lastly to the mutual information, as in the previous proof, we bound the variances by functions of expectations and thereby relate them directly to entropy. Furthermore, we prove a tighter bound for the function of the variance itself. Consequently, we obtain a tighter generalization bound, suggesting that minimizing $H(T)$ is better than minimizing $I(X;T)$. Therefore, our results indicate that DIB may generalize better than IB in the source domain.
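The comparison between the two regularizers rests on the chain-rule identity below (a standard information-theoretic fact): the DIB compression term dominates the IB compression term, so minimizing $H(T)$ also penalizes the conditional entropy $H(T\mid X)$ that IB leaves unconstrained.
$$\underbrace{H(T)}_{\text{minimized by DIB}} \;=\; \underbrace{I(X;T)}_{\text{minimized by IB}} \;+\; \underbrace{H(T\mid X)}_{\ge 0 \text{ for discrete } T}.$$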
As for RD, it is measured by the $\mathcal{H}\Delta\mathcal{H}$-distance, as in Ben-David et al. [26]. However, this term is difficult to compute because $\mathcal{H}$ is the hypothesis space of classifiers, which varies across models. Inspired by the fact that the IB and DIB solutions mainly differ in the variance of their representations, we assume that the data are generated from a Gaussian distribution. We then define a pair-wise $L_1$ distance that bounds the $\mathcal{H}\Delta\mathcal{H}$-distance and relates RD to the variance of the representations. Specifically, IB's representations have larger randomness and thus yield a smaller RD. Moreover, the closer the two domains are, the greater the difference between IB and DIB on RD.
From the above theoretical findings, we conclude that there exists a better objective function under the IB principle. However, how to obtain the optimal objective function remains a challenging problem. Inspired by the trade-off between SG and RD, we propose an elastic information bottleneck (EIB) that interpolates between IB and DIB, i.e., a convex combination of the IB and DIB objective functions (Definition 1). We can see that EIB includes IB and DIB as special cases. In addition, we provide a variational method to optimize EIB with DNNs. We conduct both simulations and real-data experiments in this paper. Our results show that EIB adapts more flexibly to different kinds of data and achieves better accuracy than IB and DIB on classification tasks. We also provide an example of combining EIB with an existing domain adaptation algorithm by substituting the EIB objective function for the cross-entropy loss, which suggests a promising application of EIB. Our contributions are summarized as follows:
We derive a transfer learning theory for IB frameworks and find a trade-off between SG and RD. Consequently, we propose EIB, a novel representation learning method for better transfer learning results under the IB principle, which is flexible enough to suit different kinds of data and can be merged into domain adaptation algorithms as a regularizer.
In the study of SG, we provide a tighter SG upper bound, which serves as a theoretical guarantee for DIB and further develops the generalization theory of the IB framework.
Comprehensive simulations and experiments validate our theoretical results and demonstrate that EIB outperforms IB and DIB in many cases.
2. Problem Formulation
In domain adaptation, a scenario of transductive transfer learning, we have labeled training data from a source domain, and we wish to learn an algorithm on these training data that performs well on a target domain [27]. The learning task is the same on the two domains, but the population distributions of the source domain and the target domain are different. Specifically, the source and the target instances come from the same instance space $\mathcal{X}$, and the same instance corresponds to the same label, even if it lies in different domains. However, the population instance distributions on the source and the target domain are different, denoted as $p(x)$ and $q(x)$, respectively. (We use “instance” (X) to distinguish from “label” (Y) and “example” (X, Y) and use “population distribution” (p or q) to distinguish from “empirical distribution” ($\hat{p}$ or $\hat{q}$).) We assume that there exists a generic feature space that can be utilized to transfer knowledge between different domains. That is to say, both the IB and DIB methods involve an encoder with a transition probability $p(t|x)$ that converts an instance $x \in \mathcal{X}$ into an intermediate feature $t \in \mathcal{T}$. The induced marginal feature distributions are then denoted as $p_T$ and $q_T$ for the source and target domains, respectively. The IB and DIB methods also have a decoder or classifier $h: \mathcal{T} \to \mathcal{Y}$ that maps from the feature space $\mathcal{T}$ to the label space $\mathcal{Y}$. Here, $p(y|x)$ denotes the ground-truth labeling function, and $p(y|t)$ is the labeling function induced by $p(y|x)$ and $p(t|x)$. In classification scenarios, deterministic labels are commonly used. Therefore, we define the deterministic ground-truth-induced labeling function as $f(t) = \arg\max_{y} p(y|t)$. If the maximum probability is not unique, i.e., the argmax set contains more than one label, we randomly choose one of its elements as the output. With the above definitions, the expected error on the source domain can be written as $\epsilon_s(h) = \mathbb{E}_{t \sim p_T}\left[\mathbb{1}\left(h(t) \neq f_s(t)\right)\right]$, where $\mathbb{1}(\cdot)$ is the indicator function and $f_s$ is the induced labeling function on the source domain. Similarly, the expected error on the target domain is $\epsilon_t(h) = \mathbb{E}_{t \sim q_T}\left[\mathbb{1}\left(h(t) \neq f_t(t)\right)\right]$. Then, our problem can be formulated as follows: when the IB and DIB methods are trained on the source domain, which method achieves a lower target domain expected error $\epsilon_t(h)$?
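To make these definitions concrete, the following sketch computes the deterministic induced labeling function and the expected error of a classifier for a hypothetical discrete feature space; the distributions and the classifier are random placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_t, n_y = 5, 3

p_T = rng.random(n_t)
p_T /= p_T.sum()                                        # marginal feature distribution, e.g., the source p_T
p_y_given_t = rng.random((n_t, n_y))
p_y_given_t /= p_y_given_t.sum(axis=1, keepdims=True)   # induced labeling distribution p(y|t)

# Deterministic ground-truth-induced labeling function: f(t) = argmax_y p(y|t).
f = p_y_given_t.argmax(axis=1)

# A hypothetical classifier h: T -> Y.
h = rng.integers(0, n_y, size=n_t)

# Expected error: E_{t ~ p_T}[ 1(h(t) != f(t)) ].
expected_error = np.sum(p_T * (h != f))
print(f"expected error = {expected_error:.3f}")
```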
According to previous work [26], the target domain error can be decomposed into three parts, as shown in the following theorem. A detailed proof is provided in Appendix A.1.
Theorem 1 (Target Error Decomposition). Suppose that $h^*$ is the classifier in $\mathcal{H}$ that minimizes the sum of the expected errors on the two domains, i.e., $h^* = \arg\min_{h \in \mathcal{H}} \left(\epsilon_s(h) + \epsilon_t(h)\right)$, and let us denote $\lambda = \epsilon_s(h^*) + \epsilon_t(h^*)$. Then, for a classifier h, we have
$$\epsilon_t(h) \;\le\; \hat{\epsilon}_s(h) + \mathrm{SG} + \mathrm{RD} + \lambda,$$
where $\hat{\epsilon}_s(h)$ is the empirical source error; SG is the source generalization error gap, i.e., $\mathrm{SG} = \left|\epsilon_s(h) - \hat{\epsilon}_s(h)\right|$; and RD is the representation discrepancy defined via the $\mathcal{H}\Delta\mathcal{H}$-distance, i.e., $\mathrm{RD} = \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(p_T, q_T)$. Please note that we assume that the ground-truth labeling function on the two domains remains the same, which is a common assumption used in many previous studies. If there is a conditional shift, i.e., a nonzero distance between $f_s$ and $f_t$, where $f_s$ and $f_t$ are the deterministic ground-truth-induced labeling functions on the source and target domains, respectively, as defined in Zhao et al. [28], the above decomposition is no longer valid. The assumption is reasonable, since the conditional shift phenomenon is not observed in our experiments.
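For orientation, the decomposition can be obtained by a standard argument in the style of Ben-David et al. [26]; the precise statement and conditions are those of Appendix A.1, and the following outline only records the main steps. Writing $\epsilon_s(h, h')$ and $\epsilon_t(h, h')$ for the expected disagreement between two classifiers under $p_T$ and $q_T$, respectively, and using the triangle inequality for classification error together with the definition of the $\mathcal{H}\Delta\mathcal{H}$-divergence,
$$\begin{aligned}
\epsilon_t(h) &\le \epsilon_t(h^*) + \epsilon_t(h, h^*)\\
&\le \epsilon_t(h^*) + \epsilon_s(h, h^*) + \left|\epsilon_t(h, h^*) - \epsilon_s(h, h^*)\right|\\
&\le \epsilon_t(h^*) + \epsilon_s(h) + \epsilon_s(h^*) + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(p_T, q_T)\\
&\le \hat{\epsilon}_s(h) + \underbrace{\left|\epsilon_s(h) - \hat{\epsilon}_s(h)\right|}_{\mathrm{SG}} + \underbrace{\tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(p_T, q_T)}_{\mathrm{RD}} + \underbrace{\epsilon_s(h^*) + \epsilon_t(h^*)}_{\lambda}.
\end{aligned}$$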
According to the above theorem, to achieve a low target expected error we need to minimize three terms: the training error on the source domain, the source generalization gap (SG), and the representation discrepancy (RD) between the marginal feature distributions of the two domains. We assume that both the IB and DIB methods are able to achieve comparably small source training errors, so we focus on SG and RD to compare the two methods.
3. Main Results
3.1. Source Generalization Study of IB and DIB
In this subsection, we study the generalization of IB and DIB on the source domain. The generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit the training data. The generalization error gap in statistical learning theory is defined as the expected difference between the population risk and the empirical risk. Russo and Zou [29] and Xu and Raginsky [30] provided a generalization upper bound in terms of the mutual information between the input and output. Sefidgaran et al. [31,32] further related this mutual information to two other approaches to studying the generalization error, i.e., compressibility and fractal dimensions, providing a unifying framework for the three directions of study.
In the theoretical study of the generalization and adversarial robustness of IB, however, the measure used is not exactly the same, for the convenience of illustrating the characteristics of IB [8,9]. Specifically, in Section 4.1 of Shamir et al. [8], the classification error is proven to be exponentially upper bounded in terms of $I(T;Y)$, indicating that $I(T;Y)$ is a measure of performance. Therefore, the difference between the population performance $I(T;Y)$ and the empirical performance $\hat{I}(T;Y)$ is used to measure the ability to generalize. Shamir et al. [8] provided the following generalization upper bound, which is relevant to the mutual information $I(X;T)$.
Theorem 2 (The previous generalization bound). Denote the empirical estimates on the sample by a hat (e.g., $\hat{I}(X;T)$ and $\hat{I}(T;Y)$). For any probability distribution p(x, y), with a probability of at least $1-\delta$ over the draw of the sample of size m from p(x, y), we have that for all T, the gap $\left|I(T;Y) - \hat{I}(T;Y)\right|$ is upper bounded by an expression that grows with $\hat{I}(X;T)$ and whose constants depend only on m, δ, and distribution-dependent quantities (where any minimum over probabilities refers to the minimum value other than 0). However, this upper bound cannot be directly used to compare IB and DIB, because both IB and DIB minimize $I(X;T)$. Moreover, the regularizer of DIB, $H(T)$, further minimizes the term $H(T\mid X)$; however, it is not clear from the previous theoretical result whether this brings any advantage. To tackle this problem, we prove a new generalization bound in this paper, as shown in the following theorem.
Lemma 1. If , , then .
Lemma 2. If , , then
Theorem 3 (Our generalization bound).
For any probability distribution p(x, y), with a probability of at least $1-\delta$ over the draw of the sample of size m from p(x, y), we have that for all T, the generalization gap is upper bounded by an expression that depends on entropy terms of the representation rather than on the mutual information $I(X;T)$, with constants that depend only on δ and distribution-dependent quantities. Proof. Here, we show a proof sketch to convey how our bound differs from the previous one; the complete proof is in the appendix. Similar to the previous proof, SG is first divided into three parts.
Denoting these three parts appropriately, we can summarize the previous proof and our proof in Figure 1. The main idea of the previous proof is to bound these parts by functions of the sample variances in (5) and (9), with the relevant quantities defined by (A1). The variances are then related to the $L_1$ distances in (6) and (10); to the KL-divergence in (7); and, lastly, to the mutual information in (8) and (11). We also use the idea of bounding by functions of variances, as shown in (12), (15), and (18). However, the variance we use is the exact (population) variance instead of the previous estimate on the samples. Consequently, we can obtain a tighter bound. Specifically, we first use Lemma 1 to convert the variance into an expectation in (13), (16), and (19). Then, with the help of Lemma 2, we turn these expressions into entropies, as in the step from (13) to (14). With (14), (17), and (20), we finish the proof of Theorem 3. □
We can see that this bound is similar in form to the previous one, but it depends on entropy terms of the representation rather than on the mutual information $I(X;T)$; in particular, it is directly dependent on the entropy $H(T)$, i.e., the regularizer of DIB. That is to say, compressing the entropy of the representation is beneficial for reducing SG.
Now, we illustrate why DIB generalizes better. Intuitively, since DIB directly minimizes $H(T)$ while IB only minimizes a part of it, i.e., $I(X;T)$, DIB may generalize better than IB. Owing to the lack of closed-form solutions, we cannot explicitly see the difference between IB and DIB with respect to the relevant information quantities. However, we can solve the IB and DIB problems through a self-consistent iterative algorithm, as described in Appendix B.1. The IB and DIB solutions are shown on the information plane in Appendix C, showing that the IB solutions have a larger $H(T)$ than the DIB solutions; thus, DIB generalizes better in the sense of Theorem 3 of our paper.
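The following sketch shows one way such a self-consistent iteration can be implemented for a discrete joint distribution, using a generalized update of the form introduced in [7]: $\alpha = 1$ gives the IB update, and a small $\alpha$ approximates the deterministic DIB assignment. The joint distribution, $\beta$, and $|\mathcal{T}|$ are arbitrary, and the code is an illustrative sketch under these assumptions rather than the exact procedure of Appendix B.1.

```python
import numpy as np

def bottleneck_iterate(p_xy, n_t, beta, alpha, n_iter=200, seed=0):
    """Self-consistent iteration for L = H(T) - alpha*H(T|X) - beta*I(T;Y) [7].
    alpha = 1 recovers IB; alpha -> 0 approaches the deterministic DIB solution."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    q_t_given_x = rng.random((len(p_x), n_t))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t = p_x @ q_t_given_x                     # q(t)
        q_ty = q_t_given_x.T @ p_xy                 # q(t, y)
        q_y_given_t = q_ty / np.maximum(q_ty.sum(axis=1, keepdims=True), 1e-12)

        # KL(p(y|x) || q(y|t)) for every (x, t) pair.
        kl = np.einsum('xy,xty->xt', p_y_given_x,
                       np.log(np.maximum(p_y_given_x[:, None, :], 1e-12))
                       - np.log(np.maximum(q_y_given_t[None, :, :], 1e-12)))

        # Generalized update: q(t|x) proportional to exp((log q(t) - beta*KL) / alpha).
        logits = (np.log(np.maximum(q_t, 1e-12))[None, :] - beta * kl) / alpha
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x

rng = np.random.default_rng(1)
p_xy = rng.random((6, 3))
p_xy /= p_xy.sum()                                                 # arbitrary toy joint distribution
q_ib = bottleneck_iterate(p_xy, n_t=4, beta=5.0, alpha=1.0)        # IB: soft assignments
q_dib = bottleneck_iterate(p_xy, n_t=4, beta=5.0, alpha=0.05)      # DIB-like: nearly deterministic
print(np.round(q_ib, 2))
print(np.round(q_dib, 2))
```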
Furthermore, our bound is tighter than the previous one. We provide a comparison of the orders in Appendix A.2, which shows that the IB bound is larger than the DIB bound by explicit multiplicative factors, both for the first two terms and for the third term. There is also an empirical comparison in Section 4.1.2. Moreover, Theorems 2 and 3 require some constraints on the sample size; these are also summarized in Appendix A.2. The empirical results show that our bound requires a smaller sample size.
To sum up, we provide a tighter generalization bound in the IB framework, which serves as theoretical support for DIB's generalization performance. Experiments on MNIST validate this theoretical result in Section 4.2.1.
3.2. Representation Discrepancy of IB and DIB
In this section, we compare the RD of IB and DIB. According to the target error decomposition theorem (Theorem 1), RD is measured by the $\mathcal{H}\Delta\mathcal{H}$-distance, which is difficult to compute directly because of the complex hypothesis space $\mathcal{H}$. To remove the dependence on $\mathcal{H}$, we propose to bound RD by a pair-wise $L_1$ distance.
To start with the simplest case, assume that the sample sizes on the source and target domains are both m and that the samples on the two domains have a one-to-one correspondence (i.e., they are semantically close to each other). Then, RD can be bounded by the following pair-wise distance.
Proposition 1. The $L_1$ distance between the overall representation distributions on the source and target domains is bounded by the pair-wise $L_1$ distances between the representations of corresponding individual instances, plus a residual term that is small when the sample size is large; the pairs range over the set of all correspondence pairs. In fact, Proposition 1 is valid for any pairing, so there can be different upper bounds. To obtain the lowest upper bound, we define the correspondent pairs as the pairing for which this bound is lowest.
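As a small illustration of how the lowest pair-wise upper bound can be obtained, the following sketch pairs hypothetical one-dimensional Gaussian representations of source and target instances so that the average pair-wise $L_1$ distance is minimized; the shared standard deviation makes the closed form derived later in this section exact, and all values are arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import norm

rng = np.random.default_rng(2)
m, sigma = 8, 0.5

# Hypothetical 1-D representation means for m source and m target instances.
mu_src = rng.normal(0.0, 1.0, size=m)
mu_tgt = rng.normal(0.3, 1.0, size=m)

def l1_gauss(mu1, mu2, sigma):
    # L1 distance between N(mu1, sigma^2) and N(mu2, sigma^2):
    # 2 * (2 * Phi(|mu1 - mu2| / (2 * sigma)) - 1).
    return 2.0 * (2.0 * norm.cdf(np.abs(mu1 - mu2) / (2.0 * sigma)) - 1.0)

cost = l1_gauss(mu_src[:, None], mu_tgt[None, :], sigma)   # all pair-wise distances
rows, cols = linear_sum_assignment(cost)                   # pairing with the lowest total distance
print(f"lowest average pair-wise L1 distance = {cost[rows, cols].mean():.3f}")
```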
We parameterize $p(t|x)$ as a d-dimensional Gaussian with a diagonal covariance matrix, as is usually assumed in variational methods such as that of Alemi et al. [15]. Compared with IB, DIB additionally reduces $H(T\mid X)$, so its encodings are almost deterministic. Therefore, the variances of DIB's representations in each dimension are significantly smaller than those of IB's representations. On the other hand, since the only difference between IB and DIB lies in the compression function, and the entropy of a Gaussian random variable depends only on its variance, the expectations of the representations of IB and DIB are comparable. Therefore, we need to find out how the variances affect RD.
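The claim that the entropy of a Gaussian depends only on its variance refers to the differential entropy of a diagonal-covariance Gaussian (a standard formula):
$$h\!\left(\mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2)\right)\right) \;=\; \sum_{k=1}^{d} \tfrac{1}{2}\log\!\left(2\pi e\, \sigma_k^2\right),$$
which is independent of the mean $\mu$; an entropy-based compression penalty therefore shrinks the variances while leaving the expectations essentially unconstrained.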
Consider the representations of instances from the source and target domains in the same model. For a correspondence pair of instances, since the two instances are semantically close, the expectations of their representations are close. Additionally, the discrepancy between their variances is small compared with the discrepancy in representation variances between IB and DIB, so we can neglect it. Therefore, in a given dimension, denote the expectations of the two representations by $\mu_s$ and $\mu_t$ and their common standard deviation by $\sigma$; then,
$$\int \left| \mathcal{N}\!\left(t; \mu_s, \sigma^2\right) - \mathcal{N}\!\left(t; \mu_t, \sigma^2\right) \right| dt \;=\; 2\left( 2\Phi\!\left( \frac{|\mu_s - \mu_t|}{2\sigma} \right) - 1 \right),$$
where $\Phi$ is the cumulative distribution function of a standard Gaussian distribution.
As discussed above, compared with IB, DIB has significantly smaller variances in its representations, while the expectations are comparable for IB and DIB. Therefore, the term $\frac{|\mu_s - \mu_t|}{2\sigma}$ for IB is remarkably smaller than that for DIB. Since $\Phi$ is monotonically increasing, the pair-wise $L_1$ distance for IB is smaller than that for DIB.
Figure 2 provides an intuitive understanding of how randomness helps reduce RD. Suppose that the blue and red lines are the densities of the representations $\mathcal{N}(\mu_s, \sigma^2)$ and $\mathcal{N}(\mu_t, \sigma^2)$, where $\mu_s \neq \mu_t$. We can see clearly that their $L_1$ distance drops as the variance grows. Moreover, because the derivative of $\Phi$ monotonically decreases on $[0, +\infty)$, when $|\mu_s - \mu_t|$ is smaller, the difference between IB and DIB becomes larger. This phenomenon is also found in the simulations; see Section 4.1.1 and Section 4.1.3.
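The following short sketch reproduces this behavior numerically: for a fixed gap between the means, the pair-wise $L_1$ distance $2\bigl(2\Phi(|\mu_s-\mu_t|/(2\sigma)) - 1\bigr)$ shrinks as the shared standard deviation $\sigma$ grows; the chosen values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

delta = 1.0                      # fixed gap |mu_s - mu_t| between the two means
for sigma in (0.25, 0.5, 1.0, 2.0, 4.0):
    l1 = 2.0 * (2.0 * norm.cdf(delta / (2.0 * sigma)) - 1.0)
    print(f"sigma = {sigma:4.2f}  ->  L1 distance = {l1:.3f}")
```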
When the sample sizes on the two domains are different and the “pair-wise” correspondence does not hold, we can take correlated instances in the two domains as a general form of correspondent pairs; the above comparison result then remains valid. Details are given in Appendix A.3.
Please note that the assumption of correspondent pairs is reasonable. In transfer learning, it is widely assumed that there exist shared (high-level) features and common distributions of representations on the two domains. The correspondent pairs can be viewed as the instance pairs with the closest distributions of representations from the two domains. When the distributions are distinct, feature alignment is usually implemented in practice.
According to the above theoretical results, we can obtain the following comparison between IB and DIB with respect to the three terms of the target error decomposition, as illustrated in Table 1. Clearly, there is a trade-off between IB and DIB in terms of SG and RD.
3.3. Elastic Information Bottleneck Method
From the previous results, IB and DIB still have room for improvement on SG and RD, respectively. To obtain a Pareto optimal solution under the IB principle, we propose a new bottleneck method, namely the elastic information bottleneck (EIB). Please note that the generalized IB objective function in [7] has the same form as the objective function of the elastic IB, but the two are derived from different perspectives: the generalized IB was constructed for the convenience of solving the DIB problem, while EIB is proposed to balance SG and RD.
Definition 1 (Elastic Information Bottleneck). The objective function of the elastic information bottleneck method is as follows:
$$\mathcal{L}_{\mathrm{EIB}} \;=\; \alpha I(X;T) + (1-\alpha) H(T) - \beta I(T;Y), \qquad \alpha \in [0, 1],$$
minimized over the encoder $p(t|x)$. We can see that EIB is a linear combination of IB and DIB, which covers IB and DIB as special cases: EIB reduces to IB when $\alpha = 1$ and to DIB when $\alpha = 0$. Since each of IB and DIB performs better than the other on one of SG and RD, a linear combination may lead to better target performance. In fact, the global optimal solution of a linear combination of the objectives lies in the Pareto optimal solution set. Therefore, by adjusting $\alpha$ in $[0, 1]$, we can obtain a balance between SG and RD and achieve a Pareto frontier within the IB framework.
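The coincidence with the generalized IB objective of [7] noted above follows from the chain rule $I(X;T) = H(T) - H(T\mid X)$ applied to the convex combination of $\mathcal{L}_{\mathrm{IB}} = I(X;T) - \beta I(T;Y)$ and $\mathcal{L}_{\mathrm{DIB}} = H(T) - \beta I(T;Y)$:
$$\alpha\,\mathcal{L}_{\mathrm{IB}} + (1-\alpha)\,\mathcal{L}_{\mathrm{DIB}} \;=\; \alpha I(X;T) + (1-\alpha) H(T) - \beta I(T;Y) \;=\; H(T) - \alpha H(T\mid X) - \beta I(T;Y).$$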
Since EIB is a bottleneck method similar to IB, its optimal solution can be calculated by iterative algorithms when the joint distribution of instances and labels is known. The algorithm and EIB's optimal solutions on information planes are provided in Appendix B. However, the iterative methods are intractable when the data are numerous and high-dimensional. Therefore, we use a variational approach, similar to the method in Fischer [2], to optimize the IB, DIB, and EIB objective functions with neural networks. Assume that the representation is a K-dimensional Gaussian with independent dimensions. The network contains an encoder, a decoder, and a backward encoder. Specifically, the encoder is an MLP that outputs the K-dimensional mean and the K diagonal elements of the covariance matrix and then yields the Gaussian representation via the reparameterization trick [15]. The decoder is a simple logistic regression model. The backward encoder is an MLP that takes the one-hot encoding of the classification outcome as input and outputs a K-dimensional mean and the K diagonal elements of a covariance matrix. The loss function of variational EIB combines the cross-entropy prediction loss with the variational compression terms weighted by β and interpolated by α.
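Because the loss is described above only in outline, the following PyTorch sketch gives one plausible instantiation under the stated parameterization: the encoder outputs a diagonal Gaussian, the decoder is a logistic-regression head, and the backward encoder maps a one-hot label to a diagonal Gaussian; the compression term combines a CEB-style KL term between the encoder and the backward encoder with an additional penalty, weighted by $1-\alpha$, on the encoder's differential entropy for the DIB-like part. This particular combination is an illustrative assumption rather than the definitive variational loss.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalEIB(nn.Module):
    """Illustrative variational EIB-style network (assumed loss form; see text)."""
    def __init__(self, x_dim, y_dim, k_dim, alpha, beta):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * k_dim))   # mean and log-variance
        self.decoder = nn.Linear(k_dim, y_dim)                    # logistic-regression classifier
        self.backward_encoder = nn.Sequential(nn.Linear(y_dim, 128), nn.ReLU(),
                                              nn.Linear(128, 2 * k_dim))

    def forward(self, x, y):
        mu_e, logvar_e = self.encoder(x).chunk(2, dim=-1)
        t = mu_e + torch.exp(0.5 * logvar_e) * torch.randn_like(mu_e)  # reparameterization trick
        logits = self.decoder(t)

        # Backward encoder conditioned on the one-hot label (the "classification outcome"
        # could alternatively be read as the predicted class).
        mu_b, logvar_b = self.backward_encoder(
            F.one_hot(y, logits.size(-1)).float()).chunk(2, dim=-1)

        # KL( N(mu_e, var_e) || N(mu_b, var_b) ), summed over the K dimensions (CEB/IB-style term).
        kl = 0.5 * (logvar_b - logvar_e
                    + (torch.exp(logvar_e) + (mu_e - mu_b) ** 2) / torch.exp(logvar_b)
                    - 1.0).sum(-1)

        # Differential entropy of the encoder Gaussian, 0.5 * sum_k log(2*pi*e*var_k)
        # (assumed DIB-style penalty on the stochasticity of the representation).
        h_t_given_x = 0.5 * (logvar_e + math.log(2 * math.pi) + 1.0).sum(-1)

        ce = F.cross_entropy(logits, y, reduction='none')
        loss = ce + self.beta * (kl + (1.0 - self.alpha) * h_t_given_x)
        return loss.mean()

# Usage on a random batch with hypothetical dimensions.
model = VariationalEIB(x_dim=20, y_dim=4, k_dim=8, alpha=0.7, beta=1e-2)
x = torch.randn(32, 20)
y = torch.randint(0, 4, (32,))
loss = model(x, y)
loss.backward()
```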
Our simulations and real data experiments show that EIB outperforms IB and DIB. It is also worth noticing that EIB can be plugged into previous transfer learning algorithms to expand its application. An example is given in the next section.