Common Information Components Analysis
Abstract
1. Introduction
1.1. Related Work
1.2. Contributions
- We introduce a novel suite of algorithms, referred to as CICA. These algorithms are characterized by a two-step procedure. In the first step, a relaxation of Wyner’s common information is extracted. The second step can be interpreted as a form of projection of the common information back onto the original data so as to obtain the respective features. A free parameter is introduced to control the complexity of the extracted features.
- We establish that, for the special case where the original data are jointly Gaussian, our algorithms precisely extract the CCA components. In this case, the parameter determines how many of the CCA components are extracted. In this sense, we establish a new rigorous connection between information measures and CCA.
- We present initial results on how to extend CICA to more than two variates.
- Via a number of paradigmatic examples, we illustrate that, for discrete data, CICA gives intuitively pleasing results while other methods, including CCA, do not. This is most pronounced in a simple example with three sources described in Section 7.1.
1.3. Notation
1.4. A Simple Example with Synthetic Data
2. Wyner’s Common Information and Its Relaxation
2.1. Wyner’s Common Information
2.2. A Natural Relaxation of Wyner’s Common Information
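The properties listed below concern the relaxed Wyner’s common information. For reference, following the authors’ earlier ITW 2019 formulation of the relaxed Wyner’s common information (which we assume coincides with the definition used here), it is the following constrained optimization over test channels $p(w \mid x,y)$:

```latex
% Relaxed Wyner's common information, with relaxation level gamma >= 0
% (restated under the assumption that it matches the paper's definition).
C_\gamma(X;Y) \;=\; \min_{p(w \mid x,y)\,:\; I(X;Y \mid W) \le \gamma} \; I(X,Y;W)
```

For $\gamma = 0$, the constraint forces $X$ and $Y$ to be conditionally independent given $W$, recovering Wyner’s original common information; for $\gamma \ge I(X;Y)$, a constant $W$ is feasible and the quantity is zero.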
1. $C_\gamma(X;Y) \ge \max\{I(X;Y) - \gamma,\, 0\}$.
2. Data processing inequality: If $X - Y - Z$ forms a Markov chain, then $C_\gamma(X;Z) \le \min\{C_\gamma(X;Y),\, C_\gamma(Y;Z)\}$.
3. $C_\gamma(X;Y)$ is a convex and continuous function of $\gamma$ for $\gamma \ge 0$.
4. Tensorization: For $n$ independent pairs $\{(X_i, Y_i)\}_{i=1}^n$, we have that $C_\gamma(X^n;Y^n) = \min \sum_{i=1}^n C_{\gamma_i}(X_i;Y_i)$, where the min is over all non-negative $\{\gamma_i\}_{i=1}^n$ satisfying $\sum_{i=1}^n \gamma_i = \gamma$.
5. If … forms a Markov chain, then …
6. The cardinality of $\mathcal{W}$ may be restricted to $|\mathcal{W}| \le |\mathcal{X}| \cdot |\mathcal{Y}| + 1$.
7. If $f(\cdot)$ and $g(\cdot)$ are one-to-one functions, then $C_\gamma(f(X); g(Y)) = C_\gamma(X;Y)$.
8. For discrete $X$ and $Y$, we have $C_\gamma(X;Y) \le \min\{H(X), H(Y)\}$.
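As a quick verification of the entropy bound in the last item (our check, not an excerpt from the appendix proofs): the choice $W = X$ is feasible for every $\gamma \ge 0$ and immediately gives the bound.

```latex
% With the choice W = X, the constraint and the objective of the relaxed problem become
I(X;Y \mid W) = I(X;Y \mid X) = 0 \le \gamma ,
\qquad
I(X,Y;W) = I(X,Y;X) = H(X) ,
% hence C_gamma(X;Y) <= H(X); choosing W = Y gives C_gamma(X;Y) <= H(Y).
```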
2.3. The Non-Convexity of the Relaxed Wyner’s Common Information Problem
2.4. The Operational Significance of the Relaxed Wyner’s Common Information Problem
3. The Algorithm
3.1. High-Level Description
3.2. Main Steps of the Algorithm
1. Select a real number $\gamma$, where $0 \le \gamma \le I(X;Y)$. This is the compression level: A low value of $\gamma$ represents low compression, and, thus, many components are retained. A high value of $\gamma$ represents high compression, and, thus, only a small number of components are retained.
2. Solve the relaxed Wyner’s common information problem, $\min_{p(w|x,y)\,:\,I(X;Y|W) \le \gamma} I(X,Y;W)$, leading to an associated conditional distribution $p(w|x,y)$.
3. Using the conditional distribution found in Step 2, the dimension-reduced data sets can now be found via one of the following three variants (a numerical sketch of this step is given after the list):
   (a) Version 1: MAP (maximum a posteriori);
   (b) Version 2: Conditional Expectation;
   (c) Version 3: Marginal Integration.
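The following is a minimal numerical sketch of Step 3, not the paper’s reference implementation: given the joint distribution $p(x,y)$ and a minimizing conditional distribution $p(w|x,y)$ from Step 2, it forms features separately from $x$ and from $y$. In particular, basing Version 1 on $\arg\max_w p(w|x)$ (respectively $\arg\max_w p(w|y)$) and Version 2 on $\mathbb{E}[W|X=x]$ (respectively $\mathbb{E}[W|Y=y]$) is our reading of the two variants; the function and argument names are ours, and Version 3 is omitted since its defining formula is not reproduced above.

```python
import numpy as np

def cica_features(pw_given_xy, pxy, w_values=None):
    """Illustrative Step-3 features of W, extracted separately from X and from Y.

    pw_given_xy : (nx, ny, nw) array, the conditional distribution p(w | x, y) from Step 2.
    pxy         : (nx, ny) array, the joint distribution p(x, y).
    w_values    : optional numeric embedding of the alphabet of W (needed for Version 2).
    """
    pxyw = pxy[:, :, None] * pw_given_xy                                # joint p(x, y, w)
    pw_given_x = pxyw.sum(axis=1) / pxy.sum(axis=1, keepdims=True)      # p(w | x), shape (nx, nw)
    pw_given_y = pxyw.sum(axis=0) / pxy.sum(axis=0)[:, None]            # p(w | y), shape (ny, nw)
    features = {
        # Version 1 (MAP, as we read it): the most likely w given x alone, and given y alone.
        "map_x": pw_given_x.argmax(axis=1),
        "map_y": pw_given_y.argmax(axis=1),
    }
    if w_values is not None:
        # Version 2 (conditional expectation): requires embedding W into the real numbers.
        w = np.asarray(w_values, dtype=float)
        features["cond_exp_x"] = pw_given_x @ w
        features["cond_exp_y"] = pw_given_y @ w
    return features
```

For Version 2, a numeric embedding of the alphabet of $W$ must be supplied, mirroring the remark in Section 5 that the conditional-expectation variants require embedding discrete alphabets into the real numbers.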
4. For Gaussian, CICA Is CCA
4.1. Wyner’s Common Information and Its Relaxation in the Gaussian Case
4.2. CICA in the Gaussian Case and the Exact Connection with CCA
1. The top $k$ CCA components are a solution to all three versions of Generic Procedure 1 (the standard computation of these components is sketched after this list).
2. The parameter $\gamma$ controls the number $k$ as follows: …, where …
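Since the statement above identifies the CICA output in the Gaussian case with the top-$k$ CCA components, we include, for illustration only, the textbook computation of those components via an SVD of the whitened cross-covariance matrix (the formulation reviewed in Appendix B). This is not the paper’s derivation; the function and variable names are ours, and the marginal covariance matrices are assumed invertible.

```python
import numpy as np

def cca_components(x, y, k):
    """Top-k canonical directions and correlations for data matrices x (n x dx), y (n x dy)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    cxx, cyy, cxy = x.T @ x / n, y.T @ y / n, x.T @ y / n

    def inv_sqrt(c):
        # Inverse matrix square root of a symmetric positive-definite covariance.
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    u, s, vt = np.linalg.svd(wx @ cxy @ wy)     # SVD of the whitened cross-covariance
    a = wx @ u[:, :k]                           # projection directions for X
    b = wy @ vt[:k, :].T                        # projection directions for Y
    return a, b, s[:k]                          # s[:k] are the top-k canonical correlations
```

The singular values returned in `s` are the canonical correlations $\rho_1 \ge \rho_2 \ge \cdots$; the number $k$ retained for a given $\gamma$ is naturally expressed in terms of these correlations, although the exact threshold formula is not reproduced here.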
5. A Binary Example
Let us first analyze the behavior and outcome of CCA in this particular example. The key observation is that any pair amongst the four entries of these two vectors consists of (pairwise) independent binary uniform random variables. Hence, the overall covariance matrix of the merged random vector is merely a scaled identity matrix. This, in turn, implies that CCA as described in Equations (34) and (35) merely boils down to the identity mapping. Concretely, this means that, for CCA, in this example, the best one-dimensional projections are, ex aequo, any pair of one coordinate of the vector $X$ with one coordinate of the vector $Y$. As we have already explained above, any such pair is merely a pair of independent (and identically distributed) random variables, so CCA does not extract any dependence between $X$ and $Y$ at all. Of course, this is the main point of the present example.
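To spell out the step from the scaled-identity covariance to the degeneracy of CCA (a short verification, using the standard CCA formulation reviewed in Appendix B rather than the specific Equations (34) and (35)):

```latex
% The merged covariance being a scaled identity means the cross-covariance vanishes,
\Sigma_{XY} = 0
\quad\Longrightarrow\quad
\Sigma_X^{-1/2}\,\Sigma_{XY}\,\Sigma_Y^{-1/2} = 0
\quad\Longrightarrow\quad
\rho_1 = \rho_2 = \cdots = 0 ,
% so every pair of unit-norm projections achieves the same (zero) correlation.
```

Hence no linear projection of $X$ is correlated with any linear projection of $Y$, which is precisely the “ex aequo” degeneracy described above.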
How does CICA perform in this example? We selected this example because it represents one of the only cases for which a closed-form solution to the optimization problem in Equation (13) is known, at least in the case considered here. To see this, let us first observe that, in our example, the solution can be written out explicitly. Let us now apply Version 1 (the MAP version) of Generic Procedure 1. To this end, we also need to calculate the associated conditional distributions; again, these can be expressed in closed form. As a final note, we point out that it is conceptually straightforward to evaluate Versions 2 and 3 (conditional expectation and marginal integration) of Generic Procedure 1 in this example, but this would require embedding the considered binary alphabets into the real numbers. This makes it a less satisfying option for the simple example at hand.
6. A Gradient Descent Based Implementation
Algorithm 1: Approximate CICA Algorithm via Gradient Descent
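As a naive stand-in for Algorithm 1 (whose exact pseudocode we do not restate), the following sketch conveys the idea of a gradient-based solver: parameterize $p(w|x,y)$ by softmax logits and run plain gradient descent on the Lagrangian $I(X,Y;W) + \lambda\, I(X;Y|W)$, using finite-difference gradients for simplicity. The Lagrangian weight $\lambda$, the hyperparameters, and all names are our illustrative choices, not the authors’.

```python
import numpy as np

def mutual_information(pab):
    """I(A;B) in nats for a joint distribution given as a 2-D array."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log(pab[mask] / (pa * pb)[mask])))

def lagrangian(theta, pxy, lam, nw):
    """I(X,Y;W) + lam * I(X;Y|W) under a softmax parameterization of p(w|x,y)."""
    nx, ny = pxy.shape
    logits = theta.reshape(nx, ny, nw)
    pw_given_xy = np.exp(logits - logits.max(axis=2, keepdims=True))
    pw_given_xy /= pw_given_xy.sum(axis=2, keepdims=True)
    pxyw = pxy[:, :, None] * pw_given_xy                       # joint p(x, y, w)
    i_xy_w = mutual_information(pxyw.reshape(nx * ny, nw))     # I(X,Y;W), pair (x,y) as one variable
    pw = pxyw.sum(axis=(0, 1))
    i_x_y_given_w = sum(pw[w] * mutual_information(pxyw[:, :, w] / pw[w])
                        for w in range(nw) if pw[w] > 1e-12)   # I(X;Y|W)
    return i_xy_w + lam * i_x_y_given_w

def descend(pxy, nw=2, lam=5.0, steps=2000, lr=0.2, eps=1e-5, seed=0):
    """Plain gradient descent with finite-difference gradients (tiny alphabets only)."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal(pxy.size * nw)
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            bump = np.zeros_like(theta)
            bump[i] = eps
            grad[i] = (lagrangian(theta + bump, pxy, lam, nw)
                       - lagrangian(theta - bump, pxy, lam, nw)) / (2 * eps)
        theta -= lr * grad
    return theta.reshape(pxy.shape[0], pxy.shape[1], nw)

if __name__ == "__main__":
    # Doubly symmetric binary source: X uniform, Y equals X flipped with probability 0.1.
    pxy = np.array([[0.45, 0.05], [0.05, 0.45]])
    logits = descend(pxy)
    print("final objective:", lagrangian(logits.ravel(), pxy, lam=5.0, nw=2))
```

In practice, one would sweep $\lambda$ (or enforce the constraint $I(X;Y|W) \le \gamma$ directly) to trace out the curve $C_\gamma$; the finite-difference gradient restricts this sketch to very small alphabets.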
7. Extension to More than Two Sources
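For concreteness in reading the properties and the generalized procedure below, we state one natural way to extend the relaxation to $N$ sources, namely bounding the conditional total correlation in the sense of Watanabe; this is our assumed formalization, and the paper’s exact constraint may be phrased differently.

```latex
% Assumed N-source relaxation: minimize the information W carries about all sources,
% subject to the sources being nearly conditionally independent given W
% (the constraint is the conditional total correlation in the sense of Watanabe).
C_\gamma(X_1;\ldots;X_N) \;=\;
  \min_{p(w \mid x_1,\ldots,x_N)\,:\;\sum_{i=1}^{N} H(X_i \mid W) - H(X_1,\ldots,X_N \mid W) \,\le\, \gamma}
  \; I(X_1,\ldots,X_N;W)
```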
1. …
2. The relaxed common information is a convex and continuous function of $\gamma$ for $\gamma \ge 0$.
3. If … forms a Markov chain, then …
4. The cardinality of $\mathcal{W}$ may be restricted to …
5. If … are one-to-one functions, then applying them to the respective sources leaves the relaxed common information unchanged.
6. For discrete $X$, we have …
1. Select a real number $\gamma \ge 0$. This is the compression level: A low value of $\gamma$ represents low compression, and, thus, many components are retained. A high value of $\gamma$ represents high compression, and, thus, only a small number of components are retained.
2. Solve the relaxed Wyner’s common information problem for the $N$ sources, leading to an associated conditional distribution $p(w \mid x_1, \ldots, x_N)$.
3. Using the conditional distribution found in Step 2, the dimension-reduced data sets can now be found via one of the following three variants, applied separately for each of the $N$ sources:
   (a) Version 1: MAP (maximum a posteriori);
   (b) Version 2: Conditional Expectation;
   (c) Version 3: Marginal Integration.
7.1. A Binary Example with Three Sources
8. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
CCA: Canonical correlation analysis
ACE: Alternating conditional expectation
ICA: Independent component analysis
CICA: Common information components analysis
Appendix A. Proof of Lemma 1
Appendix B. A Brief Review of Canonical Correlation Analysis (CCA)
Appendix C. Proof of Lemma 3
Appendix D. Proof of Lemma 4