Information Theory and Machine Learning

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (28 February 2022) | Viewed by 48054

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Prof. Dr. Lizhong Zheng
Guest Editor
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
Interests: learning theory; deep learning; information geometry; statistical inference; multi-terminal information theory

Prof. Dr. Chao Tian
Guest Editor
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
Interests: computational approaches to information theoretic converses; coding for distributed data storage; joint source-channel coding; approximate approaches to lossy multiuser source coding problems in network information theory

Special Issue Information

Dear Colleagues,

A number of significant steps in the development of machine learning have benefited from information theoretic analysis and from the insights into information processing that it brings. While we expect information theory to play an even more significant role in the next wave of growth in machine learning and artificial intelligence, we also recognize the new challenges that come with this task. The goals are indeed lofty: to take a holistic view of data processing, to work with high-dimensional data and inaccurate statistical models, to incorporate domain knowledge, to provide guarantees on performance, robustness, security, and fairness, to reduce the use of computational resources, to generate reusable and interpretable learning results, and so on. Correspondingly, theoretical studies will need new formulations, new mathematical tools, new analysis techniques, and perhaps even new metrics for evaluating the guidance and insights that they offer.

The goal of this Special Issue is to collect new results in using information theoretic thinking to solve machine learning problems. We are also interested in papers presenting new methods and new concepts, even if some of these ideas might not have been fully developed, or might not have the most compelling set of supporting experimental results.

Some of the topics of interest are listed below:

  • Understanding gradient descent and general iterative algorithms;
  • Sample complexity and generalization errors;
  • Utilizing knowledge of data structure in learning;
  • Distributed learning, communication-aware learning algorithms;
  • Transfer learning;
  • Multimodal learning and information fusion;
  • Information theoretic approaches in active and reinforcement learning;
  • Representation learning and its information theoretic interpretation;
  • Methods and theory for model compression.

Prof. Dr. Lizhong Zheng
Prof. Dr. Chao Tian
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (11 papers)

Research

13 pages, 1437 KiB  
Article
Improved Information-Theoretic Generalization Bounds for Distributed, Federated, and Iterative Learning
by Leighton Pate Barnes, Alex Dytso and Harold Vincent Poor
Entropy 2022, 24(9), 1178; https://doi.org/10.3390/e24091178 - 24 Aug 2022
Cited by 6 | Viewed by 1891
Abstract
We consider information-theoretic bounds on the expected generalization error for statistical learning problems in a network setting. In this setting, there are K nodes, each with its own independent dataset, and the models from the K nodes have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of 1/K on the number of nodes. These “per node” bounds are in terms of the mutual information between the training dataset and the trained weights at each node and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
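
For orientation, the following is a schematic per-node mutual-information bound of the kind described in the abstract, written in the style of standard information-theoretic generalization bounds; it is illustrative only and not the paper's exact statement. Here the loss is assumed to be σ-sub-Gaussian, S_k is the dataset of n samples at node k, and W_k the weights trained at that node.

```latex
% Schematic only: a standard mutual-information generalization bound averaged
% over the K nodes; the paper's improved bounds are stated in the article itself.
\mathbb{E}\bigl[\operatorname{gen}\bigr]
  \;\le\; \frac{1}{K}\sum_{k=1}^{K}
  \sqrt{\frac{2\sigma^{2}\, I(S_k ; W_k)}{n}}
```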

26 pages, 1424 KiB  
Article
A Pattern Dictionary Method for Anomaly Detection
by Elyas Sabeti, Sehong Oh, Peter X. K. Song and Alfred O. Hero
Entropy 2022, 24(8), 1095; https://doi.org/10.3390/e24081095 - 9 Aug 2022
Cited by 2 | Viewed by 2940
Abstract
In this paper, we propose a compression-based anomaly detection method for time series and sequence data using a pattern dictionary. The proposed method is capable of learning complex patterns in a training data sequence, using these learned patterns to detect potentially anomalous patterns in a test data sequence. The proposed pattern dictionary method uses a measure of complexity of the test sequence as an anomaly score that can be used to perform stand-alone anomaly detection. We also show that when combined with a universal source coder, the proposed pattern dictionary yields a powerful atypicality detector that is equally applicable to anomaly detection. The pattern dictionary-based atypicality detector uses an anomaly score defined as the difference between the complexity of the test sequence data encoded by the trained pattern dictionary (typical) encoder and the universal (atypical) encoder, respectively. We consider two complexity measures: the number of parsed phrases in the sequence, and the length of the encoded sequence (codelength). Specializing to a particular type of universal encoder, the Tree-Structured Lempel–Ziv (LZ78), we obtain a novel non-asymptotic upper bound, in terms of the Lambert W function, on the number of distinct phrases resulting from the LZ78 parser. This non-asymptotic bound determines the range of anomaly score. As a concrete application, we illustrate the pattern dictionary framework for constructing a baseline of health against which anomalous deviations can be detected. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
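
As a toy illustration of two ingredients named in the abstract, LZ78-style incremental parsing and the phrase count as a complexity measure, here is a minimal sketch. The sequences, the baseline construction, and the score are hypothetical and far simpler than the authors' pattern dictionary and atypicality detector.

```python
# Minimal sketch (assumptions: character sequences, phrase count as the
# complexity measure). Not the authors' implementation; it only illustrates
# LZ78-style parsing and a naive anomaly score.

def lz78_phrase_count(seq):
    """Parse `seq` with LZ78 and return the number of distinct phrases."""
    dictionary = {""}          # the empty phrase is always present
    phrase = ""
    count = 0
    for symbol in seq:
        candidate = phrase + symbol
        if candidate in dictionary:
            phrase = candidate            # keep extending the current phrase
        else:
            dictionary.add(candidate)     # a new phrase ends here
            count += 1
            phrase = ""
    if phrase:                            # unfinished phrase at the end
        count += 1
    return count

# Naive anomaly score: a test window that parses into many more phrases than a
# typical training window of the same length is flagged as anomalous.
train = "abababababababab" * 8
test_normal = "abababab" * 4
test_anomalous = "azqkrwxm" * 4
baseline = lz78_phrase_count(train[:len(test_normal)])
print(lz78_phrase_count(test_normal) - baseline)     # approximately 0
print(lz78_phrase_count(test_anomalous) - baseline)  # clearly positive
```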

16 pages, 722 KiB  
Article
A Maximal Correlation Framework for Fair Machine Learning
by Joshua Lee, Yuheng Bu, Prasanna Sattigeri, Rameswar Panda, Gregory W. Wornell, Leonid Karlinsky and Rogerio Schmidt Feris
Entropy 2022, 24(4), 461; https://doi.org/10.3390/e24040461 - 26 Mar 2022
Cited by 2 | Viewed by 2605
Abstract
As machine learning algorithms grow in popularity and diversify to many industries, ethical and legal concerns regarding their fairness have become increasingly relevant. We explore the problem of algorithmic fairness, taking an information–theoretic view. The maximal correlation framework is introduced for expressing fairness constraints and is shown to be capable of being used to derive regularizers that enforce independence and separation-based fairness criteria, which admit optimization algorithms for both discrete and continuous variables that are more computationally efficient than existing algorithms. We show that these algorithms provide smooth performance–fairness tradeoff curves and perform competitively with state-of-the-art methods on both discrete datasets (COMPAS, Adult) and continuous datasets (Communities and Crimes). Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
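
The sketch below is a crude stand-in for the idea of penalizing statistical dependence between model outputs and a sensitive attribute. It uses plain linear correlation rather than maximal correlation, so it should be read only as a sketch of the independence criterion, not as the authors' regularizers or optimization algorithms; the function name and data are hypothetical.

```python
# Crude linear proxy for a dependence-penalizing fairness regularizer.
import numpy as np

def correlation_penalty(scores, sensitive):
    """Squared Pearson correlation between model scores and the sensitive
    attribute; driving this toward zero removes linear dependence, a necessary
    (but not sufficient) condition for the independence criterion."""
    s = (scores - scores.mean()) / (scores.std() + 1e-12)
    a = (sensitive - sensitive.mean()) / (sensitive.std() + 1e-12)
    return float(np.mean(s * a) ** 2)

rng = np.random.default_rng(0)
sensitive = rng.integers(0, 2, size=1000).astype(float)
biased_scores = 0.8 * sensitive + rng.normal(size=1000)   # depends on the attribute
fair_scores = rng.normal(size=1000)                       # independent of it
print(correlation_penalty(biased_scores, sensitive))      # noticeably > 0
print(correlation_penalty(fair_scores, sensitive))        # near 0
```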

40 pages, 43362 KiB  
Article
CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction
by Xili Dai, Shengbang Tong, Mingyang Li, Ziyang Wu, Michael Psenka, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Xiaojun Yuan, Heung-Yeung Shum and Yi Ma
Entropy 2022, 24(4), 456; https://doi.org/10.3390/e24040456 - 25 Mar 2022
Cited by 15 | Viewed by 14889
Abstract
This work proposes a new computational framework for learning a structured generative model for real-world datasets. In particular, we propose to learn a Closed-loop Transcription between a multi-class, multi-dimensional data distribution and a Linear discriminative representation (CTRL) in the feature space that consists of multiple independent multi-dimensional linear subspaces. Specifically, we argue that the optimal encoding and decoding mappings sought can be formulated as a two-player minimax game between the encoder and decoder for the learned representation. A natural utility function for this game is the so-called rate reduction, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback in control systems and avoids the expensive evaluation and minimization of approximated distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the setting of learning a representation that is both discriminative and generative for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate the tremendous potential of this new closed-loop formulation: under fair comparison, the visual quality of the learned decoder and the classification performance of the encoder are competitive and arguably better than those of existing methods based on GAN, VAE, or a combination of both. Unlike existing generative models, the so-learned features of the multiple classes are structured instead of hidden: different classes are explicitly mapped onto corresponding independent principal subspaces in the feature space, and diverse visual attributes within each class are modeled by the independent principal components within each subspace. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
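
For readers unfamiliar with the rate-reduction measure mentioned in the abstract, the sketch below computes it in the form popularized by the maximal-coding-rate-reduction line of work; treat it as an illustration of the quantity, not as the CTRL training objective or the authors' implementation. The precision parameter eps and the data layout are assumptions.

```python
# Minimal numpy sketch of the rate-reduction measure (illustrative only).
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T) for features Z of shape (d, n)."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(Z) - sum_j (n_j/n) * R(Z_j), summed over the classes.
    `labels` is an (n,)-shaped integer array of class indices."""
    n = Z.shape[1]
    within = sum((np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
                 for c in np.unique(labels))
    return coding_rate(Z, eps) - within

Z = np.random.default_rng(0).normal(size=(16, 300))
labels = np.repeat(np.arange(3), 100)
print(rate_reduction(Z, labels))
```

A larger gap between the coding rate of all features and the class-conditional rates indicates that the class subspaces are spread out relative to their within-class volume, which is the quantity the minimax formulation in the abstract plays over.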

17 pages, 1950 KiB  
Article
Robust Spike-Based Continual Meta-Learning Improved by Restricted Minimum Error Entropy Criterion
by Shuangming Yang, Jiangtong Tan and Badong Chen
Entropy 2022, 24(4), 455; https://doi.org/10.3390/e24040455 - 25 Mar 2022
Cited by 137 | Viewed by 5839
Abstract
The spiking neural network (SNN) is regarded as a promising candidate to deal with the great challenges presented by current machine learning techniques, including the high energy consumption induced by deep neural networks. However, there is still a great gap between SNNs and the online meta-learning performance of artificial neural networks. Importantly, existing spike-based online meta-learning models do not target the robust learning based on spatio-temporal dynamics and superior machine learning theory. In this invited article, we propose a novel spike-based framework with minimum error entropy, called MeMEE, using the entropy theory to establish the gradient-based online meta-learning scheme in a recurrent SNN architecture. We examine the performance based on various types of tasks, including autonomous navigation and the working memory test. The experimental results show that the proposed MeMEE model can effectively improve the accuracy and the robustness of the spike-based meta-learning performance. More importantly, the proposed MeMEE model emphasizes the application of the modern information theoretic learning approach on the state-of-the-art spike-based learning algorithms. Therefore, in this invited paper, we provide new perspectives for further integration of advanced information theory in machine learning to improve the learning performance of SNNs, which could be of great merit to applied developments with spike-based neuromorphic systems. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
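
To make the minimum error entropy criterion concrete, here is a minimal numpy sketch of Renyi's quadratic entropy of prediction errors estimated with a Gaussian Parzen window, the standard information-theoretic-learning form. It is not the paper's restricted MEE criterion and says nothing about the spiking architecture; the bandwidth sigma is a free assumption.

```python
# Minimal sketch of an MEE-style loss: Renyi's quadratic entropy of the errors.
import numpy as np

def error_entropy(errors, sigma=1.0):
    """Estimate -log V(e), where the information potential is
    V(e) = (1/N^2) * sum_ij G_{sqrt(2)*sigma}(e_i - e_j)."""
    e = np.asarray(errors, dtype=float)
    diffs = e[:, None] - e[None, :]
    kernel = np.exp(-diffs**2 / (4 * sigma**2)) / (2 * sigma * np.sqrt(np.pi))
    return -np.log(kernel.mean())

# Training under MEE would minimize error_entropy(y_true - y_pred) instead of MSE.
print(error_entropy(np.random.default_rng(0).normal(size=200)))
```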

28 pages, 498 KiB  
Article
An Information Theoretic Interpretation to Deep Neural Networks
by Xiangxiang Xu, Shao-Lun Huang, Lizhong Zheng and Gregory W. Wornell
Entropy 2022, 24(1), 135; https://doi.org/10.3390/e24010135 - 17 Jan 2022
Cited by 13 | Viewed by 4235
Abstract
With the unprecedented performance achieved by deep learning, it is commonly believed that deep neural networks (DNNs) attempt to extract informative features for learning tasks. To formalize this intuition, we apply the local information geometric analysis and establish an information-theoretic framework for feature selection, which demonstrates the information-theoretic optimality of DNN features. Moreover, we conduct a quantitative analysis to characterize the impact of network structure on the feature extraction process of DNNs. Our investigation naturally leads to a performance metric for evaluating the effectiveness of extracted features, called the H-score, which illustrates the connection between the practical training process of DNNs and the information-theoretic framework. Finally, we validate our theoretical results by experimental designs on synthesized data and the ImageNet dataset. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
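
Below is a minimal sketch of an H-score computation using the commonly quoted form H(f) = tr(cov(f(X))^{-1} cov(E[f(X)|Y])); the paper's exact definition and normalization should be taken from the article itself, and the array shapes here are assumptions.

```python
# Illustrative H-score computation for a feature matrix and class labels.
import numpy as np

def h_score(features, labels):
    """features: (n, k) array of extracted features f(X); labels: (n,) class ids."""
    f = features - features.mean(axis=0)
    cov_f = np.cov(f, rowvar=False)                        # cov(f(X))
    classes = np.unique(labels)
    class_means = np.stack([f[labels == c].mean(axis=0)    # E[f(X) | Y = c]
                            for c in classes])
    weights = np.array([np.mean(labels == c) for c in classes])
    cov_cond = (weights[:, None, None] *
                np.einsum('ci,cj->cij', class_means, class_means)).sum(axis=0)
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_cond))

rng = np.random.default_rng(0)
y = rng.integers(0, 3, 600)
feats = np.eye(3)[y] + 0.5 * rng.normal(size=(600, 3))    # label-informative features
print(h_score(feats, y))   # larger than the score of pure-noise features
```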

13 pages, 492 KiB  
Article
Probabilistic Deterministic Finite Automata and Recurrent Networks, Revisited
by Sarah E. Marzen and James P. Crutchfield
Entropy 2022, 24(1), 90; https://doi.org/10.3390/e24010090 - 6 Jan 2022
Cited by 1 | Viewed by 2737
Abstract
Reservoir computers (RCs) and recurrent neural networks (RNNs) can mimic any finite-state automaton in theory, and some workers demonstrated that this can hold in practice. We test the capability of generalized linear models, RCs, and Long Short-Term Memory (LSTM) RNN architectures to predict the stochastic processes generated by a large suite of probabilistic deterministic finite-state automata (PDFA) in the small-data limit according to two metrics: predictive accuracy and distance to a predictive rate-distortion curve. The latter provides a sense of whether or not the RNN is a lossy predictive feature extractor in the information-theoretic sense. PDFAs provide an excellent performance benchmark in that they can be systematically enumerated, the randomness and correlation structure of their generated processes are exactly known, and their optimal memory-limited predictors are easily computed. With less data than is needed to make a good prediction, LSTMs surprisingly lose at predictive accuracy, but win at lossy predictive feature extraction. These results highlight the utility of causal states in understanding the capabilities of RNNs to predict. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
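
As a concrete example of the kind of generator benchmarked here, the sketch below samples from one classic PDFA, the "even process" (runs of 1s have even length). The paper's systematically enumerated suite and the RNN/RC comparison are not reproduced; this only shows what a PDFA-generated stochastic process looks like.

```python
# Minimal PDFA sampler for the "even process" (illustrative benchmark process).
import numpy as np

def sample_even_process(length, p=0.5, rng=None):
    rng = rng or np.random.default_rng()
    state, out = 0, []               # state 0: may emit 0 or start a pair of 1s
    for _ in range(length):
        if state == 0:
            symbol = int(rng.random() < p)   # emit 1 with prob p, else 0
            state = 1 if symbol == 1 else 0
        else:
            symbol, state = 1, 0             # must close the pair of 1s
        out.append(symbol)
    return out

print(sample_even_process(20, p=0.5))
```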

24 pages, 2454 KiB  
Article
Summarizing Finite Mixture Model with Overlapping Quantification
by Shunki Kyoya and Kenji Yamanishi
Entropy 2021, 23(11), 1503; https://doi.org/10.3390/e23111503 - 13 Nov 2021
Cited by 4 | Viewed by 2338
Abstract
Finite mixture models are widely used for modeling and clustering data. When they are used for clustering, they are often interpreted by regarding each component as one cluster. However, this assumption may be invalid when the components overlap. It leads to the issue of analyzing such overlaps to correctly understand the models. The primary purpose of this paper is to establish a theoretical framework for interpreting the overlapping mixture models by estimating how they overlap, using measures of information such as entropy and mutual information. This is achieved by merging components to regard multiple components as one cluster and summarizing the merging results. First, we propose three conditions that any merging criterion should satisfy. Then, we investigate whether several existing merging criteria satisfy the conditions and modify them to fulfill more conditions. Second, we propose a novel concept named clustering summarization to evaluate the merging results. In it, we can quantify how overlapped and biased the clusters are, using mutual information-based criteria. Using artificial and real datasets, we empirically demonstrate that our methods of modifying criteria and summarizing results are effective for understanding the cluster structures. We therefore give a new view of interpretability/explainability for model-based clustering. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
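
The toy sketch below illustrates the general idea of quantifying overlap between mixture components through posterior responsibilities. The merging criteria and the clustering summarization proposed in the paper are more refined; the overlap score used here is our own simplification for illustration.

```python
# Toy overlap quantification for a fitted Gaussian mixture (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),      # two overlapping blobs
               rng.normal(1.0, 1.0, (200, 2)),
               rng.normal(8.0, 1.0, (200, 2))])      # one well-separated blob
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
resp = gmm.predict_proba(X)                          # posterior responsibilities

def pairwise_overlap(resp, i, j, eps=1e-12):
    """Responsibility-weighted binary entropy of the split between components i, j."""
    w = resp[:, i] + resp[:, j]
    p = resp[:, i] / (w + eps)
    h = -p * np.log2(p + eps) - (1 - p) * np.log2(1 - p + eps)
    return float(np.sum(w * h) / (np.sum(w) + eps))

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, round(pairwise_overlap(resp, i, j), 3))
# The pair fitted to the two overlapping blobs scores near 1 bit and would be a
# merge candidate; pairs involving the well-separated blob score near 0.
```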

20 pages, 494 KiB  
Article
Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method
by Matthew Dixon and Tyler Ward
Entropy 2021, 23(11), 1419; https://doi.org/10.3390/e23111419 - 28 Oct 2021
Cited by 3 | Viewed by 2269
Abstract
Modern computational models in supervised machine learning are often highly parameterized universal approximators. As such, the value of the parameters is unimportant, and only out-of-sample performance is considered. On the other hand, much of the literature on model estimation assumes that the parameters themselves have intrinsic value, and thus is concerned with the bias and variance of parameter estimates, which may not have any simple relationship to out-of-sample model performance. Therefore, within supervised machine learning, heavy use is made of ridge regression (i.e., L2 regularization), which requires the estimation of hyperparameters and can be rendered ineffective by certain model parameterizations. We introduce an objective function, which we refer to as Information-Corrected Estimation (ICE), that reduces KL-divergence-based generalization error for supervised machine learning. ICE attempts to directly maximize a corrected likelihood function as an estimator of the KL divergence. Such an approach is proven, theoretically, to be effective for a wide class of models, with only mild regularity restrictions. Under finite sample sizes, this corrected estimation procedure is shown experimentally to lead to a significant reduction in generalization error compared to maximum likelihood estimation and L2 regularization. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
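
For orientation only: a corrected-likelihood objective of the general shape discussed in the abstract can be written with a Takeuchi-style trace penalty, where ℓ is the per-sample negative log-likelihood, Ĵ the empirical Hessian, and Î the empirical outer-product (Fisher) matrix. The exact ICE objective and its regularity conditions are those given in the article, which may differ from this schematic.

```latex
% Schematic corrected objective (TIC-style penalty); illustrative only.
\hat{\theta}_{\mathrm{ICE}}
  \;=\; \arg\min_{\theta}\;
  \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; x_i)
  \;+\; \frac{1}{n}\,\operatorname{tr}\!\bigl(\hat{J}(\theta)^{-1}\hat{I}(\theta)\bigr)
```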

20 pages, 756 KiB  
Article
Population Risk Improvement with Model Compression: An Information-Theoretic Approach
by Yuheng Bu, Weihao Gao, Shaofeng Zou and Venugopal V. Veeravalli
Entropy 2021, 23(10), 1255; https://doi.org/10.3390/e23101255 - 27 Sep 2021
Cited by 11 | Viewed by 2430
Abstract
It has been reported in many recent works on deep model compression that the population risk of a compressed model can be even better than that of the original model. In this paper, an information-theoretic explanation for this population risk improvement phenomenon is provided by jointly studying the decrease in the generalization error and the increase in the empirical risk that results from model compression. It is first shown that model compression reduces an information-theoretic bound on the generalization error, which suggests that model compression can be interpreted as a regularization technique to avoid overfitting. The increase in empirical risk caused by model compression is then characterized using rate distortion theory. These results imply that the overall population risk could be improved by model compression if the decrease in generalization error exceeds the increase in empirical risk. A linear regression example is presented to demonstrate that such a decrease in population risk due to model compression is indeed possible. Our theoretical results further suggest a way to improve a widely used model compression algorithm, i.e., Hessian-weighted K-means clustering, by regularizing the distance between the clustering centers. Experiments with neural networks are provided to validate our theoretical assertions. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
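
The improvement suggested in the abstract targets Hessian-weighted K-means clustering of model weights. The sketch below shows that base algorithm in one dimension, with a shrinkage term lam added as our own schematic rendering of "regularizing the distance between the clustering centers"; it is not the paper's formulation.

```python
# Schematic Hessian-weighted K-means for weight clustering (illustrative only).
import numpy as np

def hessian_weighted_kmeans(w, h, k, lam=0.0, iters=50, rng=None):
    """Cluster weights w (n,) with importances h (n,) (e.g. diagonal Hessian
    entries); lam > 0 shrinks each center toward the mean of the centers."""
    rng = rng or np.random.default_rng(0)
    centers = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin((w[:, None] - centers[None, :]) ** 2, axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # importance-weighted mean, shrunk toward the overall center
                centers[j] = ((h[mask] * w[mask]).sum() + lam * centers.mean()) \
                             / (h[mask].sum() + lam)
    return centers, assign

rng = np.random.default_rng(0)
w = rng.normal(size=1000)                    # flattened model weights
h = rng.uniform(0.1, 1.0, size=1000)         # e.g. diagonal Hessian entries
centers, assign = hessian_weighted_kmeans(w, h, k=8, lam=0.1, rng=rng)
print(np.sort(centers))
```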

23 pages, 802 KiB  
Article
On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements
by Farzad Shahrivari and Nikola Zlatanov
Entropy 2021, 23(8), 1045; https://doi.org/10.3390/e23081045 - 13 Aug 2021
Viewed by 2132
Abstract
In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements that take values from a finite alphabet set. First, we show the importance of this problem. Next, we propose a classifier and derive an analytical upper bound on its error probability. We show that the error probability moves to zero as the length of the feature vectors grows, even when there is only one training feature vector per label available. Thereby, we show that for this important problem at least one asymptotically optimal classifier exists. Finally, we provide numerical examples where we show that the performance of the proposed classifier outperforms conventional classification algorithms when the number of training data is small and the length of the feature vectors is sufficiently high. Full article
(This article belongs to the Special Issue Information Theory and Machine Learning)
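
A naive-Bayes-style plug-in rule in the spirit of the setting described (mutually independent, non-identically distributed features over a finite alphabet) is sketched below; it is not necessarily the authors' proposed classifier or the one their error bound analyzes, and the Laplace smoothing constant alpha is an assumption.

```python
# Plug-in classifier sketch: one categorical distribution per feature position
# and per label, estimated with Laplace smoothing (illustrative only).
import numpy as np

def fit(train_X, train_y, alphabet_size, n_classes, alpha=1.0):
    n, m = train_X.shape                       # m independent feature positions
    probs = np.full((n_classes, m, alphabet_size), alpha)
    for x, y in zip(train_X, train_y):
        probs[y, np.arange(m), x] += 1.0       # count observed symbols
    return probs / probs.sum(axis=2, keepdims=True)

def predict(probs, X):
    pos = np.arange(X.shape[1])
    # log-likelihood of each sample under each class, summed over positions
    loglik = np.stack([np.log(probs[c][pos, X]).sum(axis=1)
                       for c in range(probs.shape[0])], axis=1)
    return loglik.argmax(axis=1)

rng = np.random.default_rng(0)
A, m, C = 4, 256, 2
true = rng.dirichlet(np.ones(A), size=(C, m))  # per-position, per-class laws
def draw(c, k):
    return np.array([[rng.choice(A, p=true[c, j]) for j in range(m)]
                     for _ in range(k)])
train_X = np.vstack([draw(0, 1), draw(1, 1)])  # one training sample per label
train_y = np.array([0, 1])
probs = fit(train_X, train_y, alphabet_size=A, n_classes=C)
test_X = np.vstack([draw(0, 50), draw(1, 50)])
print((predict(probs, test_X) == np.repeat([0, 1], 50)).mean())
# Accuracy is high even with a single training sample per label when m is
# large, in line with the asymptotic behavior the abstract describes.
```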