1. Introduction to Model Selection Criteria Based on Bayesian Statistics and Information Theory
Model Selection (MS) can be defined as the task of identifying the model best supported by the data, among a set of potential candidates [1]. In many fields, model selection is an essential part of scientific enquiry [2]. It can also be argued that this step is often among the most delicate in statistical inference.
The exact definition of what is meant by the best model is controversial and probably application dependent [3]. Indeed, the requirements placed on models are not the same if the goal of the study is prediction, explanation, or control. In any case, basically all approaches to model selection try to find a compromise between goodness of fit and complexity. At the same level of goodness of fit, simpler models are preferred, implementing a form of Occam's razor. The goodness of fit is assessed with the likelihood or, when this is not possible, with some metric quantifying the residuals, i.e., the distance between the model predictions and the data. The complexity of the models is identified with the number of model parameters. In the following, attention will be focussed on model selection criteria (MSC) derived with the help of Bayesian statistics and information theory, since these are the ones explicitly designed to find a trade-off between goodness of fit and complexity. In any case, similar considerations apply also to frequentist techniques. A remark about the nomenclature is in order at this point. Since the application covered in the present work is regression, in the following the term database indicates a finite ordered list of entries, each consisting of a dependent variable, y, and a series of p regressors or predictors, x_i.
The most widely accepted and best understood model selection criteria, based on information theory and Bayesian statistics, are the Akaike Information Criterion (AIC) [4] and the Bayesian Information Criterion (BIC) [5]. The theoretical derivations of these metrics result in the following unbiased forms of the criteria:

AIC = 2k − 2 ln(L)    (1)

BIC = k ln(n) − 2 ln(L)    (2)

where L is the likelihood of the model given the data, k the number of parameters in the model, and n the number of entries in the database (also called the sample size). Both the AIC and BIC metrics are basically cost functions, which have to be minimized; they favour models with a high likelihood but implement a penalty for complexity (the term proportional to k).
Since in most applications, such as the ones discussed in this work, it is impossible to calculate the likelihood of the models, the metric adopted for the goodness of fit is the Euclidean distance of the residuals. Under the traditional assumption that the data are identically distributed and independently sampled from a normal distribution, it can be demonstrated that the AIC can be written (up to an additive constant, which depends only on the number of entries in the database and not on the model) as follows:

AIC = n ln(MSE) + 2k    (3)

where MSE is the mean-squared error of the residuals, n the number of entries in the database, and k the number of parameters in the model. Similar assumptions allow expressing the BIC criterion as follows:

BIC = n ln(σ²) + k ln(n)    (4)

where σ² is the variance of the residuals, n is again the number of entries in the database, and k the number of parameters in the model. The derivation of these two criteria in the various approximations is fully covered in [6].
These two indicators, and all the others belonging to the same families, are cost functions to be minimised, in the sense that the better the model the lower their value. This can be intuitively appreciated by a simple inspection of their structure. The first term favours models that are closer to the data. The second addend is the penalty term for complexity.
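Under the Gaussian-residual assumption above, both criteria reduce to simple functions of the residuals. A minimal sketch in Python (the function names are illustrative):

```python
import numpy as np

def aic(residuals, k):
    """Gaussian-form AIC: n*ln(MSE) + 2k, up to a model-independent constant."""
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    mse = np.mean(residuals ** 2)  # mean-squared error of the residuals
    return n * np.log(mse) + 2 * k

def bic(residuals, k):
    """Gaussian-form BIC: n*ln(var) + k*ln(n), up to a model-independent constant."""
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    var = np.mean(residuals ** 2)  # residual variance around the fitted model
    return n * np.log(var) + k * np.log(n)
```

Both are cost functions: between two candidate models fitted to the same data, the one with the lower value is preferred, and at equal goodness of fit the model with fewer parameters always wins.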
In recent years, various upgrades of these criteria have been proposed. They are mainly meant to improve the assessment of the goodness of fit, by utilising more sophisticated statistics than the simple MSE, and to devise more accurate estimates of the penalisation for complexity [7,8]. All these improvements have proved to be quite significant, but they do not consider explicitly the problems related to the choice of the regressors and the effects of the measurement uncertainties. They basically assume that the independent variables have already been properly identified, without any specific provision for this aspect. Some of them deploy quite sophisticated statistical indicators of the distribution of the residuals, but they all take the measurements as given, without any error bar. These issues can be quite relevant when investigating complex systems. Typically, in the field of complexity, various quantities can be spuriously correlated with the dependent one, and measurements can be affected by significant uncertainties due to the poor accessibility of many systems. In this situation, as will be shown in the following, the performance of the traditional versions of the AIC and BIC is unsatisfactory, both being prone to include redundant variables in the selected models.
This work aims to provide an upgraded version of the traditional AIC and BIC criteria to alleviate the problems posed by quantities spuriously correlated with the actual predictors. These quantities tend to mislead the available versions of the indicators, inducing them to converge on models with an excessive number of non-relevant regressors. The situation is significantly worsened by the presence of substantial levels of noise, which tends to blur the relations between the dependent quantities and the predictors, as shown in Section 4, which is devoted to the numerical tests. It should be mentioned that the vast majority, if not all, of the applications of model selection criteria involve experimental measurements, which are always affected by some form of noise. The capability of the proposed improvements to deal with uncertainties is therefore an important aspect that needs to be assessed.
The paper is organized as follows. In the next section, the main information theoretic indicators used in the rest of the paper are reviewed. In Section 3, the derivation of the upgraded versions of the AIC and BIC is covered. In Section 4, the performances of the upgraded criteria are evaluated through a series of systematic tests. In Section 5, an application of the derived criteria to a real-life database is reported. The conclusions of the paper are presented in the final section.
2. Brief Review of the Information Theoretic Indicators Relevant to the Upgrades of the Model Selection Criteria
The first information theoretic quantity, required to understand the improvements of the MSC proposed in this work, is the Mutual Information (MI) between two random variables, X and Y [9]:

I(X; Y) = Σ_x Σ_y P_XY(x, y) ln [ P_XY(x, y) / (P_X(x) P_Y(y)) ]    (5)

where P_XY is the joint probability distribution function (pdf) of the random variables X and Y, and P_X and P_Y are the corresponding marginals. Being fully nonlinear, contrary to the Pearson correlation coefficient, the MI is well suited to extract, from a given database, the best features, i.e., the best regressors, X_i, to reproduce the desired dependent variable Y.
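In practice, the MI must be estimated from the samples. A minimal plug-in sketch in Python, using a two-dimensional histogram (the bin count is an illustrative choice; finer estimators, such as k-nearest-neighbour ones, are common):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in estimate of I(X;Y), in nats, from a 2-D histogram of the samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # empirical joint pdf
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y (row vector)
    mask = pxy > 0                        # 0*log(0) terms contribute nothing
    joint = pxy[mask]
    prod = (px @ py)[mask]                # product of the marginals on the same cells
    return float(np.sum(joint * np.log(joint / prod)))
```

The estimate is non-negative and captures nonlinear dependencies that the Pearson coefficient can miss entirely.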
The second important information theoretic indicator used in the rest of the paper is the concept of redundancy, RD, between a variable X_i and a set, S, of other variables, X_j, i.e., the average mutual information between X_i and the members of S:

RD(X_i, S) = (1/|S|) Σ_{X_j ∈ S} I(X_i; X_j)    (6)

Mutual information and redundancy allow defining a quantity, called relevance, RL, which quantifies the net contribution of a variable to reducing the uncertainty in a different one, Y, above what is already contributed by another set of quantities. Relevance is defined as

RL(X_i, Y, S) = I(X_i; Y) − RD(X_i, S)    (7)
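Assuming the average-MI form of the redundancy and the MI-minus-redundancy form of the relevance given above (the exact normalizations may differ in specific implementations), the two indicators can be computed as follows; `mi` stands for any mutual information estimator passed in by the caller:

```python
import numpy as np

def redundancy(xi, S, mi):
    """RD: average mutual information between regressor xi and the variables in set S."""
    if len(S) == 0:
        return 0.0
    return float(np.mean([mi(xi, xj) for xj in S]))

def relevance(xi, y, S, mi):
    """RL: net contribution of xi to explaining y, i.e., MI with y minus redundancy with S."""
    return mi(xi, y) - redundancy(xi, S, mi)
```

The intuition: a near-copy of an already selected regressor has high MI with the target y but near-zero relevance, because everything it says about y is already contributed by the set S.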
3. Derivation of the Upgraded Version of the BIC and AIC
In this section, the original versions of the BIC and AIC criteria are reviewed, and this provides an introduction to the derivation of the upgraded versions of the criteria. The BIC criterion is discussed first because it allows a more natural introduction of the proposed improvements.
3.1. Upgraded Version of the BIC
The Bayesian approach to model selection is based on the maximization of the posterior probability of a model M_i given the data D. From the Bayes theorem, this posterior probability can be written as follows:

P(M_i | D) = P(D | M_i) P(M_i) / P(D)    (8)

where P(D | M_i) is the marginal likelihood of the model M_i and can be evaluated as follows:

P(D | M_i) = ∫ P(D | θ_i, M_i) P(θ_i | M_i) dθ_i    (9)

where θ_i is the vector of the parameters of the model M_i and P(θ_i | M_i) is the probability distribution of the parameters. It can be demonstrated that, for high n, and setting P(θ_i | M_i) = const (uninformative prior), Equation (9) can be approximated with

P(D | M_i) ≈ L̂_i n^(−k_i/2)    (10)

with L̂_i the likelihood of model M_i maximized over its parameters and k_i its number of parameters. Substituting (10) in (8), we obtain the following:

P(M_i | D) ∝ L̂_i n^(−k_i/2) P(M_i)    (11)
If we set P(M_i) = const, which implies considering all the models equally probable, (11) leads to the traditional definition of the BIC. Indeed, after taking the logarithm and simple mathematical manipulations, (11) becomes the following:

ln P(M_i | D) = ln L̂_i − (k_i/2) ln n + const    (12)

The right-hand side of Equation (12) can be recognized as the BIC criterion estimate for the model M_i with an inverted sign (up to a factor of 2). Indeed, maximizing (12) is equivalent to minimizing:

BIC_i = k_i ln n − 2 ln L̂_i    (13)

In situations for which the relevant assumptions are valid, the likelihood can be replaced with the variance of the residuals, with k the number of parameters in the model, allowing to recover Equation (4).
As shown in the following sections, when the redundancy between the regressors is not negligible, the traditional BIC criterion can fail to identify the right model, showing a tendency to include redundant variables in the selected solutions. To address this problem, a modified version of the BIC criterion is proposed, which, instead of assuming that the models have all the same probability, includes a penalty term for models with high redundancy in the predictor variables.
The proposed a priori probability distribution of the models depends on an overall quantity that we will indicate as WMRR (Weighted Mutual Regressor Relevance). Given a set of regressors X_1, …, X_p and a dependent variable Y, WMRR is defined as

WMRR = Σ_{i≠j} I(X_i; X_j) (1 − RLn_i)    (14)

where I(X_i; X_j) is the mutual information estimate between the i-th and j-th predictor variables and RLn_i is the relevance between the i-th predictor and the predicted variable, normalized to the maximum value. This quantity is higher for models which make use of predictors highly correlated between them and that, at the same time, have low relevance to the dependent variable. Note that, since I(X_i; X_j) ≥ 0 and RLn_i ≤ 1, WMRR is also positive definite.
The proposed a priori models' probability density function is a Chi-squared distribution function with two degrees of freedom, which can then be written as

P(M_i) = (1/2) exp(−WMRR_i / 2)    (15)

where WMRR_i is the value of the WMRR for the regressors of model M_i. In this way, models with WMRR_i = 0 have the highest probability of being chosen, while models with greater WMRR_i are penalised.
Plugging (15) into (11), one obtains the following:

P(M_i | D) ∝ L̂_i n^(−k_i/2) exp(−WMRR_i / 2)    (16)

which can be rewritten as

ln P(M_i | D) = ln L̂_i − (k_i/2) ln n − WMRR_i / 2 + const    (17)

Maximizing (17) is equivalent to minimizing

MIBIC_i = k_i ln n − 2 ln L̂_i + WMRR_i    (18)

If, as is often the case, the likelihood is difficult or impossible to calculate, and the variables are identically distributed and independently sampled from a normal distribution, the MIBIC can be written in the practical form:

MIBIC = n ln(σ²) + k ln(n) + WMRR    (19)

where, as usual, k indicates the number of the model's parameters and σ² the variance of the residuals.
The choice of the prior, which is a delicate point in any Bayesian statistical treatment, deserves a comment. Since the WMRR is positive definite, its probability distribution function should also be supported on the semi-infinite interval [0, ∞). Moreover, since the main idea behind the proposed improvement of the criterion hinges on penalizing models with strongly correlated variables, this pdf should reach its maximum value when WMRR = 0 and decrease as the WMRR increases. There are several pdfs that satisfy these conditions, but the Chi-squared distribution with two degrees of freedom is the most uninformative in the exponential family. Indeed, its implementation implies the simplest, linear correction term in the upgraded BIC.
3.2. Upgraded Version of the AIC
The derivation of the AIC criterion is based on the concept of minimizing the Kullback–Leibler divergence between the model generating the data and the fitted candidate model. Given the different derivation approach compared to the BIC, the formal addition of an a priori probability distribution function of the models is not possible. Nevertheless, since the AIC is also based on the assumption that the independent variables have already been properly identified and that the effects of the measurement uncertainties are negligible, it is reasonable to include a correction term also in the AIC, which can help in the model selection process when these assumptions are not met. As a consequence, in analogy with the already described MIBIC, the following upgraded version of the AIC, called MIAIC, is proposed:

MIAIC = n ln(MSE) + 2k + WMRR    (20)
It is worth mentioning that the same argument, leading to the same upgrade, is equally valid for the other indicators belonging to the AIC family, such as the c-AIC and the QAIC. Indeed, for the types of applications that are the subject of this work, these indicators can be expressed as the original AIC plus an additive term [6]. Consequently, perfectly analogous versions including the WMRR term can be easily calculated and have proved to be at least equally effective.
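Given an estimate of the WMRR for a candidate regressor set, the practical forms of the two upgraded criteria are straightforward. A sketch, assuming the additive WMRR correction derived above:

```python
import numpy as np

def mibic(mse, n, k, wmrr):
    """Upgraded BIC: Gaussian-form BIC plus the additive WMRR penalty."""
    return n * np.log(mse) + k * np.log(n) + wmrr

def miaic(mse, n, k, wmrr):
    """Upgraded AIC: Gaussian-form AIC plus the additive WMRR penalty."""
    return n * np.log(mse) + 2 * k + wmrr
```

With WMRR = 0, i.e., no redundancy among the regressors, both reduce exactly to the traditional criteria, which is the consistency property verified in the tests of Section 4.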
4. Results of Systematic Tests with Synthetic Data
To evaluate the performance of the upgraded versions of the indicators developed in this work, a series of systematic tests has been performed. The main families of functions have been investigated: power laws, polynomials, exponentials, and combinations thereof. Given the importance of this functional dependence, and the fact that the experimental case studied in the following belongs to this family, power laws are discussed first, which allows illustrating the methodology of the tests in detail. A synthetic dependent variable is generated from a set of predictor variables in the power-law form reported below:

y = x_1^{a_1} · x_2^{a_2} · x_3^{a_3}    (21)

The predicted variable is generated with Equation (21) using 3 uncorrelated random predictor variables, x_1, x_2, x_3. The coefficients in (21) are all set equal to 1, and a fixed number of data points is generated.
A fourth, correlated predictor variable, x_4, is added to the set of possible regressors of y, obtained as a function of the other predictors. Then, normally distributed noise is added to all the predictors. The corresponding parameter is the percentage of noise with respect to the standard deviation of the regressor, and it is varied over the range of values explored in the tests. Noise of the same type is added to the dependent variable y. After generating the variables and adding the noise, two models of the predicted variable have been obtained by fitting the power-law form to the noised values of y: the first using all four noised predictors available, and the second using only the three noised predictors actually used to build y. The two obtained models are compared using both the standard and the modified versions of the AIC and BIC.
The results of the comparison, as the noise percentage is varied, are reported in Figure 1. Each result reported in these plots is an average over 5 repetitions of the calculations.
As can be noted from inspection of Figure 1, apart from the cases with very low noise, the model including the redundant variable would always be chosen over the right model by the traditional AIC/BIC. Instead, the modified versions always succeed in identifying the right model.
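The structure of the power-law test can be sketched as follows. The specific correlation function for the redundant regressor (here x4 = x1²), the 10% noise level, and the sample size are illustrative assumptions, not the exact settings used for the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Three independent predictors and a power-law dependent variable (all exponents = 1).
x1, x2, x3 = rng.uniform(0.5, 2.0, size=(3, n))
y = x1 * x2 * x3
# Redundant fourth regressor, a deterministic function of x1 (illustrative choice).
x4 = x1 ** 2

# Add 10% relative Gaussian noise to every variable.
noisy = lambda v: v * (1 + 0.10 * rng.normal(size=n))
nx1, nx2, nx3, nx4 = noisy(x1), noisy(x2), noisy(x3), noisy(x4)
log_y = np.log(noisy(y))

def power_law_fit_mse(regressors):
    """Fit a power law by linear least squares in log space; return (MSE, #parameters)."""
    A = np.column_stack([np.log(r) for r in regressors] + [np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, log_y, rcond=None)
    return np.mean((A @ coef - log_y) ** 2), A.shape[1]

# Candidate models: the right one (3 predictors) and the redundant one (4 predictors).
mse_right, k_right = power_law_fit_mse([nx1, nx2, nx3])
mse_redund, k_redund = power_law_fit_mse([nx1, nx2, nx3, nx4])
# Traditional BIC comparison of the two candidates (lower is better).
bic_right = n * np.log(mse_right) + k_right * np.log(n)
bic_redund = n * np.log(mse_redund) + k_redund * np.log(n)
```

The extra regressor can only lower the raw MSE, which is exactly why a fixed complexity penalty can be outweighed in the presence of noise; the upgraded criteria add the WMRR term to counteract this effect.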
The analysis has then been performed for two other types of correlation functions for the redundant variable. The results of this analysis are also reported in Figure 1.
The same analysis has been repeated for polynomial and exponential types of functions; the functions used to generate the data and the forms used to fit them belong to the same respective families. Fitting these forms to the noised values of y, with and without the redundant predictor, and evaluating the traditional and the modified versions of the AIC and BIC, provides the results reported in Figure 2 and Figure 3. All the results shown have been obtained using a fixed number of data points.
One important thing to notice, in order to interpret the next figures, is that, without the redundant regressors, the MIBIC and MIAIC provide exactly the same results as the traditional AIC and BIC, proving the consistency of the newly devised indicators. On the other hand, if the redundant variables are added to the inputs, the traditional versions of the indicators always select the wrong model (they assume a lower value for it), whereas the upgraded versions are not misled (the new indicators always assume higher values than when the redundant quantities are not considered). In all the reported cases, except for small percentages of noise in the predictors, the traditional versions of the AIC and BIC are not able to identify the correct model, showing a tendency to select the models including redundant regressors. On the contrary, a significant improvement in the ability to detect the right model is achieved by the MIBIC, as well as by the MIAIC, which fail only in some cases in which the noise percentage is significant.
The effect of the noise in the dependent variable has also been evaluated, but the results of the analysis are not significantly different and the conclusions are the same as for the examples reported.
6. Conclusions
In applications to regression, the most widely used versions of the model selection criteria AIC and BIC are vulnerable to the presence of variables correlated with the actual predictors, particularly when the percentage of noise in the regressors is not negligible, as it is in most practical applications. To address this problem, an upgraded version of these criteria is proposed, adding an a priori Chi-squared probability distribution function of the models. This function depends on a quantity that penalizes models with highly correlated predictors, which bring little new information about the dependent variable. The performance of the proposed criteria has been assessed with different types of generative functions, correlation functions, and percentages of noise in the predictors. The results indicate that, in most cases, the newly defined criteria possess an improved capability of detecting redundancy in the predictors and thus of selecting the correct model. The improved performances are not substantially affected by the sample size, as reported in Appendix A. To show the generality of the obtained results, an application to an international database built by the thermonuclear fusion community has also been reported in Section 5.
With regard to future developments, from a methodological standpoint, it would be interesting to improve the treatment of the uncertainties in both the dependent and independent variables, implementing techniques inspired by the errors-in-variables approach [10]. Moreover, the introduction of metrics alternative to the Euclidean one, such as the geodesic distance [14,15,16], has the potential to provide significant added value. An additional interesting activity would be the systematic analysis of possible alternatives to the prior chosen for the present versions of the MIBIC and MIAIC. In terms of applications, the scaling laws of the more recent metallic Tokamaks, and particularly JET with the new ITER-like wall [17], are nowadays a topic of great interest in the fusion community. The new versions of the indicators could become quite useful in the investigation of scaling laws in non-power-law monomial form [18,19,20,21].