Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems

Spolladore, Luca; Gelfusa, Michela; Rossi, Riccardo; Murari, Andrea

doi:10.3390/e23091202

Open AccessArticle

Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems

¹

Department of Industrial Engineering, University of Rome “Tor Vergata”, Via Del Politecnico 1, 00133 Roma, Italy

²

Consorzio RFX (CNR, ENEA, INFN, Università di Padova, Acciaierie Venete SpA), Corso Stati Uniti 4, 35127 Padova, Italy

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(9), 1202; https://doi.org/10.3390/e23091202

Submission received: 9 August 2021 / Revised: 2 September 2021 / Accepted: 8 September 2021 / Published: 11 September 2021

Download

Browse Figures

Versions Notes

Abstract

:

Model selection criteria are widely used to identify the model that best represents the data among a set of potential candidates. Amidst the different model selection criteria, the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) are the most popular and better understood. In the derivation of these indicators, it was assumed that the model’s dependent variables have already been properly identified and that the entries are not affected by significant uncertainties. These are issues that can become quite serious when investigating complex systems, especially when variables are highly correlated and the measurement uncertainties associated with them are not negligible. More sophisticated versions of this criteria, capable of better detecting spurious relations between variables when non-negligible noise is present, are proposed in this paper. Their derivation is obtained starting from a Bayesian statistics framework and adding an a priori Chi-squared probability distribution function of the model, dependent on a specifically defined information theoretic quantity that takes into account the redundancy between the dependent variables. The performances of the proposed versions of these criteria are assessed through a series of systematic simulations, using synthetic data for various classes of functions and noise levels. The results show that the upgraded formulation of the criteria clearly outperforms the traditional ones in most of the cases reported.

Keywords:

model selection criteria; Akaike information criterion; Bayesian information criterion; overfitting; redundancy; variable selection; complexity; information theory; relevance

1. Introduction to Model Selection Criteria Based on Bayesian Statistics and Information Theory

Model Selection (MS) can be defined as the task of identifying the model best supported by the data, among a set of potential candidates [1]. In many fields, model selection is an essential part of scientific enquiry [2]. It can also be argued that this step is often among the most delicate in statistical inference.

The exact definition of what is meant by the best model is controversial and is probably application dependent [3]. Indeed, the requirements of models are not the same if the goal of the study is prediction, explanation, or control. In any case, basically all approaches to model selection try to find a compromise between goodness of fit and complexity. At the same level of goodness of fit, simpler models, implementing some form of Occam’s razor, are preferred. The goodness of fit is assessed with the likelihood or, when this is not possible, with some metric quantifying the residuals, the distance between the model predictions, and the data. The complexity of the models is identified with the number of the model parameters. In the following, attention will be focussed on MSC derived with the help of Bayesian statistics and information theory, since these are the ones explicitly designed to find a trade-off between goodness of fit and complexity. In any case, similar considerations apply also to frequentist types of techniques. A remark about the nomenclature is in place at this point. Since the application covered in the present work is regression, with the term database in the following, it is indicated as a finite ordered list of entries. Each entry consists of a sequence of elements formed by a dependent variable, y, and a series of p regressors or predictors, x_i.

The most widely accepted and best understood model selection criteria, based on information theory and Bayesian statistics, are the Akaike Information Criterion (AIC) [4] and the Bayesian Information Criterion (BIC) [5].

The theoretical derivations of these metrics result in the following unbiased forms of the criteria:

A I C = - 2 \ln (L) + 2 k

(1)

B I C = - 2 \ln (L) + k \ln (n)

(2)

where L is the likelihood of the model given the data, k the number of parameters in the model, and n the number of entries in the database (also called the sample size). Both AIC and BIC metrics are basically cost functions, which have to be minimized; they favour models with a high likelihood but implement a penalty for complexity (the term proportional to k).

Since in most applications, such as the ones discussed in this work, it is impossible to calculate the likelihood of the models, the metric adopted for the goodness of fit is the Euclidean distance of the residuals. Under the traditional assumption, that the data are identically distributed and independently sampled from a normal distribution, it can be demonstrated that the AIC can be written (up to an additive constant, which depends only on the number of entries in the database and not on the model) as follows:

A I C = n \cdot \ln (M S E) + 2 k

(3)

where MSE is the mean-squared error of the residuals, n the number of entries in the database, and k the number of parameters in the model. Similar assumptions allow expressing the BIC criterion as follows:

B I C = n \cdot \ln (σ_{(ϵ)}^{2}) + k \cdot \ln (n)

(4)

where

σ_{(ϵ)}^{2}

is the variance of the residuals, n is again the number of entries in the database, and k the number of parameters in the model. The derivation of these two criteria in the various approximation is fully covered in [6].

These two indicators, and all the others belonging to the same families, are cost functions to be minimised, in the sense that the better the model the lower their value. This can be intuitively appreciated by a simple inspection of their structure. The first term favours models that are closer to the data. The second addend is the penalty term for complexity.

In the last years, various upgrades of these criteria have been proposed. They are mainly meant at improving the goodness of fit, by utilising more sophisticated statistics than the simple MSE, and at devising more accurate estimates of the penalisation for complexity [7,8]. All these improvements have proved to be quite significant, but they do not consider explicitly the problems related to the choice of the regressors and the effects of the measurement uncertainties. They basically assume that the independent variables have already been properly identified without any specific provision for this aspect. Some of them deploy quite sophisticated statistical indicators of the distribution of the residuals, but they all take the measurements as given without any error bar. These are all issues which can be quite relevant when investigating complex systems. Typically, in the field of complexity, various quantities can be spuriously correlated with the dependent one, and measurements can be affected by significant uncertainties due to the poor accessibility of many systems. In this situation, as will be shown in the following, the performance of the traditional versions of the AIC and BIC are unsatisfactory, both being prone to include redundant variables in the selected models.

This work aims to provide an upgraded version of the traditional AIC and BIC criteria to alleviate the problems posed by quantities spuriously correlated with the actual predictors. These quantities tend to mislead the available versions of the indicators, inducing them to converge on models with an excessive number of non-relevant regressors. The situation is significantly worsened by the presence of significant levels of noise, which tend to blur the relations between the dependent quantities and the predictors, as shown in Section 4, which is devoted to the numerical tests. It should be mentioned that the vast majority, if not all, of the applications of model selection criteria involve experimental measurements, which are always affected by some form of noise. The capability of the proposed improvements of dealing with uncertainties is therefore an important aspect that needs to be assessed.

The paper is organized as follows. In the next section, the main information theoretic indicators used in the rest of the paper are reviewed. In Section 3, the derivation of the upgraded version of the AIC and BIC is covered. In Section 4, the performances of the upgraded criteria are evaluated through a series of systematic tests. In Section 5, an application of the derived criteria to a real-life database is reported. The conclusions of the paper are presented in the final section.

2. Brief Review of the Information Theoretic Indicators Relevant to the Upgrades of the Model Selection Criteria

The first information theoretic quantity [9], required to understand the improvements of the MSC proposed in this work, is the Mutual Information (MI) between two random variables, X and Y [9]:

M I (X_{i}, Y) = - \sum_{X} \sum_{Y} P_{X Y} \ln (\frac{P_{X Y}}{P_{X} P_{Y}})

(5)

where P_XY is the joint probability distribution function (pdf) of the random variables X and Y. Being fully nonlinear, contrary to the Pearson correlation coefficient, the MI is well suited to extract, from a given database, the best features, i.e., the best regressors, X_i, to reproduce the desired dependent variable Y.

The second important information theoretic indicator, used in the rest of the paper, is the concept of redundancy, RD, between a variable X_i and a set, S, of other variables, X_j:

R D (X_{i}, S) = \sum_{X_{j} \in S} M I (X_{i}, X_{j})

(6)

Mutual information and redundancy allow defining a quantity, called relevance RL, which quantifies the net contribution of a variable to reducing the uncertainty in a different one, Y, above what is already contributed by another set of quantities. Relevance is defined as

R L (X_{i}, Y) = M I (X_{i}, Y) - R D (X_{i}, S_{P S}) = M I (X_{i}, Y) - \sum_{X_{j} \in S_{P S}} M I (X_{i}, X_{j})

(7)

3. Derivation of the Upgraded Version of the BIC and AIC

In this section, the original versions of the BIC and AIC criteria are reviewed, and this provides an introduction to the derivation of the upgraded versions of the criteria. The BIC criterion is discussed first because it allows a more natural introduction of the proposed improvements.

3.1. Upgraded Version of the BIC

The Bayesian approach to model selection is based on the maximization of the posterior probability of a model

M_{i}

given the data

Y = y_{1}, \dots, y_{n}

. From the Bayes theorem, this posterior probability can be written as follows:

p (M_{i} | Y) = \frac{p (Y | M_{i}) \cdot p (M_{i})}{p (Y)}

(8)

where

p (Y | M_{i})

is the marginal likelihood of the Model

M_{i}

and can be evaluated as follows:

p (Y | M_{i}) = \int L (Y | M_{i}, θ_{i}) \cdot f (θ_{i} | M_{i}) d θ_{i}

(9)

where

θ_{i}

is the vector of the parameters of the model

M_{i}

and

f (θ_{i} | M_{i})

is the probability distribution of the parameters.

It can be demonstrated that for high

n

, and setting

f (θ_{i} | M_{i}) = 1

(uninformative prior), Equation (9) can be approximated with

p (Y | M_{i}) \approx L (Y | M_{i}, \hat{θ_{i}}) \cdot e^{- \frac{|\hat{θ_{i}}|}{2} \log n}

(10)

With

\hat{θ_{i}} = a r g m a x_{θ_{i} \in Θ} (L (Y | M_{i}, θ_{i}))

.

Substituting (10) in (8), we obtain the following:

p (M_{i} | Y) \propto L (Y | M_{i}, \hat{θ_{i}}) \cdot e^{- \frac{|\hat{θ_{i}}|}{2} \log n} \cdot p (M_{i})

(11)

If we set

p (M_{i}) = 1

, which implies considering all the models equally probable, (11) leads to the traditional definition of the BIC. Indeed, after taking the logarithm and simple mathematical manipulations, (11) becomes the following:

2 \cdot \log (p (M_{i} | Y)) \approx 2 \cdot \log (L (Y | M_{i}, \hat{θ_{i}})) - |\hat{θ_{i}}| \cdot \log n

(12)

The right-hand side of Equation (12) can be recognized as the BIC criterion estimate for the model

M_{i}

with an inverted sign. Indeed maximizing (12) is equivalent to minimizing:

B I C \approx - 2 \cdot \log (L (Y | M_{i}, \hat{θ_{i}})) + |\hat{θ_{i}}| \cdot \log n

(13)

In situations for which the relevant assumptions are valid, the likelihood can be replaced with the standard deviation of the residuals, with

|\hat{θ_{i}}|

as the number k of parameters in the model, allowing to recover Equation (4).

As shown in the following sections, when the redundancy between the regressors is not negligible, the traditional BIC criterion can fail to identify the right model, showing a tendency to include redundant variables in the selected solutions. To address this problem, a modified version of the BIC criterion is proposed, which, instead of assuming that the models have all the same probability, includes a penalty term for models with high redundancy in the predictor variables.

The proposed a priori probability distribution of the models depends on an overall quantity that we will indicate as

W M R R

(Weighted Mutual Regressor Relevance). Given a set of regressors

X_{1}, X_{2}, \dots X_{N}

and a dependent variable

Y

,

M R P_{X Y}

is defined as

W M R R = n \cdot \sum_{i = 1}^{N} \sum_{j = 1}^{N} M I (X_{i}, X_{j}) \cdot (1 - R L_{N} (X_{i}, Y)), f o r i \neq j;

(14)

where

M I (X_{i}, X_{j})

is the mutual information estimate between the

i

-th and

j

-th predictor variables and

R L_{N} (X_{i}, Y) = \frac{R L_{N} (X_{i}, Y)}{\max_{j} (R L_{N} (X_{i}, Y))}

is the relevance between the i-th predictor and the predicted variable normalized to the maximum value.

This quantity is higher for models which make use of predictors highly correlated between them and that at the same time have low relevance to the dependent variable.

Note that since

M I (X_{i}, X_{j}) \geq 0 \forall X_{i}, X_{j}

,

W M R R

is also positively define.

The proposed a priori models’ probability density function is a Chi-squared distribution function that can then be written as

p (M_{i}) = \frac{M I_{X Y}^{\frac{k}{2} - 1} \cdot e^{- \frac{W M R R}{2}}}{2^{\frac{k}{2}} \cdot Γ (\frac{k}{2})}, w i t h k = 2

(15)

where

Γ (\frac{k}{2}) = (1 - \frac{k}{2})!

. In this way, Models with

W M R R = 0

have the highest probability of being chosen, while models with greater

W M R R

are penalised.

Plugging (15) into (11) one obtains the following:

p (Y | M_{i}) \approx L (Y | M_{i}, \hat{θ_{i}}) \cdot e^{- \frac{|\hat{θ_{i}}|}{2} \log n} \cdot \frac{e^{- \frac{W M R R}{2}}}{2}

(16)

Which can be rewritten as

2 \cdot \log (p (Y | M_{i})) \approx 2 \cdot \log (L (Y | M_{i}, \hat{θ_{i}})) - |\hat{θ_{i}}| \cdot \log n - W M R R

(17)

Maximizing (17) is equivalent to minimizing

M I B I C = - 2 \cdot \log (L (Y | M_{i}, \hat{θ_{i}})) + |\hat{θ_{i}}| \cdot \log n + W M R R

(18)

If, as it is often the case, the likelihood is difficult or impossible to calculate, and the variables are identically distributed and independently sampled from a normal distribution, the MIBIC can be written in the practical form:

M I B I C = n \cdot \ln (σ_{(ϵ)}^{2}) + k \cdot \log n + W M R R

(19)

where as usual k indicates the number of the model’s parameters.

The choice of the prior, which is a delicate point in any Bayesian statistical treatment, deserves a comment. Since

W M R R

is positive definite, its probability distribution function should also be supported on semi-infinite intervals [0, ∞). Moreover, since the main idea behind the proposed improvement of the criterion hinges on penalizing models with strongly correlated variables, this pdf should reach its maximum value when

W M R R = 0

and decrease as

W M R R

increases. There are several pdf that satisfy these conditions, but the Chi-squared distribution with k = 2 is the most uninformative in the exponential family. Indeed, its implementation implies the simplest

W M R R

linear correction term in the upgraded BIC.

3.2. Upgraded Version of the AIC

The derivation of the AIC criteria is based on the concept of minimizing the Kullbach–Leibler divergence between the model generating the data and the fitted candidate model. Given the different derivation approach compared to the BIC, the formal addition of an a priori probability distribution function of the models is not possible. Nevertheless, since the AIC is also based on the assumption that the independent variables have already been properly identified and that the effects of the measurement uncertainties are negligible, it is reasonable to include a correction term also in the AIC, which can help in the model selection process when these assumptions are not met. As a consequence, in analogy with the already described MIBIC, the following upgraded version of the AIC, called MIAIC, is proposed:

M I A I C = n \cdot \ln (σ_{(ϵ)}^{2}) + 2 |\hat{θ_{i}}| + W M R R

(20)

It is worth mentioning that the same argument, leading to the same upgrade, is equally valid for the other indicators belonging to the AIC family, such as the c-AIC and the QAIC. Indeed, for the types of applications that are the subject of this work, these indicators can be expressed as the original AIC plus an additive term [6]. Consequently, perfectly analogue versions including the

W M R R

can be easily calculated and have proved to be at least equally effective.

4. Results of Systematic Tests with Synthetic Data

To evaluate the performance of the upgraded versions of the indicators developed in this work, a series of systematic tests have been performed. The main families of functions have been investigated: power laws, polynomials, exponentials, and combinations thereof.

Given the importance of the functional dependence and of the fact that the experimental case studied in the following belongs to this family, power laws are discussed first, which illustrates the methodology of the test in detail.

A synthetic dependent variable is generated from a set of predictor variables in the power-law form reported below:

Y = α_{0} \cdot X_{1}^{α_{1}} \cdot X_{2}^{α_{2}} \dots \cdot X_{N}^{α_{N}}

(21)

The predicted variable

Y

is generated with Equation (21) using 3 uncorrelated random predictor variables,

X_{1}, X_{2}, X_{3}

, from the

N (μ = 10, σ_{N} = 1)

distribution. The coefficients in (21) are all set equal to 1, and the number of data points generated is

n = 5000

.

A fourth correlated predictor variable is added to the set of possible regressors of

y

in the form reported below:

X_{4} = X_{1} + N ((μ = 0, σ_{N} = 0.3 \cdot s t d (X_{1})))

(22)

Then, a normally distributed noise

N (μ = 0, σ_{N} = \frac{n o i s e %}{100} \cdot s t d (X_{1})); f o r i = 1, \dots, 4

is added to all predictors. The parameter

n o i s e %

is the percentage of noise with respect to the standard deviation of the regressor and is varied between

1 % and 30 %

. A noise of the same type

N (μ = 0, σ_{N} = 0.1 \cdot s t d (Y))

is added to the independent variable

Y

.

After generating the variables and adding the noise, two models of the predicted variable fitting (20) to the noised values of

Y

have been obtained: The first using all the four noised predictors available,

X_{1}, X_{2}, X_{3}, X_{4}

, and the second using only the noised predictors

X_{1}, X_{2}, X_{3},

used to build

y

.

The two obtained models are compared using both the standard and the modified version of the AIC and BIC.

The results of the comparison varying the parameter

n o i s e %

are reported in Figure 1. Each result reported in these plots is an average of over 5 repetitions of the calculations.

As can be noted from inspection of Figure 1, apart from the cases with very low noise, the model obtained, including the redundant variable, would always be chosen over the right model by the traditional AIC/BIC. Instead, the modified versions always succeed in identifying the right model.

The analysis has then been performed for other two types of correlation functions for the redundant variables:

X_{4} = X_{1} + X_{2} + N (μ = 0, σ = σ_{N} = 0.3 \cdot s t d (X_{1} + X_{2}))

(23)

X_{4} = X_{1} \cdot X_{2} + N (μ = 0, σ = σ_{N} = 0.3 \cdot s t d (X_{1} \cdot X_{2}))

(24)

The results of the analysis are also reported in Figure 1.

The same analysis has been repeated for polynomial and exponential types of functions. The functions used to generate the data are the following, respectively:

Y = X_{1} + X_{2} + X_{3} + X_{1}^{2} + X_{2}^{2} + X_{3}^{2}

(25)

Y = X_{1} \cdot X_{2} \cdot X_{3} \cdot e^{X_{1} + X_{2} + X_{3}}

(26)

While the functions used to fit the data are respectively of the form

Y = α_{0} + α_{1} X_{1} + α_{2} X_{2} + \dots + α_{N} x_{N} + α_{1} x_{1}^{2} + α_{2} x_{2}^{2} + \dots + α_{N} x_{N}^{2}

(27)

Y = α_{0} \cdot X_{1} \cdot X_{2} \dots \cdot X_{N} \cdot e^{α_{1} X_{1} + α_{2} X_{2} + \dots + α_{N} X_{N}}

(28)

Fitting (27) and (28) to the noised values of

Y

with and without the redundant predictor and evaluating the traditional and the modified version of AIC and BIC, provides the results reported in Figure 2 and Figure 3. All the results shown have been obtained using

n = 5000

data points.

One important thing to notice in order to interpret the next figures is that without the redundant regressors, MIBIC and MIAIC provide exactly the same results as the traditional AIC and BIC, proving the consistency of the devised new indicators. On the other hand, if the redundant variables are added to the inputs, the traditional versions of the indicators would always select the wrong model (they assume a lower value), whereas the upgraded versions are not misled (the new indicators assume always higher values than when the redundant quantities are not considered).

In all the reported cases, except for small percentages of noise in the predictors, the traditional version of AIC and BIC are not able to identify the correct model, showing a tendency to select the models including redundant regressors. On the contrary, a significant improvement in the ability to detect the right model is achieved by the MIBIC as well as by MIAIC, which fails only in some cases when the noise percentage is significant.

The effect of the noise in the dependent variable has also been evaluated, but the results of the analysis are not significantly different and the conclusions are the same as for the examples reported.

5. Application to a Real-Life Database

In this section, to prove the generality of the results obtained with the upgraded criteria described in the previous sections, an application to a real-life database has been considered. The analysed database is called the ITPA database, which is the most advanced Multi-machine database built to support studies of plasma confinement in Tokamaks [10]. A description of this database is given in the next subsection. The results obtained applying the criteria developed in the present work to this database are reported in the following subsection.

5.1. The ITPA Database of the Energy Confinement Time for the H Mode

One of the most crucial quantities to assess the relevance of a nuclear fusion reactor is the so-called energy confinement time 𝜏𝐸, which quantifies how fast the internal energy of the plasma is lost [11,12,13]. Unfortunately, the transport mechanisms affecting the energy confinement in high-temperature plasmas are very complex and nonlinear, including effects at many scales. So, even if the understanding of the instabilities and turbulence effects influencing transport has progressed a lot in the last years, a theoretical or numerical solution for the proper estimation of the energy confinement time,

τ_{E}

, remains unfeasible. As a consequence, this problem has been approached empirically with the extraction of robust scaling laws for 𝜏𝐸 from experimental data. This led to the construction of several multi-machine databases for the plasma confinement time, including the ITPA database analysed in this paper. In particular, the DB3v13f version of the ITPA with the same selection rules reported in [10] is the one used in the following analysis.

The variables that are known to be relevant for the estimation of the confinement time and that will be taken into consideration in this work are 𝐼𝑝, 𝐵𝑇, 𝑃𝐿𝑇𝐻, 𝑛𝑒𝑙, 𝑀𝑒𝑓𝑓, 𝑅𝐺𝐸𝑂, 𝜖, and 𝑘𝑎, where 𝐼𝑝 is the plasma current, 𝐵𝑇 is the toroidal magnetic field, 𝑃𝐿𝑇𝐻 is the power loss across the last closed surface, 𝑛𝑒𝑙 is the line average electron density, 𝑀𝑒𝑓𝑓 is the plasma isotopic composition, 𝑅𝐺𝐸𝑂 is the plasma major radius,

ϵ = \frac{a}{R_{G E O}}

where 𝑎 is the plasma minor radius, and 𝑘𝑎 is the volume measure of elongation [10]. Indeed, these variables are the ones used in the most widely accepted scaling law for the Tokamak energy confinement time in H mode, called the IPB98(y,2):

τ_{E} = 5.62 \cdot 10^{- 2} \cdot I_{p}^{0.93} \cdot B T^{0.15} \cdot n_{e}^{0.41} \cdot P^{- 0.69} \cdot R^{1.97} \cdot k_{a}^{0.78} \cdot ϵ^{0.58} \cdot M_{e f f}^{0.19}

(29)

Due to the physical constraints and the fact that each machine is optimized to work within specific parameter ranges, the degree of correlation of the mentioned regressors is quite high, as shown in Table 1.

Moreover, the regressors, as well as the confinement time, are affected by significant uncertainties, as shown in Table 2.

5.2. Results

Employing the upgraded model selection criteria proposed in this work, the main objective of the analysis consists of identifying, within all the possible power-law models obtained combining the predictor variables included in Equation (29), the one which best represents the

τ_{E}

data.

In order to do this, the following iterative procedure has been adopted:

The first step consists of evaluating the MIAIC and MIBIC for the power-law model with all the eight regressors included. Then, removing one variable at a time from the list of regressors, eight more models are obtained and their MIAIC and MIBIC evaluated.

The model with the lowest MIAIC/MIBIC is identified, and the regressors included in the model will form the new list of possible regressors. The process is then iterated eliminating one variable at the time, until removing any of the variables included in the list does not produce any benefit in terms of MIAIC and MIBIC. At this point, the algorithm is topped and the best model is retained

Applying this procedure, the model which shows itself to be the best in terms of both MIAIC and MIBIC is

τ_{E} = α_{0} \cdot I_{p}^{α_{1}} \cdot P^{α_{2}} \cdot R^{α_{3}} \cdot k_{a}^{α_{4}} \cdot ϵ^{α_{5}} \cdot M_{e f f}^{α_{6}}

(30)

Instead, using the traditional AIC and BIC criteria, the resultant models are, respectively,

τ_{E} = α_{0} \cdot I_{p}^{α_{1}} \cdot B T^{α_{2}} \cdot n_{e}^{α_{3}} \cdot P^{α_{4}} \cdot R^{α_{5}} \cdot k_{a}^{α_{6}} \cdot ϵ^{α_{7}} \cdot M_{e f f}^{α_{8}}

(31)

τ_{E} = α_{0} \cdot I_{p}^{α_{1}} \cdot B T^{α_{2}} \cdot n_{e}^{α_{3}} \cdot P^{α_{4}} \cdot R^{α_{5}} \cdot k_{a}^{α_{6}} \cdot ϵ^{α_{7}}

(32)

The first obvious advantage of the upgraded versions of the criteria is that they provide coherent results, whereas the traditional versions of the indicators do not seem to agree on a single model, rendering the choice of the most appropriate scaling law very difficult. More importantly, the model obtained with MIAIC and MIBIC is more parsimonious, and indeed, it utilises fewer quantities than the ones derived by the AIC and IC. It retains the plasma current but considers redundant magnetic field and plasma density. This is coherent with the statistical analysis of the database, which presents very strong collinearities between these three quantities, as reported in Table 1. The obtained results are also in harmony with the everyday experience of the device operators, since the experiments are indeed typically designed with strong correlations between plasma parameters.

6. Conclusions

In applications to regression, the most widely used versions of the model selection criteria AIC and BIC are vulnerable to the presence of variables correlated to the actual predictors, particularly when the percentage of noise in the regressors is not negligible, as it is in most practical applications. To address this problem, an upgraded version of these criteria is proposed, adding an a priori Chi-squared probability distribution function of the models. This function depends on a quantity that penalizes the model with highly correlated predictors, which bring little new information about the dependent variable. The performance of the proposed criteria has been assessed with different types of generative functions, correlation functions and percentage of noise in the predictors. The results indicate that, in most cases, the newly defined criteria possess an improved capability of detecting redundancy in the predictors and thus of selecting the correct model. The improved performances are not substantially affected by the sample size as reported in Appendix A. To show the generality of the obtained results, an application to an international database built by the thermonuclear fusion community has also been reported in the final section.

With regard to future developments, from a methodological standpoint, it would be interesting to improve the treatment of the uncertainties in both the dependent and independent variables, implementing techniques inspired by the error in the variable approach [10]. Moreover, the introduction of metric alternatives to the Euclidean, such as the geodesic distance [14,15,16], has the potential to provide significant added value. An additional interesting activity would be the systematic analysis of possible prior alternatives to the one chosen for the present version of MIBIC and MIAIC. In terms of applications, the scaling laws of the more recent metallic Tokamaks, and particularly JET with the new ITER-like wall [17], are nowadays a topic of great interest in the fusion community. The new versions of the indicators could become quite useful in the investigation of scaling laws in non-power law monomial form [18,19,20,21].

Author Contributions

Conceptualization, L.S. and A.M.; methodology, L.S. and A.M.; software, L.S.; validation, A.M. and R.R.; formal analysis, L.S.; writing—original draft preparation, L.S. and A.M.; writing—review and editing, R.R. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Appendix A. MIBIC and MIAC Performance with Sample Size

The Appendix A contains the analysis of the effect of the number of samples on the performance of the proposed upgraded criteria. In particular, the same tests reported in Section 4 have been repeated for two different sample sizes

n = 500

and

n = 50,000

. The results of the analysis for the different generative functions and number of entries are reported in Figure A1 and Figure A2. For simplicity, only the results for the correlation function reported in (24) have been included in the figures. As it can be noted by visual insèecion of the plots reported in this Appendix A, the performance’s improvement associated with the proposed criteria is not affected by the sample size. This is a general result: the MIBIC and MIAIC perform better than the traditional versions independently from the number of entries in the datbases. These results can be also extended to the other correlation functions considered in Section 4, even though the related figures have not been explicitly reported in the paper.

Figure A1. BIC, BICred, MIBIC, MIBICred for n = 500 and different generative functions vs. the percentage of noise in the predictors. The subscript red indicates the models using the redundant regressors.

Figure A2. BIC, BICred, MIBIC, MIBICred for n = 50000 and different generative functions vs. the percentage of noise in the predictors. The subscript red indicates the models using the redundant regressors.

References

Bailly, F.; Longo, G. Mathematics and the Natural Sciences; Imperial College Press: London, UK, 2011. [Google Scholar]
D’Espargnat, B. On Physics and Philosophy; Princeton University Press: Oxford, MS, USA, 2002. [Google Scholar]
Claeskens, G. Statistical model choice. Annu. Rev. Stat. Its Appl. 2016, 3, 233–256. [Google Scholar] [CrossRef]
Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
Schwarz Gideon, E. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
Kenneth, P.B.; Anderson, D.R. Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach, 2nd ed.; Springer: Berlin, Germany, 2002. [Google Scholar]
Murari, A.; Peluso, E.; Cianfrani, F.; Gaudio, P.; Lungaroni, M. On the Use of Entropy to Improve Model Selection Criteria. Entropy 2019, 21, 394. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Rossi, R.; Murari, A.; Gaudio, P.; Gelfusa, M. Upgrading Model Selection Criteria with Goodness of Fit Tests for Practical Applications. Entropy 2020, 22, 447. [Google Scholar] [CrossRef] [PubMed] [Green Version]
MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
McDonald, D.; Cordey, J.; Righi, E.; Ryter, F.; Saibene, G.; Sartori, R.; Alper, B.; Becoulet, M.; Brzozowski, J.; Coffey, I.; et al. ELMy H-modes in JET helium-4 plasmas. Plasma Phys. Control Fusion 2004, 46, 519–534. [Google Scholar] [CrossRef]
Wesson, J. Tokamaks, 3rd ed.; Clarendon Press: Oxford, UK, 2004. [Google Scholar]
Romanelli, F.; Laxåback, M. Overview of JET results. Nucl. Fusion 2009, 49, 104006. [Google Scholar] [CrossRef]
Ongena, J.; Monier-Garbet, P.; Suttrop, W.; Andrew, P.; Bécoulet, M.; Budny, R.; Corre, Y.; Cordey, G.; Dumortier, P.; Eich, T. Towards the realization on JET of an integrated H-mode scenario for ITER. Nucl. Fusion 2004, 44, 124–133. [Google Scholar] [CrossRef]
Craciunescu, T.; Murari, A. Geodesic distance on Gaussian manifolds for the robust identification of chaotic systems. Nonlinear Dyn. 2016, 86, 677–693. [Google Scholar] [CrossRef]
Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
Murari, A.; Boutot, P.; Vega, J.; Gelfusa, M.; Moreno, R.; Verdoolaege, G.; de Vries, P.C.; JET-EFDA Contributors. Clustering based on the geodesic distance on Gaussian manifolds for the automatic classification of disruption. Nuclear Fusion 2013, 53, 033006. [Google Scholar] [CrossRef]
Pamela, J.; Romanelli, F.; Watkins, M.L.; Lioure, A.; Matthews, G.; Philipps, V.; Jones, T.; Murari, A.; Géraud, A.; Crisanti, F.; et al. The JET programme in support of ITER. Fusion Eng. Des. 2007, 82, 590–602. [Google Scholar] [CrossRef]
Murari, A.; Lupelli, I.; Gelfusa, M.; Gaudio, P. Non-power law scaling for access to the H-mode in tokamaks via symbolic regression. Nucl. Fusion 2013, 53, 043001. [Google Scholar] [CrossRef]
Murari, A.; Lupelli, I.; Gelfusa, M.; Peluso, E. Symbolic regression via genetic programming for data driven derivation of confinement scaling laws without any assumption on their mathematical form. Plasma Phys. Control Fusion 2015, 57, 014008. [Google Scholar] [CrossRef]
Murari, A.; Lupelli, I.; Gelfusa, M.; Peluso, E.; Lungaroni, M. Application of symbolic regression to the derivation of scaling laws for tokamak energy confinement time in terms of dimensionless quantities. Nucl. Fusion 2015, 56, 26005. [Google Scholar] [CrossRef]
Murari, A.; Lupelli, I.; Gelfusa, M.; Gaudio, P.; Vega, J. A statistical methodology to derive the scaling law for the H-mode power threshold using a large multi-machine database. Nucl. Fusion 2012, 52, 063016. [Google Scholar] [CrossRef] [Green Version]

Figure 1. BIC, BICred, MIBIC, and MIBICred for the power-law generative function and different correlations as a function of the percentage of noise in the predictors. The subscript red indicates the models using the redundant regressors.

Figure 2. BIC, BICred, MIBIC, and MIBICred for the polynomial generative function and different correlations as a function of the percentage of noise in the predictors. The subscript red indicates the models using the redundant regressors.

Figure 3. BIC, BICred, MIBIC, and MIBICred for the exponential generative function and different correlations as a function of the percentage of noise in the predictors. The subscript red indicates the models using the redundant regressors.

Table 1. Person correlation coefficient matrix for the eight regressors used to model the energy confinement time.

	$ϵ$	$M_{e f f}$	$R_{G E O}$	$k_{a}$	$B_{T}$	$I_{P}$	$n_{e}$	$P_{L T H}$
$ϵ$	1.00	0.29	0.11	0.41	−0.02	0.41	0.10	0.26
$M_{e f f}$	0.29	1.00	0.30	0.42	0.17	0.38	0.16	0.36
$R_{G E O}$	0.11	0.30	1.00	0.30	0.09	0.73	−0.36	0.67
$k_{a}$	0.41	0.42	0.30	1.00	−0.07	0.48	0.15	0.42
$B_{T}$	−0.02	0.17	0.09	−0.07	1.00	0.43	0.52	0.34
$I_{P}$	0.41	0.38	0.73	0.48	0.43	1.00	0.00	0.76
$n_{e}$	0.10	0.16	−0.36	0.15	0.52	0.00	1.00	0.03
$P_{L T H}$	0.26	0.36	0.67	0.42	0.34	0.76	0.03	1.00

Table 2. Lower bounds of the uncertainties for the entries of the ITPA database.

	$ϵ$	$M_{e f f}$	$R_{G E O}$	$k_{a}$	$B_{T}$	$I_{P}$	$n_{e}$	$P_{L T H}$	$τ_{E}$
Rel. err.	1%	8%	1%	10%	1%	1%	5%	14%	10%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Spolladore, L.; Gelfusa, M.; Rossi, R.; Murari, A. Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems. Entropy 2021, 23, 1202. https://doi.org/10.3390/e23091202

AMA Style

Spolladore L, Gelfusa M, Rossi R, Murari A. Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems. Entropy. 2021; 23(9):1202. https://doi.org/10.3390/e23091202

Chicago/Turabian Style

Spolladore, Luca, Michela Gelfusa, Riccardo Rossi, and Andrea Murari. 2021. "Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems" Entropy 23, no. 9: 1202. https://doi.org/10.3390/e23091202

APA Style

Spolladore, L., Gelfusa, M., Rossi, R., & Murari, A. (2021). Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems. Entropy, 23(9), 1202. https://doi.org/10.3390/e23091202

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Improved Treatment of the Independent Variables for the Deployment of Model Selection Criteria in the Analysis of Complex Systems

Abstract

1. Introduction to Model Selection Criteria Based on Bayesian Statistics and Information Theory

2. Brief Review of the Information Theoretic Indicators Relevant to the Upgrades of the Model Selection Criteria

3. Derivation of the Upgraded Version of the BIC and AIC

3.1. Upgraded Version of the BIC

3.2. Upgraded Version of the AIC

4. Results of Systematic Tests with Synthetic Data

5. Application to a Real-Life Database

5.1. The ITPA Database of the Energy Confinement Time for the H Mode

5.2. Results

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. MIBIC and MIAC Performance with Sample Size

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI