1. Introduction
We have been working since 2015 on the problem of testing the alignment of protein domain families proposed by expert biologists and bioinformaticians. We have found that selected entropy measures are well suited to testing the results published by these professionals, and that they lend themselves to a rigorous ANOVA statistical analysis [1]. In order to reduce the search space for admissible values of entropy measures, we have emphasized the need to work in the region of strict concavity of these entropies. This study was undertaken in a previous work, and we present in Section 2 a summary of those developments. In the present work, we aim to complement the results of a previous publication [2]; a subsequent restriction of the parameter space has to be performed in order to guarantee the synergy of the probability distributions to be tested. Non-synergetic distributions are not worth studying, because they do not preserve the fundamental property that more information about amino acids is obtained from $t$-sets of columns than by summing up the information obtained from individual columns. In Section 3, a brief digression introduces the Sharma–Mittal class of entropy measures. Section 4 emphasizes the synergy aspects of the distributions and their consequences for the reduction of the parameter space of Sharma–Mittal entropies. In Section 5, we analyse the maximal extension of the parameter space and repeat the reduction process imposed by the requirement of fully synergetic distributions of Section 4. We conclude the paper in Section 6 by studying the relation between Hölder and generalized Khinchin–Shannon (GKS) inequalities.
2. The Construction of the Probabilistic Space
Let us consider a set of domains (rows) from a chosen family of protein domains. In order to associate a rectangular array with this family, to be taken as its representative in the probabilistic space we are constructing, we specify its number of columns as $n$. This means that we disregard all rows whose number $n_{aa}$ of amino acids satisfies $n_{aa} < n$ and preserve the $\tilde{m}$ rows whose number of amino acids satisfies $n_{aa} \ge n$, but disregard $(n_{aa} - n)$ amino acids in these rows. We then choose $m$ rows from among the $\tilde{m}$ preserved rows to obtain $m \times n$ rectangular arrays. There are $\binom{\tilde{m}}{m}$ of these rectangular arrays. Any one of them can be used as a representative of the domain family to be analysed in the statistical procedure to be implemented.
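As a concrete illustration of this construction, the following Python sketch (toy data; all names are ours, not from the original) selects the eligible rows and counts the possible $m \times n$ representative arrays:

```python
import itertools
import math

# Toy domain family: each string is one row (a protein domain sequence).
sequences = ["ACDEFGHIK", "ACDEFGHIKL", "ACDFGHIKLM", "ACDE", "GCDEFGHIKLM"]
n = 9  # chosen number of columns

# Disregard rows shorter than n; truncate the others to n amino acids
# (here we simply drop trailing residues).
eligible = [seq[:n] for seq in sequences if len(seq) >= n]

m = 3  # number of rows of the representative array
print(math.comb(len(eligible), m), "possible m x n arrays")

# Any one of the combinations can serve as the representative array.
representative = next(itertools.combinations(eligible, m))
```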
The next step is to assign a joint probability of occurrence of a set of variables $a_{j_1}, \ldots, a_{j_t}$ in columns $j_1, \ldots, j_t$, to be given by
$$p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = \frac{n_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})}{m}, \qquad (1)$$
where $n_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})$ stands for the number of occurrences of the set $(a_{j_1}, \ldots, a_{j_t})$ in the $t$ columns of the subarray $(j_1, \ldots, j_t)$ of the representative $m \times n$ array ($t \le n$). The symbols $a_{j_1}, \ldots, a_{j_t}$ will be running over the letters of the one-letter code for the twenty amino acids: {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. We then have
$$\sum_{a_{j_1}} \cdots \sum_{a_{j_t}} p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = 1. \qquad (2)$$
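A minimal sketch of Equations (1) and (2), assuming a toy alignment and an arbitrary choice of columns:

```python
from collections import Counter

# Toy m x n representative array: m rows of one-letter amino acid codes.
alignment = ["ACDA", "ACDC", "GCDA", "ACEA", "GCDA"]

def joint_probabilities(alignment, columns):
    """Relative frequencies of the t-sets of amino acids read off the
    chosen columns, Equation (1)."""
    counts = Counter(tuple(row[j] for j in columns) for row in alignment)
    m = len(alignment)
    return {tset: c / m for tset, c in counts.items()}

p = joint_probabilities(alignment, columns=(0, 2, 3))
assert abs(sum(p.values()) - 1.0) < 1e-12  # normalization, Equation (2)
```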
We also introduce the conditional probabilities of occurrence, which are given implicitly by
$$p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\, p_{j_t}(a_{j_t}), \qquad (3)$$
where $p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})$ is the probability of occurrence of the amino acids in the columns $j_1, \ldots, j_{t-1}$, if the distribution of amino acids in the $j_t$-th column is known a priori.
Bayes' law for probabilities of occurrence [2,3] can be written as
$$p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\, p_{j_t}(a_{j_t}) = p(a_{j_t} \mid a_{j_1}, \ldots, a_{j_{t-1}})\, p_{j_1 \ldots j_{t-1}}(a_{j_1}, \ldots, a_{j_{t-1}}). \qquad (4)$$
The equalities of the right-side members correspond to the application of Bayes' law [2,3]. The symmetries of the joint probability distribution $p_{j_1 \ldots j_t}$ are due to the ordering of the columns adopted for the distributions of amino acids.
From the ordering $j_1 < j_2 < \ldots < j_t$, the values assumed by the variables $j_1, \ldots, j_t$ are respectively given by
$$j_1 = 1, \ldots, n - t + 1; \quad j_2 = j_1 + 1, \ldots, n - t + 2; \quad \ldots; \quad j_t = j_{t-1} + 1, \ldots, n. \qquad (5)$$
We then have $\binom{n}{t}$ geometric objects $p_{j_1 \ldots j_t}$ of $t$ columns and $20^t$ components each.
3. The Sharma–Mittal Class of Entropy Measures
As emphasized in Ref. [2], the introduction of random variable functions such as entropy measures associated with the probabilities of occurrence is suitable for analysing the evolution of these probabilities through the regions of the parameter space of entropies. The class of Sharma–Mittal entropy measures seems to be particularly well adapted to this task when related to the occurrence of amino acids in the objects $p_{j_1 \ldots j_t}$. The thermodynamic interpretation of the notion of entropy greatly helps to classify the distribution of its values associated with protein domain databases and to interpret its evolution through the Fokker–Planck equations to be treated in forthcoming articles in this line of research.
The two-parameter Sharma–Mittal class of entropy measures is usually given by
$$(SM)_{rs}(p_{j_1 \ldots j_t}) = \frac{1}{1-s}\left[\left(\chi_{j_1 \ldots j_t}\right)^{\frac{1-s}{1-r}} - 1\right], \qquad (6)$$
where
$$\chi_{j_1 \ldots j_t} = \sum_{a_{j_1}} \cdots \sum_{a_{j_t}} \left(p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})\right)^{r}. \qquad (7)$$
The parameters $r$, $s$ must bound a region corresponding to a strict concavity in the parameter space. A necessary requirement to be satisfied [3] is
$$\frac{r(r-s)}{(1-r)^2}\, P_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) < 1, \qquad (8)$$
where $P_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})$ stands for the escort probability associated with the joint probability $p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})$, or,
$$P_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = \frac{\left(p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})\right)^{r}}{\chi_{j_1 \ldots j_t}}. \qquad (9)$$
Equation (8) leads to
$$0 < r \le 1, \qquad s \ge r. \qquad (10)$$
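A minimal Python sketch of Equations (6), (7) and (9) (numpy assumed; the distribution and parameter values are illustrative, with $r \neq 1$, $s \neq 1$):

```python
import numpy as np

def chi(p, r):
    """chi = sum of p_j^r over the components of the distribution, Equation (7)."""
    p = np.asarray(p, dtype=float)
    return np.sum(p[p > 0] ** r)

def sharma_mittal(p, r, s):
    """Two-parameter Sharma-Mittal entropy, Equation (6); requires r != 1, s != 1."""
    return (chi(p, r) ** ((1.0 - s) / (1.0 - r)) - 1.0) / (1.0 - s)

def escort(p, r):
    """Escort probabilities P_j = p_j^r / chi, Equation (9)."""
    p = np.asarray(p, dtype=float)
    return p ** r / chi(p, r)

p = np.array([0.5, 0.25, 0.125, 0.125])
print(sharma_mittal(p, r=0.5, s=1.2))
print(escort(p, r=0.5))  # sums to 1 by construction
```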
Some special cases of one-parameter entropies are commonplace in the scientific literature [3,4,5,6,7,8,9]:

The $s = r$ region is the domain of the Havrda–Charvat [6] entropy measure, $(HC)_s$,
$$(HC)_s = \frac{1}{1-s}\left[\sum_{a_{j_1}} \cdots \sum_{a_{j_t}} \left(p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})\right)^{s} - 1\right]. \qquad (11)$$
The $s = 2 - r$, $0 < r \le 1$, region will stand for the domain of the Landsberg–Vedral [7] entropy measure, $(LV)_r$,
$$(LV)_r = \frac{1}{1-r}\left[\frac{\chi_{j_1 \ldots j_t} - 1}{\chi_{j_1 \ldots j_t}}\right]. \qquad (12)$$
The Rényi $R_r$ [8] and the “non-extensive” Gaussian [9] $G_s$ entropy measures are obtained from limit processes:
$$R_r = \lim_{s \to 1} (SM)_{rs}, \qquad (13)$$
$$G_s = \lim_{r \to 1} (SM)_{rs}. \qquad (14)$$
After using the definition of $\chi_{j_1 \ldots j_t}$, Equation (7), and the joint probabilities of occurrence from Equations (1) and (2), we get:
$$R_r = \frac{\ln \chi_{j_1 \ldots j_t}}{1-r}, \qquad G_s = \frac{1}{1-s}\left[e^{(1-s)\, S_{j_1 \ldots j_t}} - 1\right],$$
where $S_{j_1 \ldots j_t}$ is the Gibbs–Shannon entropy measure
$$S_{j_1 \ldots j_t} = -\sum_{a_{j_1}} \cdots \sum_{a_{j_t}} p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) \ln p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}). \qquad (15)$$
The Gibbs–Shannon entropy measure, Equation (15), is also obtained by taking the convenient limits of the special cases of Sharma–Mittal entropies, Equations (11)–(14):
$$S_{j_1 \ldots j_t} = \lim_{s \to 1} (HC)_s = \lim_{r \to 1} (LV)_r = \lim_{r \to 1} R_r = \lim_{s \to 1} G_s. \qquad (16)$$
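The limits of Equation (16) can be checked numerically. A sketch, approaching $(r, s) \to (1, 1)$ along the Havrda–Charvat direction ($s = r$), the Landsberg–Vedral direction ($s = 2 - r$), and a generic direction:

```python
import numpy as np

def sharma_mittal(p, r, s):
    chi = np.sum(np.asarray(p, dtype=float) ** r)
    return (chi ** ((1.0 - s) / (1.0 - r)) - 1.0) / (1.0 - s)

def gibbs_shannon(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.25, 0.125, 0.125])
eps = 1e-6
for r_, s_ in [(1 - eps, 1 - eps), (1 - eps, 1 + eps), (1 - 2 * eps, 1 - eps)]:
    print(sharma_mittal(p, r_, s_))  # all approach the Gibbs-Shannon value
print(gibbs_shannon(p))
```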
We shall analyse in the next section the structure of the two-parameter space of Sharma–Mittal entropy by taking into consideration these special cases.
We now recall that, for the limit of Gibbs–Shannon entropy, a conditional entropy measure is defined [3] by
$$S(X_{j_1}, \ldots, X_{j_{t-1}} \mid X_{j_t}) = -\sum_{a_{j_t}} p_{j_t}(a_{j_t}) \sum_{a_{j_1}} \cdots \sum_{a_{j_{t-1}}} p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t}) \ln p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t}). \qquad (17)$$
We then have, analogously, for the conditional Sharma–Mittal entropy measure [3],
$$(SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}} \mid X_{j_t}) = \sum_{a_{j_t}} P_{j_t}(a_{j_t})\, (SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}} \mid a_{j_t}), \qquad (18)$$
where $P_{j_t}(a_{j_t})$ is the escort probability of Equation (9). It is easy to show by trivial calculation that, analogously to Equation (16), we will have
$$S(X_{j_1}, \ldots, X_{j_{t-1}} \mid X_{j_t}) = \lim_{r,\, s \to 1} (SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}} \mid X_{j_t}). \qquad (19)$$
From Equations (6), (7) and (18) and the application of Bayes' law, Equation (4), we can write
$$\chi_{j_1 \ldots j_t} = \sum_{a_{j_t}} \left(p_{j_t}(a_{j_t})\right)^{r} \chi_{j_1 \ldots j_{t-1} \mid a_{j_t}}, \qquad (20)$$
where $\chi_{j_1 \ldots j_{t-1} \mid a_{j_t}}$ is obtained from Equation (7) with the conditional probabilities $p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})$.
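Equation (20) follows from raising Bayes' law, Equation (4), to the power $r$ and summing. A numerical sanity check on a toy alignment (column choices are illustrative):

```python
from collections import Counter

alignment = ["ACDA", "ACDC", "GCDA", "ACEA", "GCDA"]
r = 0.5
cols, last = (0, 2), 3  # columns j_1, ..., j_{t-1} and the column j_t

def chi_of(items):
    """chi of the empirical distribution of the given items, Equation (7)."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum((c / total) ** r for c in counts.values())

# Left-hand side: chi of the joint distribution on (j_1, ..., j_t).
lhs = chi_of(tuple(row[j] for j in cols + (last,)) for row in alignment)

# Right-hand side: sum over a_{j_t} of p_{j_t}(a)^r times the chi of the
# conditional distribution on (j_1, ..., j_{t-1}) given a.
m = len(alignment)
rhs = 0.0
for a in set(row[last] for row in alignment):
    rows_a = [row for row in alignment if row[last] == a]
    rhs += (len(rows_a) / m) ** r * chi_of(tuple(row[j] for j in cols) for row in rows_a)

assert abs(lhs - rhs) < 1e-12
```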
4. Aspects of Synergy and the Reduction of the Parameter Space for Fully Synergetic Distributions
For the Gibbs–Shannon entropy measure, the inequality written by A. Y. Khinchin [3,10] is
$$S(X_{j_1}, \ldots, X_{j_{t-1}} \mid X_{j_t}) \le S(X_{j_1}, \ldots, X_{j_{t-1}}). \qquad (21)$$
This inequality would be described by Khinchin as: “On the average, the knowledge a priori of the distribution on the column $j_t$ can only decrease the uncertainty of the distribution on the columns $j_1, \ldots, j_{t-1}$”. We can write an analogous inequality for the Sharma–Mittal class of entropies,
$$(SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}} \mid X_{j_t}) \le (SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}}). \qquad (22)$$
We then get from Equations (20) and (22)
$$(SM)_{rs}(X_{j_1}, \ldots, X_{j_t}) \le (SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}}) + (SM)_{rs}(X_{j_t}) + (1-s)\,(SM)_{rs}(X_{j_1}, \ldots, X_{j_{t-1}})\,(SM)_{rs}(X_{j_t}). \qquad (23)$$
After iteration of this equation over the $t$ columns, we can also write
$$(SM)_{rs}(X_{j_1}, \ldots, X_{j_t}) \le \frac{1}{1-s}\left[\prod_{l=1}^{t}\Big(1 + (1-s)(SM)_{rs}(X_{j_l})\Big) - 1\right]. \qquad (24)$$
The inequalities in (21)–(24) are associated with what are called “synergetic conditions”. In this section, we also derive the fully synergetic conditions as GKS inequalities.
After using Equations (7) and (9) in Equation (23), we get
$$\left(\chi_{j_1 \ldots j_t}\right)^{\frac{1-s}{1-r}} \le \left(\chi_{j_1 \ldots j_{t-1}}\right)^{\frac{1-s}{1-r}} \left(\chi_{j_t}\right)^{\frac{1-s}{1-r}}, \qquad (25)$$
and after iteration and use of Equation (24),
$$\left(\chi_{j_1 \ldots j_t}\right)^{\frac{1-s}{1-r}} \le \prod_{l=1}^{t}\left(\chi_{j_l}\right)^{\frac{1-s}{1-r}}. \qquad (26)$$
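The GKS bound of Equation (24) can be probed numerically for a given parameter pair; a sketch (toy data; whether the bound actually holds depends on where $(r, s)$ sits in the parameter space):

```python
import numpy as np
from collections import Counter

def sm(p, r, s):
    chi = np.sum(np.asarray(p, dtype=float) ** r)
    return (chi ** ((1.0 - s) / (1.0 - r)) - 1.0) / (1.0 - s)

alignment = ["ACDA", "ACDC", "GCDA", "ACEA", "GCDA"]
m, cols = len(alignment), (0, 2, 3)
r, s = 0.5, 1.2  # illustrative parameter choice

def probs(items):
    counts = Counter(items)
    return np.array([c / m for c in counts.values()])

joint = sm(probs(tuple(row[j] for j in cols) for row in alignment), r, s)
singles = [sm(probs(row[j] for row in alignment), r, s) for j in cols]

# Right-hand side of Equation (24): the pseudo-additive composition of the
# single-column Sharma-Mittal entropies.
bound = (np.prod([1.0 + (1.0 - s) * S for S in singles]) - 1.0) / (1.0 - s)
print(joint, bound)  # joint <= bound in the fully synergetic subregions
```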
The hatched region of strict concavity in the parameter space of Sharma–Mittal entropies, $0 < r \le 1$, $s \ge r$, is depicted in Figure 1. The special cases corresponding to Havrda–Charvat’s ($s = r$), Landsberg–Vedral’s ($s = 2 - r$), Rényi’s ($s = 1$), and “non-extensive” Gaussian’s ($r = 1$) entropies are also represented.
We can identify three subregions, $R_1$, $R_2$ and $R_3$, in Figure 1; they correspond to Equations (27)–(29), where the ordering of the $\chi$-symbols has been obtained from Equation (26). The subregions $R_1$ and $R_3$ are what we call fully synergetic subregions, and the corresponding inequalities are the GKS inequalities [2].
The subregions $R_1$, $R_2$, and $R_3$ are depicted in Figure 2a–c, respectively. The union of subregions $R_1$ and $R_3$ is the fully synergetic Khinchin–Shannon restriction to be imposed on the strict concavity region of Figure 1, and it is depicted in Figure 2d below.
5. The Maximal Extension of the Parameter Space and Its Reduction for Fully Synergetic Distributions
In Figure 1 and Figure 2d, we have depicted the structure of the strict concavity region for Sharma–Mittal entropy measures and its reduction to a subregion by the application of the requirement of fully synergetic distributions, respectively. Our analysis has used a coarse-grained approach to concavity, given by Equations (8) and (10). We now introduce some necessary refinements for characterizing the probability of occurrence in subarrays of $m$ rows and $t$ columns. For $t$ columns, there are $20^t$ possibilities of occurrence of amino acids, which could be a large number; we could, however, count not individual amino acids but groups of $t$-sets of amino acids ($m$-groups) which appear in the $m$ rows of the $m \times t$ array. We characterize these $m$-groups by the number $k$ of distinct $t$-sets they contain, from all $t$-sets equal ($k = 1$) to $m$ different $t$-sets ($k = m$). We also call $m_q$, $q = 1, \ldots, k$, the number of equal $t$-sets of a given $m$-group.
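Counting the $m$-group structure amounts to grouping identical $t$-sets among the $m$ rows; a minimal sketch (toy data; the symbols $k$ and $m_q$ follow the text above):

```python
from collections import Counter

alignment = ["ACDA", "ACDC", "GCDA", "ACEA", "GCDA"]
cols = (0, 2, 3)

counts = Counter(tuple(row[j] for j in cols) for row in alignment)
k = len(counts)  # number of distinct t-sets, from k = 1 (all equal) to k = m
multiplicities = sorted(counts.values(), reverse=True)  # the numbers m_q
assert sum(multiplicities) == len(alignment)
print(k, multiplicities)
```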
In Equation (2), the sum is over all the amino acids that make up the geometric object defined in Equation (1), the probability of occurrence. We can now perform the sum over $m$-groups and write
$$\sum_{q=1}^{k} p_{j_1 \ldots j_t}(q) = 1, \qquad p_{j_1 \ldots j_t}(q) = \frac{m_q}{m}, \qquad (30)$$
where the index $q$ labels the $t$-sets of an $m$-group. We also have, from Equation (7),
$$\chi_{j_1 \ldots j_t} = \sum_{q=1}^{k} \left(p_{j_1 \ldots j_t}(q)\right)^{r}. \qquad (31)$$
From Equations (30) and (31), we can now proceed to the calculation of the Hessian matrix for Sharma–Mittal entropy measures. We have for the first derivative of $(SM)_{rs}$
$$\frac{\partial (SM)_{rs}}{\partial p(q)} = \frac{r}{1-r}\, \chi^{\frac{r-s}{1-r}}\, \left(p(q)\right)^{r-1} \qquad (32)$$
(we drop the subscripts $j_1 \ldots j_t$ for brevity). We then have for a generic element of the Hessian matrix [2]
$$\frac{\partial^2 (SM)_{rs}}{\partial p(q)\, \partial p(q')} = r\, \chi^{\frac{1-s}{1-r}}\, \frac{P(q)\, P(q')}{p(q)\, p(q')} \left[\frac{r(r-s)}{(1-r)^2} - \frac{\delta_{q q'}}{P(q)}\right], \qquad (33)$$
where $P(q)$ is the escort probability associated to $p(q)$, or
$$P(q) = \frac{\left(p(q)\right)^{r}}{\chi}. \qquad (34)$$
The principal minors are given by
$$\Delta_j = \left(-r\, \chi^{\frac{1-s}{1-r}}\right)^{j} \left[\prod_{q=1}^{j} \frac{P(q)}{\left(p(q)\right)^{2}}\right] \left[1 - \frac{r(r-s)}{(1-r)^2} \sum_{q=1}^{j} P(q)\right], \qquad (35)$$
and we have
$$\sum_{q=1}^{k} P(q) = 1, \qquad (36)$$
according to Equation (31).
From Equations (35) and (36), the requirement of strict concavity will lead to
$$(-1)^{j}\, \Delta_j > 0, \qquad j = 1, \ldots, k. \qquad (37)$$
We then have
$$\frac{r(r-s)}{(1-r)^2} < \frac{1}{\sum_{q=1}^{j} P(q)}, \qquad j = 1, \ldots, k. \qquad (38)$$
This does correspond to the criterion of negative definiteness of the Hessian matrix for strict concavity of multivariate functions [11].
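The negative-definiteness criterion can be checked numerically by building the Hessian with finite differences and testing the alternating signs of the leading principal minors. A sketch (the distribution and the parameter choice are illustrative):

```python
import numpy as np

def sm(p, r, s):
    chi = np.sum(np.asarray(p, dtype=float) ** r)
    return (chi ** ((1.0 - s) / (1.0 - r)) - 1.0) / (1.0 - s)

def hessian(f, p, h=1e-5):
    """Central finite-difference Hessian of f at p (p treated as unconstrained)."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp, pm, mp, mm = p.copy(), p.copy(), p.copy(), p.copy()
            pp[i] += h; pp[j] += h
            pm[i] += h; pm[j] -= h
            mp[i] -= h; mp[j] += h
            mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4.0 * h * h)
    return H

p = np.array([0.4, 0.3, 0.2, 0.1])
r, s = 0.5, 0.8  # illustrative choice inside the strict concavity region
H = hessian(lambda q: sm(q, r, s), p)
minors = [np.linalg.det(H[:j, :j]) for j in range(1, len(p) + 1)]
print(all((-1) ** j * d > 0 for j, d in enumerate(minors, start=1)))
```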
Each $k$-value is associated with the $k$-epigraph region, which is the $k$-extension of the strict concavity region presented in Figure 1. These regions are given by
$$s \ge r - \frac{k}{k-1}\, \frac{(1-r)^2}{r}, \qquad 0 < r \le 1, \quad k \ge 2. \qquad (39)$$
The greatest lower bound of the sequence of $k$-curves is given by the limit $k \to \infty$. We then have
$$\lim_{k \to \infty} \left[r - \frac{k}{k-1}\, \frac{(1-r)^2}{r}\right] = 2 - \frac{1}{r}. \qquad (40)$$
We can then write for the maximal extended region of strict concavity
$$s \ge 2 - \frac{1}{r}, \qquad 0 < r \le 1. \qquad (41)$$
The region corresponding to Equation (41) is depicted in Figure 3 below.
We are now ready to undertake the application of the restrictions for fully synergetic distributions (validity of the GKS inequalities) to the maximal strict concavity region of Figure 3. We start by identifying two regions included in Figure 3, given by Equations (42) and (43) and depicted in Figure 4a,b, respectively.
In order to find the reduced region corresponding to Figure 3, analogously to what has been done for Figure 1, we also need the subregions $R_1$, $R_3$ of Equations (27) and (29): the resulting subregion of fully synergetic distributions is depicted in Figure 5.
6. Hölder Inequalities and GKS Inequalities: A Possible Conjecture
In this section, we study the relation between GKS inequalities [2] and Hölder inequalities by using examples of distributions obtained from databases of protein domain families. To begin, some definitions and properties of the probabilistic space are in order.
Let us first introduce the definition of the conditional probability of occurrence of the escort probability of occurrence [12]. This is a simple application to escort probabilities of Equation (3):
$$P_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = P(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\, P_{j_t}(a_{j_t}). \qquad (44)$$
From the definitions of escort probabilities, Equation (9), we can write
$$P_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t}) = \frac{\left(p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})\right)^{r}}{\chi_{j_1 \ldots j_t}} \qquad (45)$$
and
$$P_{j_t}(a_{j_t}) = \frac{\left(p_{j_t}(a_{j_t})\right)^{r}}{\chi_{j_t}}. \qquad (46)$$
In Equations (44)–(46), the symbols $a_{j_1}, \ldots, a_{j_t}$ assume the representative letters of the one-letter code for the 20 amino acids, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
After substituting Equations (45) and (46) into Equation (44), we get
$$P(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t}) = \frac{\left(p_{j_1 \ldots j_t}(a_{j_1}, \ldots, a_{j_t})\right)^{r}}{\left(p_{j_t}(a_{j_t})\right)^{r}}\, \frac{\chi_{j_t}}{\chi_{j_1 \ldots j_t}}, \qquad (47)$$
and from Equation (46),
$$P(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t}) = \frac{\left(p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\right)^{r}}{Z}. \qquad (48)$$
We also write the definition of the escort probability of occurrence of the conditional probability of occurrence [12],
$$\widetilde{P}(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t}) = \frac{\left(p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\right)^{r}}{W}. \qquad (49)$$
We can check the definitions of Equations (48) and (49) from the equality of the two escort probabilities with the original conditional probability, for $r = 1$:
$$P(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\Big|_{r=1} = \widetilde{P}(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\Big|_{r=1} = p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t}). \qquad (50)$$
We should note that the denominators of the right-hand sides of Equations (48) and (49), or,
$$Z = \frac{\chi_{j_1 \ldots j_t}}{\chi_{j_t}} \qquad (51)$$
and
$$W = \sum_{a_{j_1}} \cdots \sum_{a_{j_{t-1}}} \left(p(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\right)^{r}, \qquad (52)$$
will be equal if all amino acids in the $j_t$ column are equal. If we have, for instance, a $j_t$ column whose $m$ entries are all the same amino acid, the vectors of probabilities $p_{j_t}$ and $P_{j_t}$ will also be equal, given by a unit vector with a single non-zero component. This means that for this special case of an event of rare occurrence, we also have the equality of the conditional of the escort probability and the escort probability of the conditional probability, or the left-hand sides of Equations (48) and (49), respectively.
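The two denominators can be computed directly from a toy alignment; a sketch following the forms of Equations (51) and (52) as given above (data and column choices are illustrative):

```python
from collections import Counter

alignment = ["ACDA", "ACDC", "GCDA", "ACEA", "GCDA"]
r = 0.5
cols, last = (0, 2), 3

def chi_of(items):
    counts = Counter(items)
    total = sum(counts.values())
    return sum((c / total) ** r for c in counts.values())

# Z: chi of the joint distribution divided by chi of the j_t column, Equation (51).
chi_joint = chi_of(tuple(row[j] for j in cols + (last,)) for row in alignment)
chi_last = chi_of(row[last] for row in alignment)
Z = chi_joint / chi_last

# W: chi of the conditional distribution on (j_1, ..., j_{t-1}) given a fixed
# amino acid a in column j_t, Equation (52).
a = "A"
rows_a = [row for row in alignment if row[last] == a]
W = chi_of(tuple(row[j] for j in cols) for row in rows_a)

print(Z, W)  # equal only in special cases, e.g., a j_t column of identical letters
```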
For a $j_t$-column with a generic distribution of amino acids, the denominators $Z$ and $W$ on the right-hand sides of Equations (48) and (49) will no longer be equal. An ordering of these denominators should be decided from the probabilities of amino acid occurrence in a chosen protein domain family. This study is undertaken with the help of the functions $Z$ and $W$ of Equations (51) and (52) and with the functions $J$ and $U$, defined in Equations (53)–(56).
Our method will then be the comparison of pairs of functions in order to proceed with the search for the effect of fully synergetic distributions of amino acids. There are six comparisons to study, corresponding to the pairwise orderings of the functions $Z$, $W$, $J$ and $U$; they are written as Equations (57)–(59), where the function $B$ is also defined. Equations (57)–(59) should be multiplied by the probability of occurrence of the amino acid $a_{j_t}$ and, after that, each one has to be summed over $a_{j_t}$. We then have, respectively, Equations (60)–(63). Equations (60), (62) and (63) can be written, respectively, as Equations (64)–(66).
Hölder’s inequality as applied to probabilities of occurrence [3] is written as
$$\sum_{j} \left(p_j\right)^{\lambda} \left(q_j\right)^{1-\lambda} \le \left(\sum_{j} p_j\right)^{\lambda} \left(\sum_{j} q_j\right)^{1-\lambda}, \qquad 0 < \lambda < 1, \qquad (70)$$
where $p_j$, $q_j$ are probabilities of occurrence. After multiplying by the probability of occurrence of the amino acid $a_{j_t}$ and summing over $a_{j_t}$, we get Equation (71).
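A quick numerical illustration of this form of Hölder's inequality (the vectors and the exponent $\lambda$ are arbitrary examples):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])
lam = 0.3  # 0 < lambda < 1; conjugate exponents 1/lambda and 1/(1 - lambda)

lhs = np.sum(p ** lam * q ** (1.0 - lam))
rhs = np.sum(p) ** lam * np.sum(q) ** (1.0 - lam)
print(lhs <= rhs)  # True, by Hölder's inequality
```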
We also define, in Equation (72), the function $\widetilde{B}$ to be compared with $B$ below.
We then summarize the results obtained:

Equation (64) is only an identity.

Equations (65) and (68) can be ordered by Hölder’s inequality, Equations (70) and (71).

Equations (66) and (69) can be ordered by GKS inequalities, corresponding to fully synergetic distributions of amino acids.

Equation (67) cannot be ordered without additional experimental/phenomenological information on the probabilities of occurrence, to be obtained from updated versions of protein domain family databases [13].
We now collect the formulae obtained from the analysis performed in this section. Equations (65) and (68) are ordered by Hölder’s inequality; we write this ordering as Equation (73). Equations (66) and (69) are ordered by the GKS inequality; we write this ordering as Equation (74). After using Equation (73), Equation (67) can be rewritten accordingly.
In Figure 6a,b, we have depicted the curves corresponding to the functions $B$ and $\widetilde{B}$ for seven 3-sets of contiguous columns and 80 rows, chosen from databases Pfam 27.0 and Pfam 35.0, respectively. There are also inset figures in order to show the curves for $\widetilde{B}$.
In Figure 7a,b, we do the same for the differences of these functions. We emphasize that, for the 3-sets satisfying the orderings above, the GKS inequalities will result from the validity of Hölder’s inequality. We have worked with the PF01926 protein domain family to perform all the calculations.
7. Concluding Remarks
The first comment we want to make about the present work concerns the possibility of working in a region of the parameter space that preserves both the strict concavity and the fully synergetic structure of the Sharma–Mittal class of entropy measure distributions, to be visited by solutions of a new statistical mechanics approach. The usual work with Havrda–Charvat distributions describes the evolution along the boundary ($s = r$) of the region ($0 < r \le 1$, $s \ge r$) that was correctly considered to correspond to strict concavity, but it is also known to be non-synergetic for $s = r < 1$. We now have the opportunity to develop this statistical mechanics approach along an extended boundary, preserving the strict concavity and providing the study of the evolution of fully synergetic entropy distributions. A first sketch of these developments will be presented in a forthcoming publication.
With respect to Figure 6 and Figure 7, we could hypothesize that a failure to obtain the ordering of $B$ and $\widetilde{B}$ would be due to the poor alignment of some of the protein domain families we have been using. We are not yet confident enough to make this claim, however, because much more information “in silico”, obtained from many other protein domain families, would be needed. In other words, we expect that a good alignment of a protein domain family will result in the ordering of $B$ and $\widetilde{B}$, but we need to verify this in a large number of families from different Pfam versions before we proceed with a proposal of a method to improve the Pfam database. This looks promising for good scientific work in the line of research we have aimed to introduce in Ref. [2] and in the present contribution.