Article

Principle of Information Increase: An Operational Perspective on Information Gain in the Foundations of Quantum Theory

Department of Physics, University at Albany (SUNY), Albany, NY 12222, USA
*
Authors to whom correspondence should be addressed.
Information 2024, 15(5), 287; https://doi.org/10.3390/info15050287
Submission received: 7 April 2024 / Revised: 8 May 2024 / Accepted: 8 May 2024 / Published: 17 May 2024

Abstract

A measurement performed on a quantum system is an act of gaining information about its state. However, in the foundations of quantum theory, the concept of information is multiply defined, particularly in the area of quantum reconstruction, and its conceptual foundations remain surprisingly under-explored. In this paper, we investigate the gain of information in quantum measurements from an operational viewpoint in the special case of a two-outcome probabilistic source. We show that the continuous extension of the Shannon entropy naturally admits two distinct measures of information gain, differential information gain and relative information gain, and that these have radically different characteristics. In particular, while differential information gain can increase or decrease as additional data are acquired, relative information gain consistently grows and, moreover, exhibits asymptotic indifference to the data or choice of Bayesian prior. In order to make a principled choice between these measures, we articulate a Principle of Information Increase, which incorporates a proposal due to Summhammer that more data from measurements leads to more knowledge about the system, and also takes into consideration black swan events. This principle favours differential information gain as the more relevant metric and guides the selection of priors for these information measures. Finally, we show that, of the symmetric beta distribution priors, the Jeffreys binomial prior is the prior that ensures maximal robustness of information gain for the particular data sequence obtained in a run of experiments.

1. Introduction

A measurement performed on a quantum system is an act of acquiring information about its state. This informational perspective on quantum measurement is widely embraced in practical applications such as quantum tomography [1,2,3,4], Bayesian experimental design [5], and informational analysis of experimental data [6,7]. It is also embraced in foundational research.
In particular, information assumes a central role in the quantum reconstruction program [8], which seeks to elucidate the fundamental physical origins of quantum theory by deriving its formalism from information-inspired postulates [9,10,11,12,13,14,15,16,17]. Nonetheless, in the foundational exploration of quantum theory, the concept of information is articulated and formalized in many different ways, which raises the question of whether there exists a more systematic basis for choosing how to formalize the concept of information within this domain.
In this paper, we scrutinize the notion of information from an operational standpoint and propose a physically intuitive postulate to determine the appropriate information gained from measurements.
In both tomographic applications and reconstruction of quantum theory, the focus often lies on probability distributions of physical parameters or quantities, which are updated based on the measurement results. In these contexts, the outcomes of a measurement performed on a quantum system are modelled as the interrogation of an $n$-outcome probabilistic source characterised by a set of parameters. For example, a given measurement on a given system can be described by a probability distribution $\Pr(x \mid D)$ of a quantity $x$, which is updated from a prior probability distribution given the results $D$ obtained from a series of measurements performed on identical copies of a system. It is natural to consider using Shannon entropy to quantify the information gained from this updated distribution. However, Shannon entropy is limited to discrete distributions, whereas physical quantities and their associated probability distributions can be continuous.
The question thus arises: What is a suitable measure for quantifying the information obtained from real data, especially for quantities associated with continuous probability distributions?
One potential solution is to employ the Kullback–Leibler (KL) divergence, also known as the relative entropy, $H(x \mid D) = \int \Pr(x \mid D) \ln \frac{\Pr(x \mid D)}{\Pr(x \mid I)} \, dx$, where $\Pr(x \mid I)$ represents the prior distribution of $x$, and $\Pr(x \mid D)$ represents the posterior distribution of $x$ updated with the data $D$. This quantity is commonly referred to as the information gain from the prior distribution to the posterior distribution, and is widely used.
Since the KL divergence is non-negative and invariant under changes of coordinates, it appears to be a reasonable generalization of the Shannon entropy for continuous probability distributions. However, there are situations where information gain defined in terms of the KL divergence does not have a unique representation. Consider a scenario where one has acquired a series of data $D$, and one proceeds to take additional measurements, obtaining additional data $D'$. What is the additional information gain pertaining to $D'$? Using the KL divergence, there are two distinct ways to express the information related to this additional data. The first, to which we refer henceforth as the differential information gain, is simply the difference between the information gain from the combined dataset $\{D, D'\}$ and the information gain from $D$ alone (see Figure 1). The second, which we refer to as the relative information gain, is given by the KL divergence of the posterior distribution after obtaining the complete dataset $\{D, D'\}$ compared to the posterior distribution after receiving data $D$ alone (see Figure 2). These two measures of information gain exhibit notably different characteristics. For instance, whether the differential information gain increases or decreases when data $D'$ is acquired depends on the choice of the prior distribution over the parameter, while the relative information gain consistently increases regardless of the choice of prior.
As we shall discuss in Section 2, both of these measures can be viewed as arising as a consequence of seeking to generalize the Shannon entropy to continuous probability distributions. In order to determine which of these options is most appropriate for our purposes, we seek a physically intuitive informational postulate to guide our selection. The first criterion comes from the intuitive notion proposed by Summhammer [18,19] that more data from measurements leads to more knowledge about the system. This idea has its origin in the observation that, as we conduct more measurements to determine the value of a physical quantity, the measurement uncertainty tends to decrease. In the following, we employ information theory to formalize and explore the plausibility of this idea. We find that relative information gain is consistently non-negative, whereas the positivity of differential information gain hinges on the choice of the prior distribution.
Contrary to Summhammer’s criterion, we argue that under certain circumstances, negative information gain due to acquisition of additional data $D'$ is also meaningful. Take, for instance, the occurrence of a black swan event: an event so rare and unexpected that it significantly increases one’s uncertainty about the colour of swans. If the gain of information is considered to result from a reduction in the degree of uncertainty, the information gain associated with the observation of a black swan should indeed be negative. By combining this observation with Summhammer’s criterion, we are led to the Principle of Information Increase: the information gain from additional data should be positive asymptotically and negative in extreme cases. On the basis of the Principle of Information Increase, in the case of a two-outcome probabilistic source, we show that differential information gain is the more appropriate measure.
In addition, we formulate a new criterion, the robustness of information gain, for selecting priors to use with the differential information gain. The essential idea behind this criterion is as follows. If the result of the additional data $D'$ is fixed, then the information gain due to $D'$ will vary for different $D$. Robustness quantifies this variation in information gain across all possible data $D$. We show that for a two-outcome probabilistic source amongst the symmetric beta distributions, the Jeffreys binomial prior exhibits the highest level of robustness.
The quantification of knowledge gained from additional data is a topic that has received limited attention in the literature. In the realm of foundational research on quantum theory, this issue has been acknowledged but not extensively explored. Summhammer initially proposed the notion that “more data from measurements lead to more knowledge about the system” but did not employ information theory to address this problem, instead using changes in measurement uncertainty to quantify knowledge obtained in the asymptotic limit. This approach limits the applicability of the idea, as it excludes considerations pertaining to prior probability distributions and does not readily apply to finite data.
Wootters demonstrated the significance of the Jeffreys prior in the context of quantum systems from a different information-theoretical perspective [20]. In the domain of communication through quantum systems, the Jeffreys prior can maximize the information gained from measurements. Wootters approaches the issue from a more systematic perspective, utilizing mutual information to measure the information obtained from measurements. However, mutual information quantifies the average information gain over all possible data sequences, which is not suitable for addressing the specific scenario we discussed earlier, for which the focus is on the information gain from a fixed data sequence.
More broadly, the question of how much information is gained with the acquisition of additional data has been a relatively under-explored topic in both practical applications and foundational research on quantum theory. Commonly, mutual information is employed as a utility function. However, as noted above, mutual information essentially represents the expected information gain averaged over all possible data sequences. Consequently, it does not address the specific question of how much information is gained when a particular additional data point is obtained. From our perspective, this averaging process obscures essential edge effects, including black swan events, which, as we will discuss, serve as valuable guides for selecting appropriate information measures.
While our investigation primarily focuses on information gain in quantum systems, we conjecture that the principles and conclusions we draw can be extended to general probabilistic systems. Based on our analysis, we recommend quantification using differential information gain and the utilization of the Jeffreys multinomial prior. If one seeks to calculate the expected information gain in the next step, both the expected differential information gain and the expected relative information gain can be employed since, as we demonstrate for the two-outcome probabilistic case, they yield the same result.
The paper is organized as follows. In Section 2, we detail the two information gain measures, both of which have their origins in the generalization of Shannon entropy to continuous probability distributions. We will also examine Jaynes’ approach to continuous entropy, which serves as the foundation for understanding these two information gain measures. Section 3 and Section 4 focus on the numerical and asymptotic analysis of differential information gain and relative information gain for two-outcome probabilistic sources. Our primary emphasis is on how these measures behave under different prior distributions. We will explore black swan events, where the additional data $D'$ are highly improbable given $D$. In this unique context, we will assess the physical meaningfulness of the two information gain measures. In Section 5, we will discuss expected information gain under the assumption that data $D'$ from additional measurements have not yet been received. Despite the general differences between the two measures, it is intriguing to note that the two expected information gain measures are equal. Section 6 presents a comparison of the two information gain measures and the expected information gain. It is within this section that we propose the Principle of Information Increase, which crystallises the results of our analysis of the two measures of information gain. Finally, Section 7 explores the relationships between our work and other research in the field.

2. Continuous Entropy and Bayesian Information Gain

2.1. Entropy of Continuous Distribution

The Shannon entropy serves as a measure of uncertainty concerning a random variable before we have knowledge of its value. If we regard information as the absence of uncertainty, the Shannon entropy can also be used as a measure of information gained about a variable after acquiring knowledge about its value. However, it is important to note that Shannon entropy is applicable only to discrete random variables. To extend the concept of entropy to continuous variables, Shannon introduced the idea of differential entropy. Unlike Shannon entropy, differential entropy was not derived on an axiomatic basis. Moreover, it has a number of limitations.
First, the differential entropy can yield negative values, as exemplified by the differential entropy of a uniform distribution over the interval $[0, \tfrac{1}{2}]$, which equals $-\log 2$. Negative entropy, indicating a negative degree of uncertainty, lacks meaningful interpretation. Second, the differential entropy is coordinate-dependent [21], so that its value is not conserved under a change of variables. This implies that viewing the same data through different coordinate systems may result in the assignment of different degrees of uncertainty. Since the choice of coordinate systems is usually considered arbitrary, this coordinate-dependence also lacks a meaningful interpretation.
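Both limitations can be checked numerically. The following is a minimal sketch, assuming a simple midpoint-rule integrator; the helper name `differential_entropy` and the rescaling $y = 2x$ are illustrative choices, not part of the original analysis.

```python
import math

def differential_entropy(pdf, a, b, n=10_000):
    """Midpoint-rule estimate of minus the integral of p(x) ln p(x) over [a, b]."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        p = pdf(a + (k + 0.5) * dx)
        if p > 0.0:
            total -= p * math.log(p) * dx
    return total

# Uniform density on [0, 1/2]: p(x) = 2.
h_half = differential_entropy(lambda x: 2.0, 0.0, 0.5)
# The same variable rescaled to y = 2x is uniform on [0, 1]: p(y) = 1.
h_unit = differential_entropy(lambda y: 1.0, 0.0, 1.0)

print(h_half, -math.log(2))  # both close to -0.693: a negative "uncertainty"
print(h_unit)                # zero: a mere rescaling changed the value
```

The two printed entropies describe the same state of knowledge in two coordinate systems, yet they differ by $\log 2$.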
In an attempt to address the challenges associated with continuous entropy, Jaynes introduced a solution known as the limiting density of discrete points (LDDP) approach [22]. In this approach, the probability density $p(x)$ of a random variable $X$ is initially defined on a set of discrete points $x \in \{x_1, x_2, \ldots, x_n\}$. Jaynes proposed an invariant measure $m(x)$ such that, as the collection of points $x_i$ becomes increasingly numerous, in the limit as $n \to \infty$,
$$\lim_{n \to \infty} \frac{1}{n} \, (\text{number of points in } a < x < b) = \int_a^b m(x) \, dx \tag{1}$$
With the help of $m(x)$, the entropy of $X$ can then be represented as
$$H(X) = \lim_{n \to \infty} \left[ \log n - \int p(x) \log \frac{p(x)}{m(x)} \, dx \right] \tag{2}$$
In this manner, the weaknesses associated with differential entropy appear to be resolved. This quantity remains invariant under changes of variables and is always non-negative. A similar approach is also discussed in [21]. However, two new issues arise. In Equation (2), H ( X ) contains an infinite term, and the measure function m ( x ) is unknown.
Regarding the infinite term, two potential solutions exist. The first option is to retain this infinite term and to reserve interpretation to the difference in the continuous entropy of two continuous distributions. The second solution is more straightforward: simply to omit the problematic log n term.
1. Entropy of continuous distribution as a difference:
For example, when variable $X$ is updated to $X'$ due to certain actions, the decrease in entropy can be expressed as:
$$\Delta H(X \to X') \equiv H(X) - H(X') = \int p'(x) \log \frac{p'(x)}{m(x)} \, dx - \int p(x) \log \frac{p(x)}{m(x)} \, dx \tag{3}$$
where $p'(x)$ represents the probability distribution of $X'$. Here, the two infinite terms cancel. The quantity $\Delta H$ quantifies the reduction in uncertainty about variable $X$ resulting from these actions. This reduction in uncertainty can also be interpreted as an increase in information.
2. Straightforward solution:
Jaynes directly discards the infinite term in Equation (2). For the sake of convenience, the minus sign is also dropped. This leads to the definition of Shannon–Jaynes information:
$$H_{\text{Jaynes}}(X) = \int p(x) \log \frac{p(x)}{m(x)} \, dx \tag{4}$$
This term quantifies the amount of information we possess regarding the outcome of $X$ rather than the degree of uncertainty about $X$. $H_{\text{Jaynes}}$ is equivalent to the KL divergence between the distributions $p(x)$ and $m(x)$.
In short, there are two ways to represent the entropy of a continuous distribution, with no obvious criterion to choose between them. In a special case where the variable $X$ initially follows a distribution identical to the measure function, i.e., $p(x) = m(x)$, and $X$ undergoes evolution to $X'$ with distribution $p'(x)$, then we find that $\Delta H(X \to X') = H_{\text{Jaynes}}(X')$.
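The coordinate invariance of the Shannon–Jaynes information, in contrast to the differential entropy, can be illustrated with a small numerical sketch. The density $p(x) = 2x$, the measure $m(x) = 1$ on $(0, 1)$, and the change of variable $y = x^2$ are arbitrary choices made here for illustration; `midpoint` is a hypothetical helper.

```python
import math

def midpoint(f, a, b, n=100_000):
    """Midpoint-rule estimate of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx

# In the x coordinate: density p(x) = 2x, measure m(x) = 1 on (0, 1).
h_x = midpoint(lambda x: -2 * x * math.log(2 * x), 0.0, 1.0)         # differential entropy
kl_x = midpoint(lambda x: 2 * x * math.log(2 * x / 1.0), 0.0, 1.0)   # Shannon-Jaynes information

# In the y = x**2 coordinate, both functions pick up the Jacobian dx/dy = 1 / (2 sqrt(y)).
p_y = lambda y: 1.0                          # 2x * dx/dy
m_y = lambda y: 1.0 / (2.0 * math.sqrt(y))   # 1 * dx/dy
h_y = midpoint(lambda y: -p_y(y) * math.log(p_y(y)), 0.0, 1.0)
kl_y = midpoint(lambda y: p_y(y) * math.log(p_y(y) / m_y(y)), 0.0, 1.0)

print(h_x, h_y)    # differ: the differential entropy depends on the coordinate
print(kl_x, kl_y)  # agree up to integration error: both approximate ln 2 - 1/2
```

Because $p$ and $m$ transform with the same Jacobian, their ratio is unchanged, which is why the second pair of numbers agrees.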
The remaining challenge lies in the selection of the measure function m ( x ) . When applying this concept of continuous entropy to the relationship between information theory and classical statistical physics, Jaynes opted for a uniform measure over phase space [22]. However, there is no established criterion for the choice of the measure function in any given application. We note that this measure function is analogous to the prior distribution in the context of Bayesian probability, with which it is often identified, which then leads to the well-known challenge of prior selection in Bayesian data analysis.

2.2. Bayesian Information Gain

In a coin-tossing model, let $p$ denote the probability of getting a head in a single toss, and let $N$ be the total number of tosses. The outcomes of these $N$ tosses can be represented by an $N$-tuple, $T_N = (t_1, t_2, \ldots, t_N)$, where each $t_i$ represents the result of the $i$th toss, with $t_i \in \{\text{Head}, \text{Tail}\}$. Applying Bayes’ rule, the posterior probability for the probability of getting a head is given by:
$$\Pr(p \mid N, T_N, I) = \frac{\Pr(T_N \mid N, p, I) \, \Pr(p \mid I)}{\int \Pr(T_N \mid N, p, I) \, \Pr(p \mid I) \, dp} \tag{5}$$
where $\Pr(p \mid I)$ represents the prior. The information gain after $N$ tosses would be the KL divergence from the prior distribution to the posterior distribution:
$$I(N) = D_{\mathrm{KL}}\big(\Pr(p \mid N, T_N, I) \,\|\, \Pr(p \mid I)\big) = \int_0^1 \Pr(p \mid N, T_N, I) \ln \frac{\Pr(p \mid N, T_N, I)}{\Pr(p \mid I)} \, dp \tag{6}$$
Based on the earlier discussion on continuous entropy, this quantity can be interpreted in two ways, either as the difference between the information gain after $N$ tosses and the information gain without any tosses or as the KL divergence from the prior distribution to the posterior distribution.
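As a concrete illustration, the information gain $I(N)$ defined above can be evaluated by direct numerical integration. The sketch below assumes a uniform prior, $\Pr(p \mid I) = 1$, purely for simplicity; the function name `information_gain` and the midpoint rule are illustrative choices.

```python
import math

def information_gain(heads, n, steps=20_000):
    """I(N) for a uniform prior on p, by midpoint integration over (0, 1)."""
    tails = n - heads
    # Log of the normalising integral of p^heads (1 - p)^tails, i.e. ln B(heads + 1, tails + 1).
    log_norm = math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(n + 2)
    dp = 1.0 / steps
    total = 0.0
    for k in range(steps):
        p = (k + 0.5) * dp
        log_post = heads * math.log(p) + tails * math.log(1.0 - p) - log_norm
        # posterior * ln(posterior / prior), with prior = 1
        total += math.exp(log_post) * log_post * dp
    return total

print(information_gain(0, 0))   # no data: posterior equals prior, so the gain is zero
print(information_gain(1, 1))   # one head: close to ln 2 - 1/2
print(information_gain(3, 10))  # three heads in ten tosses
```

With one head in one toss the posterior is $2p$, for which the integral can be done by hand and equals $\ln 2 - \tfrac{1}{2}$.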
When considering the information gain of additional tosses based on the results of the previous $N$ tosses, we may observe two different approaches to represent this quantity.
Let $t_{N+1}$ represent the outcome of the $(N+1)$th toss, and let $T_{N+1} = (t_1, t_2, \ldots, t_N, t_{N+1})$ denote the combined outcomes of the first $N$ tosses and the $(N+1)$th toss. The posterior distribution after these $N+1$ tosses is given by:
$$\Pr(p \mid N+1, T_{N+1}, I) = \frac{\Pr(T_{N+1} \mid N+1, p, I) \, \Pr(p \mid I)}{\int \Pr(T_{N+1} \mid N+1, p, I) \, \Pr(p \mid I) \, dp} \tag{7}$$
When considering information gain as a difference between two quantities, the first form of information gain for this single toss $t_{N+1}$ can be expressed as:
$$I_{\text{diff}} = D_{\mathrm{KL}}\big(\Pr(p \mid N+1, T_{N+1}, I) \,\|\, \Pr(p \mid I)\big) - D_{\mathrm{KL}}\big(\Pr(p \mid N, T_N, I) \,\|\, \Pr(p \mid I)\big) \tag{8}$$
In this expression, the first term, $D_{\mathrm{KL}}(\Pr(p \mid N+1, T_{N+1}, I) \,\|\, \Pr(p \mid I))$, represents the information gain from 0 tosses to $N+1$ tosses, while the second term, $D_{\mathrm{KL}}(\Pr(p \mid N, T_N, I) \,\|\, \Pr(p \mid I))$, represents the information gain from 0 tosses to $N$ tosses. The difference between these terms quantifies the information gain in the single $(N+1)$th toss (see Figure 1). In this context, we can refer to $I_{\text{diff}}$ as the differential information gain in a single toss.
Figure 1. Differential information gain in a single toss. Suppose we have data from the first $N$ tosses, denoted as $T_N$. Using a specific prior distribution, we can calculate the information gain for these first $N$ tosses, denoted as $I(N)$. If we now consider the $(N+1)$th toss and obtain the result $t_{N+1}$, we can repeat the same procedure to calculate the information gain for a total of $N+1$ tosses, denoted as $I(N+1)$. The information gain specific to the $(N+1)$th toss can be obtained as the difference between $I(N+1)$ and $I(N)$.
Alternatively, we directly calculate the information gain from the $N$th toss to the $(N+1)$th toss. Hence, the second form of information gain is defined as follows:
$$I_{\text{rel}} = D_{\mathrm{KL}}\big(\Pr(p \mid N+1, T_{N+1}, I) \,\|\, \Pr(p \mid N, T_N, I)\big), \tag{9}$$
which is simply the KL divergence from the posterior distribution after $N$ tosses to the posterior distribution after $N+1$ tosses (see Figure 2). We refer to $I_{\text{rel}}$ as the relative information gain in a single toss.
Figure 2. Relative information gain in a single toss: The posterior distribution calculated from the results of the first $N$ tosses serves as the prior for the $(N+1)$th toss. The KL divergence between this posterior and the subsequent posterior represents the information gain in the $(N+1)$th toss.
In general, these two quantities, $I_{\text{diff}}$ and $I_{\text{rel}}$, are not the same unless $N = 0$, which implies that no measurements have been performed. $I_{\text{diff}}$ could take on negative values, while $I_{\text{rel}}$ is always non-negative due to the properties of the KL divergence. (This non-negativity is a consequence of Jensen’s inequality applied to the concave logarithm function, which ensures that the expected logarithmic ratio of two probability distributions, which constitutes the KL divergence, cannot be negative.) Although KL divergence is not a proper distance metric between probability distributions (as it does not satisfy the triangle inequality), it is a valuable tool for illustrating the analogy of displacement and distance in a random walk model. (In a random walk, the change in the distance from the starting point after $N+1$ steps compared to after $N$ steps could be either positive or negative, analogous to how $I_{\text{diff}}$ can have positive or negative values. On the other hand, the length of the step between the positions at step $N$ and step $N+1$ is never negative, which is analogous to $I_{\text{rel}}$ always having a non-negative value.) This analogy helps elucidate the subtle difference between the two types of information gain.
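The difference between the two measures can be made concrete with a short sketch. It assumes a beta prior (uniform by default), for which the posterior stays in the beta family and the KL divergence has a standard closed form; the helper names `digamma`, `log_beta`, `kl_beta`, and `gains` are hypothetical, and the digamma routine is a simple recurrence-plus-series approximation rather than a library call.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift upward with psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    u = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - u * (1 / 12 - u * (1 / 120 - u / 252))

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a2, b2):
    """D_KL( Beta(a1, b1) || Beta(a2, b2) ), standard closed form."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            - (a1 - a2 + b1 - b2) * digamma(a1 + b1))

def gains(h, n, prior=(1.0, 1.0)):
    """Return (I_diff, I_rel) for a head on toss n + 1 after h heads in n tosses."""
    a0, b0 = prior
    a, b = a0 + h, b0 + (n - h)  # posterior parameters after n tosses
    i_diff = kl_beta(a + 1, b, a0, b0) - kl_beta(a, b, a0, b0)
    i_rel = kl_beta(a + 1, b, a, b)
    return i_diff, i_rel

print(gains(0, 0))   # no earlier data: the two measures coincide
print(gains(5, 10))  # balanced earlier data: both positive
print(gains(0, 10))  # ten tails, then a head: I_diff < 0 while I_rel > 0
```

The last case shows that the sign of $I_{\text{diff}}$ depends on the data, whereas $I_{\text{rel}}$ cannot be negative.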
Our goal is to determine which information gain measure is a more suitable choice. To do so, we use Summhammer’s aforementioned postulate—“more measurements lead to more knowledge about the physical system” [18,19]—as our point of departure. If we quantify “knowledge” in terms of information gain from data, this notion suggests that the information gain from additional data should be positive if it indeed contributes to our understanding. This consideration makes relative information gain seem an appealing choice, as it is always non-negative. However, the derivation of differential information gain also carries significance. This leads to the question of whether Summhammer’s intuitive idea is sufficient, and if not, what can replace it. In the following sections, we first will investigate differential information gain in both the finite N and asymptotic cases. We will explore the implications of negative values of differential information gain, particularly in extreme situations. We will then conduct numerical and asymptotic analyses of relative information gain. After analysing both measures of information gain, we will be better equipped to compare and establish connections between them and to assess the physical meaningfulness of Summhammer’s proposal.

3. Differential Information Gain

3.1. Finite Number of Tosses

For the prior distribution, we employ the symmetric beta distribution, which serves as the conjugate prior for the binomial distribution:
$$\Pr(p \mid I) = \frac{p^{\alpha} (1 - p)^{\alpha}}{B(\alpha + 1, \alpha + 1)} \tag{10}$$
where $\alpha > -1$, and $B(\cdot, \cdot)$ is the beta function.
In general, the beta distribution is characterized by two parameters. However, as the prior over $p$ is invariably taken to be symmetric about $p = 1/2$ (which follows from the desideratum that the prior be invariant under outcome relabelling), we use a symmetric, single-parameter beta distribution. This distribution encompasses a wide spectrum of priors, including the uniform distribution (when $\alpha = 0$) and the Jeffreys binomial prior (when $\alpha = -0.5$).
The differential information gain of the $(N+1)$th toss is (see Appendix A)
$$I_{\text{diff}} = \psi(h_N + \alpha + 2) - \psi(N + 2\alpha + 3) + \frac{h_N}{h_N + \alpha + 1} - \frac{N}{N + 2\alpha + 2} + \ln \frac{N + 2\alpha + 2}{h_N + \alpha + 1} \tag{11}$$
where $\psi$ is the digamma function (which can be defined in terms of the gamma function: $\psi(x) = \Gamma'(x) / \Gamma(x)$), and $h_N$ is the number of heads in the first $N$ tosses.
In this context, we assume that $t_{N+1} = \text{Head}$. There is also a corresponding $I_{\text{diff}}(t_{N+1} = \text{Tail})$, but there is no loss of generality since we consider all possible values of $T_N$ and since the expressions for both cases (Head and Tail) are symmetric.
$I_{\text{diff}}$ is a function of $h_N$ and $\alpha$, and $h_N$ ranges from 0 to $N$. In the following, we select a specific value for $\alpha$ and calculate all the $N + 1$ values of $I_{\text{diff}}$ for each $N$ (see Figure 3).
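The closed-form expression for $I_{\text{diff}}$ above can be cross-checked against its definition as a difference of two KL divergences. The sketch below is one way to do this, assuming the standard closed form for the KL divergence between beta distributions; all function names are hypothetical, and `digamma` is an approximate stand-in for a library routine.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift upward with psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    u = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - u * (1 / 12 - u * (1 / 120 - u / 252))

def i_diff(h, n, alpha):
    """Closed form: gain from a head on toss n + 1 after h heads in n tosses."""
    a = h + alpha + 1.0        # first posterior parameter after n tosses
    s = n + 2.0 * alpha + 2.0  # sum of both posterior parameters
    return digamma(a + 1.0) - digamma(s + 1.0) + h / a - n / s + math.log(s / a)

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a2, b2):
    """D_KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            - (a1 - a2 + b1 - b2) * digamma(a1 + b1))

def i_diff_from_definition(h, n, alpha):
    """I(N + 1) - I(N), each term a KL divergence from the prior Beta(alpha + 1, alpha + 1)."""
    a0 = alpha + 1.0
    a, b = a0 + h, a0 + (n - h)
    return kl_beta(a + 1, b, a0, a0) - kl_beta(a, b, a0, a0)

for h, n, alpha in [(0, 10, 0.0), (3, 10, 0.0), (7, 20, -0.5), (2, 5, 1.0)]:
    print(h, n, alpha, i_diff(h, n, alpha), i_diff_from_definition(h, n, alpha))
```

The two columns of values should agree to within the accuracy of the digamma approximation.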

3.1.1. Positivity of $I_{\text{diff}}$

Returning to our initial question, “Will more data lead to more knowledge?”: if we use the term “knowledge” to represent the differential information gain and use $I_{\text{diff}}$ to quantify the information gained in each measurement, the question becomes rather straightforward: “Is $I_{\text{diff}}$ always positive?”
In Figure 3, we present the results of numerical calculations for various values of $N$. Upon close examination of the graph, it becomes evident that $I_{\text{diff}}$ is not always positive, except under specific conditions. In the following sections, we will investigate the conditions that lead to exceptions.
Figure 3. Differential information gain ($I_{\text{diff}}$) vs. $N$ for different priors. Here, the y-axis represents the value of $I_{\text{diff}}$, and the x-axis corresponds to the value of $N$. In each graph, we fix the value of $\alpha$ to allow for a comparison of the behaviour of $I_{\text{diff}}$ under different priors. Given $N$, there are $N + 1$ points in the vertical direction as $h_N$ ranges from 0 to $N$. Notably, for $\alpha = -0.7$, all points lie above the x-axis, while for other priors, negative points are present, and the fraction of negative points becomes constant as $N$ increases. The asymptotic behaviour of this fraction is shown in Figure 4. Moreover, it appears that the graph is most concentrated when $\alpha = -0.5$, whereas for $\alpha < -0.5$ and $\alpha > -0.5$, the graph becomes more dispersed.
For certain priors, the differential information gain is consistently positive (Figure 3a), while for other priors, both positive and negative regions exist (Figure 3b–d). We note that for priors leading to negative regions, the lowest line exhibits greater dispersion compared to the other data lines. This lowest line represents the scenario where the first $N$ tosses all result in tails, but the $(N+1)$th toss yields a head. This situation is akin to a black swan event, and negative information gain in this extreme case holds significant meaning: if we have tossed a coin $N$ times and obtained all tails, we anticipate another tail in the next toss; hence, receipt of heads on the next toss raises the degree of uncertainty about the outcome of the next toss, leading to a reduction in information about the coin’s bias.
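The all-tails-then-head scenario can be evaluated directly from Equation (11) by setting $h_N = 0$. The sketch below compares three priors; the function names are hypothetical, `digamma` is an approximate stand-in for a library routine, and the particular values of $N$ are arbitrary.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift upward with psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    u = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - u * (1 / 12 - u * (1 / 120 - u / 252))

def i_diff(h, n, alpha):
    """Equation (11): gain from a head on toss n + 1 after h heads in n tosses."""
    a = h + alpha + 1.0
    s = n + 2.0 * alpha + 2.0
    return digamma(a + 1.0) - digamma(s + 1.0) + h / a - n / s + math.log(s / a)

# Black swan line: n tails in a row (h = 0), followed by a head.
for n in (1, 5, 20, 80):
    uniform = i_diff(0, n, 0.0)
    jeffreys = i_diff(0, n, -0.5)
    below_critical = i_diff(0, n, -0.7)
    print(n, uniform, jeffreys, below_critical)
```

For the uniform prior the printed gain is negative at every $N$ shown, while for $\alpha = -0.7$ it stays positive, in line with the behaviour described for Figure 3.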

3.1.2. Fraction of Negatives

In order to illustrate the variations in the positivity of information gain under different priors, we introduce a new quantity that we refer to as the Fraction of Negatives (FoN), which is the ratio of the number of $h_N$ values that lead to negative $I_{\text{diff}}$ to the total number of $h_N$ values, $N + 1$. For instance, if, for a given $\alpha$, $N = 10$ and $I_{\text{diff}} < 0$ when $h_N = 0, 1, 2, 3$, the FoN under this $\alpha$ and $N$ is $\frac{4}{11}$.
From Figure 4, we identify a critical point, denoted as $\alpha_p$, which is approximately $-0.7$. For any $\alpha \le \alpha_p$, $I_{\text{diff}}$ is guaranteed to be positive for all $N$ and $h_N$ values.
Figure 4. Fraction of Negatives (FoN) vs. $N$ under different values of $\alpha$. In Figure 3, we can observe that larger $\alpha$ values lead to more dispersed lines and an increased number of negative values for each $N$. We use FoN to quantify this fraction of negative points. It appears that for $\alpha \le -0.7$, FoN is consistently zero, indicating that $I_{\text{diff}}$ is always positive. For $\alpha \le -0.5$, FoN decreases and tends to zero as $N$ becomes large, while for $\alpha > -0.5$, FoN tends to a constant as $N$ increases, and this constant grows with increasing values of $\alpha$.
If $\alpha > \alpha_p$, negative terms exist for some $h_N$; however, the patterns of these negative terms differ across various $\alpha$ values.
Additionally, we notice the presence of a turning point, $\alpha_0 = -0.5$. For $\alpha \le \alpha_0$, FoN tends to zero as $N$ increases, whereas for $\alpha > \alpha_0$, FoN approaches a constant as $N$ grows.
A clearer representation of the critical point $\alpha_p$ and the turning point $\alpha_0$ can be found in Figure 5, where the critical point $\alpha_p$ is approximately $-0.68$.
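The Fraction of Negatives can be computed by counting, for a given $N$ and $\alpha$, the values of $h_N$ for which Equation (11) is negative. The following is a minimal sketch under that reading; the function names are hypothetical and `digamma` is an approximate stand-in for a library routine.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift upward with psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    u = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - u * (1 / 12 - u * (1 / 120 - u / 252))

def i_diff(h, n, alpha):
    """Equation (11): gain from a head on toss n + 1 after h heads in n tosses."""
    a = h + alpha + 1.0
    s = n + 2.0 * alpha + 2.0
    return digamma(a + 1.0) - digamma(s + 1.0) + h / a - n / s + math.log(s / a)

def fraction_of_negatives(n, alpha):
    """Share of the n + 1 possible head counts h = 0..n that give a negative I_diff."""
    negatives = sum(1 for h in range(n + 1) if i_diff(h, n, alpha) < 0.0)
    return negatives / (n + 1)

for alpha in (-0.7, -0.5, 0.0, 0.5):
    print(alpha, [round(fraction_of_negatives(n, alpha), 3) for n in (10, 50, 200)])
```

For the uniform prior and $N = 10$, this count gives four negative values out of eleven, which matches the illustrative FoN of $\frac{4}{11}$ used in the definition above.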

3.1.3. Robustness of $I_{\text{diff}}$

In Figure 3, different priors not only exhibit varying degrees of positivity but also display varying degrees of variation in $I_{\text{diff}}$ for different values of $h_N$; we refer to this as divergence. The divergence depends upon the choice of prior. To better understand this dependence, we quantify the dependence of $I_{\text{diff}}$ on $h_N$ by the standard deviation of $I_{\text{diff}}$ across different values of $h_N$. Figure 6 illustrates how the standard deviation changes with respect to $\alpha$ while keeping $N$ constant.
It is evident that when $\alpha$ is close to $-0.5$, the standard deviation is at its minimum. Reduced dependence of $I_{\text{diff}}$ on $h_N$ enhances its robustness against the effects of nature, as we attribute $h_N$ to natural factors, while $N$ is determined by human measurement choices. As $N$ increases, the minimum point approaches $-0.5$. In the limit of large $N$, this minimum point converges to $\alpha = -\tfrac{1}{2}$, which means that under this specific choice of prior, $I_{\text{diff}}$ depends minimally on $h_N$ and primarily on $N$.
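The robustness measure described above can be sketched as follows. This is an illustrative Python example, assuming the Appendix A closed form for $I_{\text{diff}}(t_{N+1} = \text{Head})$ and a population standard deviation over $h_N = 0, \ldots, N$; the function names are hypothetical and the digamma routine is a simple stand-in rather than a library call.

```python
import math
import statistics


def digamma(x: float) -> float:
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    y = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - y * (1.0 / 12 - y * (1.0 / 120 - y / 252))


def i_diff_head(n: int, h: int, alpha: float) -> float:
    """Differential information gain when toss N+1 is a head, given h heads in N tosses."""
    return (
        digamma(h + alpha + 2)
        - digamma(n + 2 * alpha + 3)
        + h / (h + alpha + 1)
        - n / (n + 2 * alpha + 2)
        + math.log((n + 2 * alpha + 2) / (h + alpha + 1))
    )


def spread_over_h(n: int, alpha: float) -> float:
    """Standard deviation of the gain across all possible head counts h = 0..N."""
    values = [i_diff_head(n, h, alpha) for h in range(n + 1)]
    return statistics.pstdev(values)


if __name__ == "__main__":
    n = 50
    for alpha in (-0.9, -0.7, -0.5, -0.3, 0.0, 1.0):
        print(f"alpha={alpha:+.1f}  std over h_N = {spread_over_h(n, alpha):.4f}")
```

Scanning a finer grid of $\alpha$ for several values of $N$ would give a curve of the kind shown in Figure 6; the exact location of the minimum at finite $N$ is not asserted here.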

3.2. Large N Approximation

Utilizing a recurrence relation and a large $x$ approximation, the digamma function can be approximated as:
$$\psi(x) = \frac{1}{x-1} + \psi(x-1) \approx \frac{1}{x-1} + \ln(x-1) - \frac{1}{2(x-1)} = \frac{1}{2(x-1)} + \ln(x-1)$$
As a result, the large $N$ approximation for the differential information gain in Equation (11) becomes:
$$I_{\text{diff}} = \frac{2h_N + 1}{2(h_N + \alpha + 1)} - \frac{2N + 1}{2(N + 2\alpha + 2)}$$
Using this approximation, when $\alpha = -\tfrac{1}{2}$, $I_{\text{diff}} = \frac{1}{2(N+1)}$, which shows that $I_{\text{diff}}$ depends solely on $N$. This finding aligns with Figure 3, which demonstrates that $I_{\text{diff}}$ is most concentrated when $\alpha = -0.5$, and is also consistent with the results of [23].
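As a quick check of this statement, the substitution $\alpha = -\tfrac{1}{2}$ can be carried out explicitly. The block below is only a worked restatement of that step, assuming the large $N$ approximation exactly as given above:

```latex
\begin{aligned}
I_{\text{diff}}\Big|_{\alpha = -1/2}
  &\approx \frac{2h_N + 1}{2\left(h_N + \tfrac{1}{2}\right)} - \frac{2N + 1}{2\left(N + 1\right)} \\
  &= 1 - \frac{2N + 1}{2N + 2} \\
  &= \frac{1}{2(N + 1)} .
\end{aligned}
```

The first term reduces to exactly 1 for every $h_N$, which is why the dependence on $h_N$ drops out.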
In Figure 4, we observe that the FoN tends to become constant for very large values of $N$. These constants can be estimated using the large $N$ approximation of $I_{\text{diff}}$ in Equation (13) (see Table 1). If $I_{\text{diff}} \le 0$, then (for $4\alpha + 3 > 0$)
$$h_N \le \frac{2N\alpha + N - \alpha - 1}{4\alpha + 3},$$
and we obtain:
$$\text{FoN} \approx \frac{1}{N+1}\,\frac{2N\alpha + N - \alpha - 1}{4\alpha + 3} \approx \frac{2\alpha + 1}{4\alpha + 3}$$
This equation aligns with the asymptotic lines in Figure 4, providing support for the observation mentioned in Figure 3: namely, that for $\alpha = -0.7$, all points lie above the x-axis, while for other priors, negative points are present, and the fraction of negative points becomes constant.
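The asymptotic constant can be compared against a direct count. The sketch below is a hedged Python illustration that uses only the large $N$ approximation of $I_{\text{diff}}$ (not the exact digamma expression) and counts the $h_N$ values for which it is non-positive; the function names are hypothetical, and the limiting formula is applied only for $\alpha \ge -\tfrac{1}{2}$.

```python
def i_diff_large_n(n: int, h: int, alpha: float) -> float:
    """Large-N approximation of the differential information gain (Head outcome)."""
    return (2 * h + 1) / (2 * (h + alpha + 1)) - (2 * n + 1) / (2 * (n + 2 * alpha + 2))


def fon_large_n(n: int, alpha: float) -> float:
    """Fraction of head counts h = 0..N with non-positive approximate gain."""
    negatives = sum(1 for h in range(n + 1) if i_diff_large_n(n, h, alpha) <= 0)
    return negatives / (n + 1)


def fon_limit(alpha: float) -> float:
    """Asymptotic FoN, (2*alpha + 1) / (4*alpha + 3), assumed valid for alpha >= -1/2."""
    return (2 * alpha + 1) / (4 * alpha + 3)


if __name__ == "__main__":
    for alpha in (-0.5, 0.0, 1.0, 3.0):
        print(
            f"alpha={alpha:+.1f}  count={fon_large_n(3000, alpha):.4f}"
            f"  limit={fon_limit(alpha):.4f}"
        )
```

For large $N$ the counted fraction and the limiting value should agree to within roughly $1/N$, which is the resolution of the count itself.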

4. Relative Information Gain

The second form of information gain in a single toss is relative information gain, which represents the KL divergence from the posterior after $N$ tosses to the posterior after $N + 1$ tosses. We continue to use the one-parameter beta distribution prior in the form of Equation (10). The relative information gain is (see Appendix B):
$$I_{\text{rel}}(t_{N+1} = \text{Head}) = \psi(h_N + \alpha + 2) - \psi(N + 2\alpha + 3) + \ln\frac{N + 2\alpha + 2}{h_N + \alpha + 1}$$
Relative information gain exhibits entirely different behaviour compared to differential information gain. Due to the properties of KL divergence, relative information gain is always non-negative, eliminating the need to consider negative values. We explore the dependence of relative information gain on priors and the interpretation of information gain in extreme cases.
In Figure 7, it becomes evident that, under different priors, the data lines exhibit similar shapes. This suggests that relative information gain is relatively insensitive to the choice of priors. On each graph, the top line represents the extreme case where the first $N$ tosses result in tails and the $(N+1)$th toss results in a head. This line is notably separated from the other data lines, indicating that relative information gain behaves more like a measure of the degree of surprise associated with this additional data. In this black swan event, the posterior after $N + 1$ tosses differs significantly from the posterior after $N$ tosses.
For small values of $N$, both the average value and the standard deviation of $I_{\text{rel}}$ exhibit a clear monotonic relationship with $\alpha$, meaning that larger values of $\alpha$ result in smaller average values and standard deviations. However, as $N$ becomes large, all priors converge and become indistinguishable. Nonetheless, it is important to note that relative information gain remains heavily dependent on the specific data sequences ($h_N$). Figure 8 illustrates how the standard deviation of $I_{\text{rel}}$ under different priors converges to the same value as $N$ increases.
By utilizing the aforementioned approximation of the digamma function, we obtain:
$$I_{\text{rel}}(t_{N+1} = \text{Head}) \approx \frac{1}{2(h_N + \alpha + 1)} - \frac{1}{2(N + 2\alpha + 2)} = \frac{N - h_N + \alpha + 1}{2(h_N + \alpha + 1)(N + 2\alpha + 2)}$$
In the large $N$ limit, $I_{\text{rel}}$ becomes:
$$I_{\text{rel}}(t_{N+1} = \text{Head}) \approx \frac{1}{2N}\left[\left(\frac{h_N}{N}\right)^{-1} - 1\right],$$
which is independent of $\alpha$. Thus, it appears that the properties of relative information gain and differential information gain are complementary to each other. The differences between them are summarized in Table 2.
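The behaviour of relative information gain described in this section can be explored with a short script. This is a minimal Python sketch, assuming the closed form from Appendix B for the Head outcome and the approximation obtained above; the function names are hypothetical and the digamma routine is a simple stand-in.

```python
import math


def digamma(x: float) -> float:
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    y = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - y * (1.0 / 12 - y * (1.0 / 120 - y / 252))


def i_rel_head(n: int, h: int, alpha: float) -> float:
    """Relative information gain when toss N+1 is a head, given h heads in N tosses."""
    return (
        digamma(h + alpha + 2)
        - digamma(n + 2 * alpha + 3)
        + math.log((n + 2 * alpha + 2) / (h + alpha + 1))
    )


def i_rel_head_large_n(n: int, h: int, alpha: float) -> float:
    """Approximation of the relative gain using psi(x) ~ ln(x-1) + 1/(2(x-1))."""
    return (n - h + alpha + 1) / (2 * (h + alpha + 1) * (n + 2 * alpha + 2))


if __name__ == "__main__":
    n, alpha = 50, -0.5
    gains = [i_rel_head(n, h, alpha) for h in range(n + 1)]
    # h = 0 is the black swan case: N tails followed by a head.
    print("black swan gain:", gains[0])
    print("largest gain at h =", gains.index(max(gains)))
    print("exact vs approx at h = 25:", gains[25], i_rel_head_large_n(n, 25, alpha))
```

In this sketch the largest value occurs at $h_N = 0$, the black swan case, in line with the reading of relative information gain as a degree of surprise.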

5. Expected Information Gain

In this section, we discuss a new scenario: after $N$ tosses but before the $(N+1)$th toss has been taken, can we predict how much information gain will occur in the next toss? The answer is affirmative, as discussed earlier.
After $N$ tosses, we obtain a data sequence $T_N$ with $h_N$ heads. However, we can only estimate the probability $p$ based on the posterior $\Pr(p \mid N, T_N, I)$. The expected value of $p$ can be expressed as:
$$\langle p \rangle = \int_0^1 p \, \Pr(p \mid N, T_N, I) \, dp = \frac{h_N + \alpha + 1}{N + 2\alpha + 2}$$
Based on this expected value of $p$, we can calculate the average of the information gain in the $(N+1)$th toss. We define the expected differential information gain in the $(N+1)$th toss as:
$$\begin{aligned}
\overline{I_{\text{diff}}} &= \langle p \rangle \times I_{\text{diff}}(t_{N+1} = \text{Head}) + \left(1 - \langle p \rangle\right) \times I_{\text{diff}}(t_{N+1} = \text{Tail}) \\
&= \frac{h_N + \alpha + 1}{N + 2\alpha + 2}\,\psi(h_N + \alpha + 2) + \frac{N - h_N + \alpha + 1}{N + 2\alpha + 2}\,\psi(N - h_N + \alpha + 2) - \psi(N + 2\alpha + 3) \\
&\quad + \frac{h_N + \alpha + 1}{N + 2\alpha + 2}\ln\frac{N + 2\alpha + 2}{h_N + \alpha + 1} + \frac{N - h_N + \alpha + 1}{N + 2\alpha + 2}\ln\frac{N + 2\alpha + 2}{N - h_N + \alpha + 1}
\end{aligned}$$
$\overline{I_{\text{diff}}}$ represents the expected value of differential information gain in the $(N+1)$th toss. Similarly, we can define the expected relative information gain as:
$$\begin{aligned}
\overline{I_{\text{rel}}} &= \langle p \rangle \times I_{\text{rel}}(t_{N+1} = \text{Head}) + \left(1 - \langle p \rangle\right) \times I_{\text{rel}}(t_{N+1} = \text{Tail}) \\
&= \frac{h_N + \alpha + 1}{N + 2\alpha + 2}\,\psi(h_N + \alpha + 2) + \frac{N - h_N + \alpha + 1}{N + 2\alpha + 2}\,\psi(N - h_N + \alpha + 2) - \psi(N + 2\alpha + 3) \\
&\quad + \frac{h_N + \alpha + 1}{N + 2\alpha + 2}\ln\frac{N + 2\alpha + 2}{h_N + \alpha + 1} + \frac{N - h_N + \alpha + 1}{N + 2\alpha + 2}\ln\frac{N + 2\alpha + 2}{N - h_N + \alpha + 1}
\end{aligned}$$
Surprisingly, $\overline{I_{\text{diff}}} = \overline{I_{\text{rel}}}$. This relationship holds true for any prior, not being limited to the beta distribution type prior, and furthermore holds for an arbitrary $n$-outcome probabilistic source. Please refer to Appendix C for a detailed proof. This suggests that there is only one choice for the expected information gain.
We first show the numerical results of expected information gain under different priors. It is evident that all data points are above the x-axis, indicating that the expected information gain is positive-definite, as anticipated. Since both $I_{\text{rel}}$ and $\langle p \rangle$ are positive, it follows that $\overline{I_{\text{rel}}}$ must also be positive.
As with the discussions of differential information gain and relative information gain, we are also interested in examining the dependence of expected information gain on $\alpha$ or $h_N$. However, such dependence appears to be weak, as illustrated in Figure 9 and Figure 10. Expected information gain demonstrates strong robustness concerning variations in $\alpha$ and $h_N$.
The asymptotic expression of expected information gain is
$$\overline{I_{\text{diff}}} = \overline{I_{\text{rel}}} = \frac{1}{2N}$$
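The equality of the two expected gains, and the $1/(2N)$ scaling, can be checked numerically for the beta prior. The sketch below is illustrative only: it assumes the closed forms of Appendices A and B, obtains the Tail expressions from the Head ones by the substitution $h_N \to N - h_N$, and uses hypothetical function names with a simple stand-in digamma routine.

```python
import math


def digamma(x: float) -> float:
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    y = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - y * (1.0 / 12 - y * (1.0 / 120 - y / 252))


def i_rel_head(n: int, h: int, alpha: float) -> float:
    """Relative information gain for a head on toss N+1, given h heads in N tosses."""
    return (
        digamma(h + alpha + 2)
        - digamma(n + 2 * alpha + 3)
        + math.log((n + 2 * alpha + 2) / (h + alpha + 1))
    )


def i_diff_head(n: int, h: int, alpha: float) -> float:
    """Differential information gain for a head on toss N+1, given h heads in N tosses."""
    return i_rel_head(n, h, alpha) + h / (h + alpha + 1) - n / (n + 2 * alpha + 2)


def expected_gains(n: int, h: int, alpha: float) -> tuple[float, float]:
    """Posterior-mean-weighted differential and relative gains for the next toss."""
    p_mean = (h + alpha + 1) / (n + 2 * alpha + 2)
    # Tail outcome: same formulas with the roles of heads and tails swapped (h -> n - h).
    diff = p_mean * i_diff_head(n, h, alpha) + (1 - p_mean) * i_diff_head(n, n - h, alpha)
    rel = p_mean * i_rel_head(n, h, alpha) + (1 - p_mean) * i_rel_head(n, n - h, alpha)
    return diff, rel


if __name__ == "__main__":
    for n, h, alpha in [(20, 7, -0.5), (100, 0, 0.0), (1000, 300, 1.0)]:
        diff, rel = expected_gains(n, h, alpha)
        print(f"N={n} h={h} alpha={alpha}: diff={diff:.6e} rel={rel:.6e} 2N*rel={2 * n * rel:.4f}")
```

The two columns agree to floating-point precision because the extra terms in $I_{\text{diff}}$ cancel once they are weighted by $\langle p \rangle$ and $1 - \langle p \rangle$.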

6. Comparison of Three Information Gain Measures, and the Information Increase Principle

From an operational perspective, the information measures we have considered can be categorized into two types: differential information gain and relative information gain pertain to a measurement that has already been made, while expected information gain pertains to a measurement that has yet to be conducted.
Regarding positivity, which is tied to the fundamental question of “Will acquiring more data from measurements lead to a deeper understanding of the system?”: for relative information gain and expected information gain, the answer is affirmative, but differential information gain is positive only under certain specific prior conditions.
All three measures are functions of $N$, $\alpha$, and $h_N$, which characterize the size of the data sequences, the prior information, and the existing data sequence, respectively. How sensitive are these measures to these parameters, particularly for large values of $N$? As we have shown, differential information gain is heavily influenced by all three parameters. It becomes nearly independent of $h_N$ only when $\alpha = -0.5$. Relative information gain is not highly sensitive to the choice of priors. In the case of large values of $N$, relative information gain is affected by both $h_N$ and $N$, whereas expected information gain depends solely on $N$. The comparison between them is summarized in Table 3.
At first, one might have expected that the idea that more data from measurements lead to more knowledge about the system would hold strictly: namely, that the information gain from additional data would always be strictly positive. However, our perspective has been challenged by the observation of black swan events. In the extreme scenario where the first $N$ tosses all result in tails and the $(N+1)$th toss yields a head, a negative information gain in this $(N+1)$th toss may be a more reasonable interpretation. To address this, we propose the
Principle of Information Increase: In a series of interrogations of an $n$-outcome probabilistic source, the information gain from additional data should tend towards positivity in the asymptotic limit. However, in the extreme case where the first $N$ data points are identical and the data of the $(N+1)$th trial is contrary to the previous data, the information gain in this exceptional case should be negative.
Applying this criterion, differential information gain becomes the more appropriate choice for measuring the extent of knowledge contributed by additional data. For the beta distribution prior, $\alpha$ should be constrained within the range of approximately $-0.68 \le \alpha \le -0.5$. If we also consider the robustness of information gain under various given data scenarios, then the Jeffreys binomial prior ($\alpha = -0.5$) emerges as the most favourable choice.

7. Related Work

7.1. Information Increase Principle and the Jeffreys Binomial Prior

In [18,19], Summhammer introduces the idea that more measurements lead to more knowledge about a physical quantity and quantifies the level of knowledge regarding a quantity by assessing its uncertainty range after a series of repeated measurements. Quantified in this manner, the notion can be summarized as: “The uncertainty range of a physical quantity should decrease as the number of measurements increases.” For a quantity $\theta$, the uncertainty range $\Delta\theta$ is a function of the number of measurements:
$$\Delta\theta(N + 1) < \Delta\theta(N)$$
If this quantity is determined by the probability of a two-outcome measurement, such as the probability of obtaining heads ($p$) in a coin toss, then there exists a relationship between the uncertainty range of $\theta$ and that of $p$,
$$\Delta\theta = \left|\frac{\partial\theta}{\partial p}\right| \Delta p$$
In the large $N$ approximation, $\Delta p = \sqrt{p(1-p)/N}$, so that
$$\Delta\theta = \left|\frac{\partial\theta}{\partial p}\right| \sqrt{p(1-p)/N}.$$
One intuitive way to ensure Equation (23) holds is by forcing $\Delta\theta$ to be purely a function of $N$. Observing the relationship between $\Delta\theta$ and $\Delta p$, the simplest solution would be to set $\Delta\theta = \text{const.}/\sqrt{N}$. Under this solution, the relationship between $p$ and $\theta$ takes the following form:
$$\left|\frac{\partial\theta}{\partial p}\right| \sqrt{p(1-p)} = \text{const.},$$
which yields Malus’ law $p(\theta) = \cos^2\!\left(m(\theta - \theta_0)/2\right)$, with $m \in \mathbb{Z}$.
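The step from the constancy condition to Malus’ law can be made explicit. The following is a sketch of the integration, assuming the branch $\partial\theta/\partial p > 0$ and writing the constant as $c$ (a symbol introduced here for illustration):

```latex
\begin{aligned}
\frac{d\theta}{dp} = \frac{c}{\sqrt{p(1-p)}}
  \quad&\Longrightarrow\quad
  \theta - \theta_0 = c\int_0^{p}\frac{dp'}{\sqrt{p'(1-p')}} = 2c\,\arcsin\sqrt{p}, \\
  &\Longrightarrow\quad
  p(\theta) = \sin^2\!\left(\frac{\theta - \theta_0}{2c}\right).
\end{aligned}
```

Setting $m = 1/c$ and shifting $\theta_0$ by $\pi/m$ turns the $\sin^2$ into the $\cos^2$ form quoted above. The restriction $m \in \mathbb{Z}$ is not produced by this integration; it presumably reflects a periodicity requirement on $\theta$, which is an assumption of this sketch.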
Summhammer does not employ information theory to quantify “knowledge about a physical quantity” but instead utilizes the statistical uncertainty associated with the quantity. However, viewed from the Bayesian perspective, if we assume that the prior distribution of the physical quantity, $\theta$, is uniform, the relationship between $\theta$ and $p$ in Equation (25) implies that the prior distribution of the probability follows the Jeffreys binomial prior:
$$\Pr(p \mid I) = \left|\frac{\partial\theta}{\partial p}\right| \Pr(\theta \mid I) = \frac{1}{\pi}\,\frac{1}{\sqrt{p(1-p)}}$$
Thus, in the large $N$ approximation, Summhammer’s result can be interpreted to mean that the prior associated with the probability of a uniformly distributed physical quantity must adhere to the Jeffreys binomial prior.
Goyal [23] introduces an asymptotic Principle of Information Gain, which states that “In $n$ interrogations of a $N$-outcome probabilistic source with an unknown probabilistic vector $\vec{P}$, the amount of Shannon–Jaynes information provided by the data about $\vec{P}$ remains independent of $\vec{P}$ for all $\vec{P}$ in the limit as $n \to \infty$.” Goyal establishes the equivalence between this principle and the Jeffreys rule. Under his Principle of Information Gain, the Jeffreys multinomial prior is then derived. In the case of a two-outcome probabilistic model, the Jeffreys multinomial prior reduces to the Jeffreys binomial prior. Asymptotic analysis reveals that Shannon–Jaynes information is not only independent of the probability vector $\vec{P}$ but also monotonically increases with the number of interrogations. It is worth noting that Shannon–Jaynes information can be viewed as the accumulation of differential information gain. This asymptotic result aligns with our findings: under the Jeffreys binomial prior, the differential information gain is solely dependent on the number of measurements.

7.2. Other Information-Theoretical Motivations of the Jeffreys Binomial Prior

Wootters [20] introduces a novel perspective on the Jeffreys binomial prior, where quantum measurement is employed as a communication channel. In this framework, Alice aims to transmit a continuous variable, denoted as $\theta$, to Bob. Instead of directly sending $\theta$ to Bob, Alice transmits a set of identical coins to Bob, where the probability of getting heads, $p(\theta)$, in each toss is a function of $\theta$. Bob’s objective is to maximize the information about $\theta$ that he can extract from a finite number of tosses. The measure of information used in this context is the mutual information between $\theta$ and the total number of heads, $n$, in $N$ tosses:
$$I(n : \theta) = H(n) - H(n \mid \theta) = -\sum_{n=0}^{N} P(n) \ln P(n) - \left\langle -\sum_{n=0}^{N} p(n \mid p(\theta)) \ln p(n \mid p(\theta)) \right\rangle$$
However, the function $p(\theta)$ is unknown, and the optimization process begins with a set of discrete values, $p_1, p_2, \ldots, p_L$, rather than utilizing the continuous function $p(\theta)$. For each discrete value, $p_k$, there is an associated weight, $w_k$. The mutual information can be expressed as follows:
$$I(n : \theta) = -\sum_{n=0}^{N} P(n) \ln P(n) + \sum_{k=1}^{L} w_k \sum_{n=0}^{N} p(n \mid p_k) \ln p(n \mid p_k)$$
In the large $N$ approximation, it is found that the weight $w$ takes on a specific form:
$$w(p) = \frac{1}{\pi\sqrt{p(1-p)}}$$
which serves a role akin to the prior probability of $p$. Remarkably, this prior probability aligns with the Jeffreys binomial prior. A similar procedure can be extended to the Jeffreys multinomial prior distribution. Wootters’ approach shares similarities with the concept of a reference prior, where the selected prior aims to maximize mutual information, which can be viewed as the expected information gain across all data. The outcome is consistent with the reference prior for multinomial data [24], thus revealing another informational interpretation of the Jeffreys prior.

8. Conclusions

In this paper, motivated by recent work in quantum reconstruction and quantum state tomography, we have investigated the concept of information gain for a two-outcome probabilistic source from an operational perspective. We have introduced an informational postulate, the Principle of Information Increase, which serves as a criterion for selecting the appropriate measure to quantify the extent of information gained from measurements and to guide the choice of prior. We have shown that differential information gain is the more physically meaningful measure when compared to the other contender, relative information gain. We have also uncovered the unanticipated and rather remarkable result that the expected values of these two measures of information gain are equal for any prior and for any $n$-outcome probabilistic source.
Within the set of symmetric beta distributions, we have shown that the Jeffreys binomial prior exhibits notable characteristics. Both Summhammer’s work and ours demonstrate that, under this prior, the intuitive notion that more data from measurements leads to more knowledge about the system holds true, as confirmed by two distinct methods of quantifying knowledge. Additionally, Wootters shows that this prior enables the communication of maximal information, further highlighting its significance. Here, we have formulated the novel notion of robustness and have shown that the Jeffreys binomial prior displays maximal robustness within the set of symmetric beta distributions. Our work raises the intriguing question of whether this feature could be extended to the multinomial Jeffreys prior and whether it would be possible to lift the initial restriction to the set of beta distributions. We also speculate that a deeper understanding of the robustness of the Jeffreys prior remains to be uncovered.

Author Contributions

Conceptualization, P.G.; methodology, P.G.; software, Y.Y.; formal analysis, P.G. and Y.Y.; investigation, Y.Y.; data curation, Y.Y.; writing (original draft preparation), Y.Y.; writing (review and editing), P.G.; visualization, P.G. and Y.Y.; supervision, P.G.; project administration, P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are derived from analytical expressions and numerical calculations detailed within the paper. All calculations were conducted under the specified parameters presented in each graph. The code used for these calculations is straightforward, reflecting direct implementation of the analytical expressions provided. While there are no external data sources, the code used to generate the data is available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Derivation of Differential Information Gain

The posterior is determined by $T_N$ and a prior. For the sake of simplicity, we assume that the prior belongs to the family of beta distributions:
$$\Pr(p \mid I) = \frac{p^{\alpha}(1-p)^{\alpha}}{B(\alpha + 1, \alpha + 1)}$$
where $\alpha > -1$ and $B(x, y)$ is the beta function.
Given $N$, there are $2^N$ different values of $T_N$. However, we may not need to calculate all $2^N$ sequences. Suppose every toss is independent (as is the case in quantum mechanics); then this coin tossing model becomes a binomial distribution. Let $h_N$ be the number of heads in $T_N$; the posterior $\Pr(p \mid N, T_N, I)$ is equivalent to $\Pr(p \mid N, h_N, I)$, and the likelihood is
$$\Pr(h_N \mid N, p, I) = \binom{N}{h_N} p^{h_N} (1-p)^{N - h_N}.$$
Hence, the posterior after $N$ tosses is
$$\Pr(p \mid N, h_N, I) = \frac{\Pr(h_N \mid N, p, I)\,\Pr(p \mid I)}{\int \Pr(h_N \mid N, p, I)\,\Pr(p \mid I)\,dp} = \frac{p^{h_N + \alpha}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)}$$
The information gain in the $(N+1)$th toss is
$$I_{\text{diff}} = D_{\text{KL}}\big(\Pr(p \mid N + 1, \{T_N, t_{N+1}\}, I) \,\|\, \Pr(p \mid I)\big) - D_{\text{KL}}\big(\Pr(p \mid N, h_N, I) \,\|\, \Pr(p \mid I)\big)$$
$I_{\text{diff}}$ is determined by $h_N$, the prior, and the result of the $(N+1)$th toss, $t_{N+1}$. Since $t_{N+1}$ can be either “Head” or “Tail”, the posterior after $N + 1$ tosses is one of
$$\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) = \frac{p^{h_N + \alpha + 1}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}$$
$$\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Tail}\}, I) = \frac{p^{h_N + \alpha}(1-p)^{N - h_N + \alpha + 1}}{B(h_N + \alpha + 1, N - h_N + \alpha + 2)}$$
Taking $t_{N+1} = \text{Head}$, the first term in (A4) becomes
$$\begin{aligned}
&D_{\text{KL}}\big(\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) \,\|\, \Pr(p \mid I)\big) \\
&\quad= \int_0^1 \Pr(p \mid N + 1, h_{N+1}, I) \ln\frac{\Pr(p \mid N + 1, h_{N+1}, I)}{\Pr(p \mid I)}\,dp \\
&\quad= \int_0^1 \frac{p^{h_N + \alpha + 1}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)} \ln\frac{p^{h_N + 1}(1-p)^{N - h_N}\,B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}\,dp \\
&\quad= \int_0^1 \frac{p^{h_N + \alpha + 1}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)} \left\{\ln\!\left[p^{h_N + 1}(1-p)^{N - h_N}\right] + \ln\frac{B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}\right\} dp \\
&\quad= \int_0^1 \frac{p^{h_N + \alpha + 1}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)} \ln\!\left[p^{h_N + 1}(1-p)^{N - h_N}\right] dp + \ln\frac{B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}
\end{aligned}$$
By using the integral
$$\int_0^1 x^{a}(1-x)^{b} \ln(x)\,dx = B(a + 1, b + 1)\left[\psi(a + 1) - \psi(a + b + 2)\right]$$
where $\psi(x)$ is the digamma function, we can obtain the following result:
$$\begin{aligned}
&D_{\text{KL}}\big(\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) \,\|\, \Pr(p \mid I)\big) \\
&\quad= (h_N + 1)\,\psi(h_N + \alpha + 2) + (N - h_N)\,\psi(N - h_N + \alpha + 1) - (N + 1)\,\psi(N + 2\alpha + 3) + \ln\frac{B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}
\end{aligned}$$
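The integral identity used in this step can be sanity-checked by hand in a simple case. The block below works through $a = b = 1$, a choice made here purely for illustration, using $\int_0^1 x^{k}\ln x\,dx = -1/(k+1)^2$ and $\psi(4) - \psi(2) = \tfrac{1}{2} + \tfrac{1}{3}$:

```latex
\begin{aligned}
\int_0^1 x(1-x)\ln x\,dx
  &= \int_0^1 x\ln x\,dx - \int_0^1 x^{2}\ln x\,dx
   = -\frac{1}{4} + \frac{1}{9} = -\frac{5}{36}, \\
B(2,2)\left[\psi(2) - \psi(4)\right]
  &= \frac{1}{6}\left[-\frac{1}{2} - \frac{1}{3}\right] = -\frac{5}{36}.
\end{aligned}
```

Both sides agree, as expected for this special case.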
The second term in (A4) becomes
$$\begin{aligned}
&D_{\text{KL}}\big(\Pr(p \mid N, h_N, I) \,\|\, \Pr(p \mid I)\big) \\
&\quad= \int_0^1 \Pr(p \mid N, h_N, I) \ln\frac{\Pr(p \mid N, h_N, I)}{\Pr(p \mid I)}\,dp \\
&\quad= \int_0^1 \frac{p^{h_N + \alpha}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)} \ln\frac{p^{h_N}(1-p)^{N - h_N}\,B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)}\,dp \\
&\quad= \int_0^1 \frac{p^{h_N + \alpha}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)} \ln\!\left[p^{h_N}(1-p)^{N - h_N}\right] dp + \ln\frac{B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)} \\
&\quad= h_N\,\psi(h_N + \alpha + 1) + (N - h_N)\,\psi(N - h_N + \alpha + 1) - N\,\psi(N + 2\alpha + 2) + \ln\frac{B(\alpha + 1, \alpha + 1)}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)}
\end{aligned}$$
Now, we obtain the final expression of (A4):
$$\begin{aligned}
I_{\text{diff}}(t_{N+1} = \text{Head}) &= D_{\text{KL}}\big(\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) \,\|\, \Pr(p \mid I)\big) - D_{\text{KL}}\big(\Pr(p \mid N, h_N, I) \,\|\, \Pr(p \mid I)\big) \\
&= \psi(h_N + \alpha + 2) - \psi(N + 2\alpha + 3) + \frac{h_N}{h_N + \alpha + 1} - \frac{N}{N + 2\alpha + 2} + \ln\frac{N + 2\alpha + 2}{h_N + \alpha + 1}
\end{aligned}$$
Similarly, we can obtain $I_{\text{diff}}$ when $t_{N+1} = \text{Tail}$:
$$I_{\text{diff}}(t_{N+1} = \text{Tail}) = \psi(N - h_N + \alpha + 2) - \psi(N + 2\alpha + 3) + \frac{N - h_N}{N - h_N + \alpha + 1} - \frac{N}{N + 2\alpha + 2} + \ln\frac{N + 2\alpha + 2}{N - h_N + \alpha + 1}$$
This shows that, for fixed $N$ and $\alpha$, $I_{\text{diff}}(t_{N+1} = \text{Head})$ and $I_{\text{diff}}(t_{N+1} = \text{Tail})$ are related by the substitution $h_N \to N - h_N$; since $h_N$ ranges from 0 to $N$, the two expressions take the same set of values.

Appendix B. Derivation of Relative Information Gain

From Appendix A, we know that the posterior after $N$ tosses is
$$\Pr(p \mid N, T_N, I) = \Pr(p \mid N, h_N, I) = \frac{p^{h_N + \alpha}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 1, N - h_N + \alpha + 1)}$$
Therefore, the posterior after $N + 1$ tosses is
$$\Pr(p \mid N + 1, T_{N+1}, I) = \frac{\Pr(h_N, T_{N+1} \mid p, N + 1, I)\,\Pr(p \mid I)}{\int_0^1 \Pr(h_N, T_{N+1} \mid p, N + 1, I)\,\Pr(p \mid I)\,dp}$$
Depending on the result of $t_{N+1}$, the posterior after $N + 1$ tosses is
$$\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) = \frac{p^{h_N + \alpha + 1}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}$$
$$\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Tail}\}, I) = \frac{p^{h_N + \alpha}(1-p)^{N - h_N + \alpha + 1}}{B(h_N + \alpha + 1, N - h_N + \alpha + 2)}$$
The corresponding relative information gain is
$$\begin{aligned}
I_{\text{rel}}(t_{N+1} = \text{Head}) &= D_{\text{KL}}\big(\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) \,\|\, \Pr(p \mid N, h_N, I)\big) \\
&= \int_0^1 \Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I) \ln\frac{\Pr(p \mid N + 1, \{T_N, t_{N+1} = \text{Head}\}, I)}{\Pr(p \mid N, h_N, I)}\,dp \\
&= \int_0^1 \frac{p^{h_N + \alpha + 1}(1-p)^{N - h_N + \alpha}}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)} \ln\frac{p\,B(h_N + \alpha + 1, N - h_N + \alpha + 1)}{B(h_N + \alpha + 2, N - h_N + \alpha + 1)}\,dp \\
&= \psi(h_N + \alpha + 2) - \psi(N + 2\alpha + 3) + \ln\frac{N + 2\alpha + 2}{h_N + \alpha + 1}
\end{aligned}$$
$$I_{\text{rel}}(t_{N+1} = \text{Tail}) = \psi(N - h_N + \alpha + 2) - \psi(N + 2\alpha + 3) + \ln\frac{N + 2\alpha + 2}{N - h_N + \alpha + 1}$$

Appendix C. Equivalence of Expected Differential Information Gain and Expected Relative Information Gain

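In an $n$-outcome model, the probability of each outcome is $p_i$, and
$$p_1 + p_2 + \cdots + p_n = 1$$
After $N$ “tosses”, the data sequence has the form
$$D_N = (f_1, f_2, \ldots, f_n), \qquad \sum_{i=1}^{n} f_i = N$$
where $f_i$ is the number of $i$th outcomes in these $N$ tosses.
We may use a tuple $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ to represent the probabilities of these outcomes. The prior is just $\Pr(\mathbf{p} \mid I)$, and the posterior based on the data $D_N$ is $\Pr(\mathbf{p} \mid D_N, I)$.
The average value of the $i$th outcome probability is
$$\langle p_i \rangle = \int p_i \Pr(\mathbf{p} \mid D_N, I)\,dp_1\,dp_2 \cdots dp_n$$
Assume the $(N+1)$th toss yields the $i$th outcome; the posterior after this additional toss is
$$\Pr(\mathbf{p} \mid D_N, d_{N+1} = i, I) = \frac{p_i \Pr(\mathbf{p} \mid D_N, I)}{\int p_i \Pr(\mathbf{p} \mid D_N, I)\,dp_1\,dp_2 \cdots dp_n} = \frac{p_i}{\langle p_i \rangle} \Pr(\mathbf{p} \mid D_N, I)$$
Then we can write $I_{\text{diff}}$ as
$$\begin{aligned}
I_{\text{diff}}(d_{N+1} = i) &= D_{\text{KL}}\big(\Pr(\mathbf{p} \mid D_N, d_{N+1} = i, I) \,\|\, \Pr(\mathbf{p} \mid I)\big) - D_{\text{KL}}\big(\Pr(\mathbf{p} \mid D_N, I) \,\|\, \Pr(\mathbf{p} \mid I)\big) \\
&= \int \frac{p_i}{\langle p_i \rangle} \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i \Pr(\mathbf{p} \mid D_N, I)}{\langle p_i \rangle \Pr(\mathbf{p} \mid I)}\,dp_1\,dp_2 \cdots dp_n - \int \Pr(\mathbf{p} \mid D_N, I) \ln\frac{\Pr(\mathbf{p} \mid D_N, I)}{\Pr(\mathbf{p} \mid I)}\,dp_1\,dp_2 \cdots dp_n
\end{aligned}$$
Then the expected differential information gain is given by
$$\begin{aligned}
\overline{I_{\text{diff}}} &= \sum_{i=1}^{n} \langle p_i \rangle\, I_{\text{diff}}(d_{N+1} = i) \\
&= \sum_{i=1}^{n} \int p_i \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i \Pr(\mathbf{p} \mid D_N, I)}{\langle p_i \rangle \Pr(\mathbf{p} \mid I)}\,dp_1 \cdots dp_n - \sum_{i=1}^{n} \langle p_i \rangle \int \Pr(\mathbf{p} \mid D_N, I) \ln\frac{\Pr(\mathbf{p} \mid D_N, I)}{\Pr(\mathbf{p} \mid I)}\,dp_1 \cdots dp_n \\
&= \sum_{i=1}^{n} \left[\int p_i \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i}{\langle p_i \rangle}\,dp_1 \cdots dp_n + \int p_i \Pr(\mathbf{p} \mid D_N, I) \ln\frac{\Pr(\mathbf{p} \mid D_N, I)}{\Pr(\mathbf{p} \mid I)}\,dp_1 \cdots dp_n\right] - \int \Pr(\mathbf{p} \mid D_N, I) \ln\frac{\Pr(\mathbf{p} \mid D_N, I)}{\Pr(\mathbf{p} \mid I)}\,dp_1 \cdots dp_n \\
&= \sum_{i=1}^{n} \int p_i \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i}{\langle p_i \rangle}\,dp_1 \cdots dp_n + \int \left(\sum_{i=1}^{n} p_i\right) \Pr(\mathbf{p} \mid D_N, I) \ln\frac{\Pr(\mathbf{p} \mid D_N, I)}{\Pr(\mathbf{p} \mid I)}\,dp_1 \cdots dp_n - \int \Pr(\mathbf{p} \mid D_N, I) \ln\frac{\Pr(\mathbf{p} \mid D_N, I)}{\Pr(\mathbf{p} \mid I)}\,dp_1 \cdots dp_n \\
&= \sum_{i=1}^{n} \int p_i \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i}{\langle p_i \rangle}\,dp_1 \cdots dp_n
\end{aligned}$$
where we have used $\sum_{i} \langle p_i \rangle = 1$ and $\sum_{i} p_i = 1$.
Similarly, $I_{\text{rel}}$ can be written as
$$\begin{aligned}
I_{\text{rel}}(d_{N+1} = i) &= D_{\text{KL}}\big(\Pr(\mathbf{p} \mid D_N, d_{N+1} = i, I) \,\|\, \Pr(\mathbf{p} \mid D_N, I)\big) \\
&= \int \frac{p_i}{\langle p_i \rangle} \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i \Pr(\mathbf{p} \mid D_N, I)}{\langle p_i \rangle \Pr(\mathbf{p} \mid D_N, I)}\,dp_1 \cdots dp_n \\
&= \int \frac{p_i}{\langle p_i \rangle} \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i}{\langle p_i \rangle}\,dp_1 \cdots dp_n
\end{aligned}$$
Then the expected relative information gain is, accordingly,
$$\overline{I_{\text{rel}}} = \sum_{i=1}^{n} \langle p_i \rangle\, I_{\text{rel}}(d_{N+1} = i) = \sum_{i=1}^{n} \int p_i \Pr(\mathbf{p} \mid D_N, I) \ln\frac{p_i}{\langle p_i \rangle}\,dp_1 \cdots dp_n$$
From (A24) and (A26), we can see that in this $n$-outcome model, the expected differential information gain $\overline{I_{\text{diff}}}$ and expected relative information gain $\overline{I_{\text{rel}}}$ are equal, irrespective of the choice of prior.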

References

  1. Patra, M.K. Quantum state determination: Estimates for information gain and some exact calculations. J. Phys. A Math. Theor. 2007, 40, 10887–10902.
  2. Madhok, V.; Riofrío, C.A.; Ghose, S.; Deutsch, I.H. Information Gain in Tomography–A Quantum Signature of Chaos. Phys. Rev. Lett. 2014, 112, 014102.
  3. Quek, Y.; Fort, S.; Ng, H.K. Adaptive quantum state tomography with neural networks. npj Quantum Inf. 2021, 7, 105.
  4. Gupta, R.; Xia, R.; Levine, R.D.; Kais, S. Maximal Entropy Approach for Quantum State Tomography. PRX Quantum 2021, 2, 010318.
  5. McMichael, R.D.; Dushenko, S.; Blakley, S.M. Sequential Bayesian experiment design for adaptive Ramsey sequence measurements. J. Appl. Phys. 2021, 130, 144401.
  6. Placek, B.; Angerhausen, D.; Knuth, K.H. Analyzing Exoplanet Phase Curve Information Content: Toward Optimized Observing Strategies. Astron. J. 2017, 154, 154.
  7. Ma, C.W.; Ma, Y.G. Shannon information entropy in heavy-ion collisions. Prog. Part. Nuclear Phys. 2018, 99, 120–158.
  8. Grinbaum, A. Elements of information-theoretic derivation of the formalism of quantum theory. Int. J. Quantum Inf. 2003, 1, 289–300.
  9. Brukner, Č.; Zeilinger, A. Information Invariance and Quantum Probabilities. Foundations Phys. 2009, 39, 677–689.
  10. Goyal, P.; Knuth, K.H.; Skilling, J. Origin of Complex Quantum Amplitudes and Feynman’s Rules. Phys. Rev. A 2010, 81, 022109.
  11. Caticha, A. Entropic dynamics, time and quantum theory. J. Phys. A Math. Theor. 2011, 44, 225303.
  12. Masanes, L.; Müller, M.P.; Augusiak, R.; Pérez-García, D. Existence of an information unit as a postulate of quantum theory. Proc. Natl. Acad. Sci. USA 2013, 110, 16373–16377.
  13. De Raedt, H.; Katsnelson, M.I.; Michielsen, K. Quantum theory as plausible reasoning applied to data obtained by robust experiments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150233.
  14. Höhn, P.A. Quantum Theory from Rules on Information Acquisition. Entropy 2017, 19, 98.
  15. Aravinda, S.; Srikanth, R.; Pathak, A. On the origin of nonclassicality in single systems. J. Phys. A Math. Theor. 2017, 50, 465303.
  16. Czekaj, L.; Horodecki, M.; Horodecki, P.; Horodecki, R. Information content of systems as a physical principle. Phys. Rev. A 2017, 95, 022119.
  17. Chiribella, G. Agents, Subsystems, and the Conservation of Information. Entropy 2018, 20, 358.
  18. Summhammer, J. Maximum predictive power and the superposition principle. Int. J. Theor. Phys. 1994, 33, 171–178.
  19. Summhammer, J. Maximum predictive power and the superposition principle. arXiv, 1999; arXiv:quant-ph/9910039.
  20. Wootters, W.K. Communicating through Probabilities: Does Quantum Theory Optimize the Transfer of Information? Entropy 2013, 15, 3130–3147.
  21. Cover, T.M.; Thomas, J.A. Differential Entropy. In Elements of Information Theory; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2005; chapter 8; pp. 243–259.
  22. Jaynes, E.T. Information Theory and Statistical Mechanics. In Statistical Physics; Ford, K.W., Ed.; W. A. Benjamin, Inc.: Tokyo, Japan, 1963; pp. 181–218.
  23. Goyal, P. Prior Probabilities: An Information-Theoretic Approach. AIP Conf. Proc. 2005, 803, 366–373.
  24. Berger, J.O.; Bernardo, J.M. Ordered Group Reference Priors with Application to the Multinomial Problem. Biometrika 1992, 79, 25–37.
Figure 5. Fraction of Negatives (FoN) vs. $\alpha$ for different values of $N$. We identify a critical point, denoted as $\alpha_p$, where the FoN equals zero when $\alpha \le \alpha_p$. The critical point exhibits a gradual variation with respect to $N$ following these patterns: (i) for small $N$, $\alpha_p$ is in close proximity to $-0.68$; (ii) for large $N$, $\alpha_p$ tends to $-0.5$.
Figure 6. Robustness of differential information gain ($I_{\text{diff}}$). The y-axis represents the logarithm of the standard deviation of $I_{\text{diff}}$ over all possible $h_N$ values, while the x-axis depicts various selections of $\alpha$. A smaller standard deviation indicates that different $h_N$ values lead to nearly the same result, implying greater independence of $I_{\text{diff}}$ from $h_N$. This independence signifies the robustness of $I_{\text{diff}}$ with respect to the natural variability in $h_N$, as we consider $h_N$ to be solely determined by nature. The standard deviation, given a fixed $N$, is notably influenced by $\alpha$, and there exists an $\alpha$ value at which the dependence on $h_N$ is minimized. This particular $\alpha$ value approaches $-0.5$ as $N$ increases.
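The robustness measure described in the Figure 6 caption can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes the large-$N$ form of $I_{\mathrm{diff}}$ for a head on toss $N+1$, read as $\frac{2h_N+1}{2(h_N+\alpha+1)} - \frac{2N+1}{2(N+2\alpha+2)}$ (a reconstruction of the Table 2 entry), whereas the figure itself is presumably based on the exact expression. The function names and the grid of $\alpha$ values are hypothetical. It returns the standard deviation rather than its logarithm, because under this assumed form the spread vanishes at $\alpha = -0.5$.

```python
import math

def i_diff_asymptotic(h, n, alpha):
    """Assumed asymptotic I_diff when toss N+1 is a head (reconstructed form)."""
    return (2 * h + 1) / (2 * (h + alpha + 1)) - (2 * n + 1) / (2 * (n + 2 * alpha + 2))

def std_over_heads(n, alpha):
    """Population standard deviation of I_diff over h_N = 0, 1, ..., N."""
    values = [i_diff_asymptotic(h, n, alpha) for h in range(n + 1)]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def most_robust_alpha(n, alphas):
    """Return the alpha in the grid with the smallest spread of I_diff over h_N."""
    return min(alphas, key=lambda a: std_over_heads(n, a))

# Hypothetical grid of symmetric beta-prior parameters, from -0.9 to 1.0 (alpha > -1).
alphas = [round(-0.9 + 0.05 * k, 2) for k in range(39)]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, most_robust_alpha(n, alphas))
```

Under the assumed form, setting $\alpha = -0.5$ makes the first term equal to 1 for every $h_N$, so the dependence on $h_N$ drops out entirely; this is one way to see why the minimum in the figure sits near the Jeffreys prior.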
Figure 7. Relative information gain ($I_{\mathrm{rel}}$) over different priors. The y-axis represents the value of $I_{\mathrm{rel}}$, while the x-axis represents $N$. For each $N$, there are $N + 1$ different values of $I_{\mathrm{rel}}$. Note that $I_{\mathrm{rel}}$ is consistently positive across these selected priors. As with the differential information gain, each graph displays numerous divergent lines. However, the shape of these divergent lines remains remarkably consistent across varying values of $\alpha$. The majority of these lines fall within the range of $I_{\mathrm{rel}}$ between 0 and 0.2.
Figure 8. Robustness of relative information gain ($I_{\mathrm{rel}}$). The y-axis represents the standard deviation of $I_{\mathrm{rel}}$ across all possible values of $h_N$. This demonstrates the substantial independence of $I_{\mathrm{rel}}$ from $h_N$. Additionally, as $N$ increases, the standard deviations tend to zero for all priors.
Figure 9. Expected information gain vs. N for fixed α. The y-axis represents the value of expected information, while the x-axis represents the value of N. Notably, all expected information gain values are positive. The shapes of each graph exhibit remarkable similarity, with a limited number of divergent lines. As α increases, the number of divergent lines decreases.
Figure 10. Robustness of expected information gain. The y-axis represents the standard deviation of the expected information gain over all possible values of $h_N$, while the x-axis represents the value of $N$. As $N$ increases, and even for relatively small values of $N$, the standard deviation tends toward zero for all priors.
Table 1. Fraction of Negatives (FoN) under selected priors. A comparison between numerical results and asymptotic results demonstrates their agreement.

| $\alpha$ | FoN (numerical result, $N = 1000$) | FoN (asymptotic result) | Discrepancy between the two results |
|---|---|---|---|
| −0.7 | 0 | 0 | 0 |
| −0.6 | 0.001 | 0 | 0.1% |
| −0.5 | 0.013 | 0 | 1.3% |
| −0.4 | 0.144 | 0.143 | 0.1% |
| 0 | 0.334 | 0.333 | 0.1% |
| 1 | 0.429 | 0.429 | 0 |
| 3 | 0.467 | 0.467 | 0 |
Table 2. Comparison of characteristics of two measures of information gain.

| Information gain measure | Asymptotic form ($t_{N+1} = \text{Head}$) | Asymptotic sensitivity to prior |
|---|---|---|
| Differential information gain $I_{\mathrm{diff}}$ | $\frac{2h_N + 1}{2(h_N + \alpha + 1)} - \frac{2N + 1}{2(N + 2\alpha + 2)}$ | Heavily dependent upon prior. Independent of $h_N$ for certain priors ($\alpha = -1/2$). |
| Relative information gain $I_{\mathrm{rel}}$ | $\frac{1}{2(h_N + \alpha + 1)} - \frac{1}{2(N + 2\alpha + 2)}$ | Insensitive to prior. For large $N$, only affected by $h_N$. |
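As a rough consistency check between Table 2 and Table 1, the asymptotic forms can be evaluated directly. The sketch below is illustrative only and rests on an assumption: it reads the Table 2 entries as $I_{\mathrm{diff}} \approx \frac{2h_N+1}{2(h_N+\alpha+1)} - \frac{2N+1}{2(N+2\alpha+2)}$ and $I_{\mathrm{rel}} \approx \frac{1}{2(h_N+\alpha+1)} - \frac{1}{2(N+2\alpha+2)}$ for a head on toss $N+1$. The function names are hypothetical. It counts the fraction of $h_N \in \{0, \dots, N\}$ for which the assumed $I_{\mathrm{diff}}$ is negative, and checks that the assumed $I_{\mathrm{rel}}$ stays positive.

```python
def i_diff_asymptotic(h, n, alpha):
    """Assumed asymptotic differential information gain (toss N+1 is a head)."""
    return (2 * h + 1) / (2 * (h + alpha + 1)) - (2 * n + 1) / (2 * (n + 2 * alpha + 2))

def i_rel_asymptotic(h, n, alpha):
    """Assumed asymptotic relative information gain (toss N+1 is a head)."""
    return 1 / (2 * (h + alpha + 1)) - 1 / (2 * (n + 2 * alpha + 2))

def fraction_of_negatives(n, alpha):
    """Fraction of h_N in 0..N for which the assumed I_diff is negative."""
    negatives = sum(1 for h in range(n + 1) if i_diff_asymptotic(h, n, alpha) < 0)
    return negatives / (n + 1)

def i_rel_always_positive(n, alpha):
    """True if the assumed I_rel is positive for every h_N in 0..N."""
    return all(i_rel_asymptotic(h, n, alpha) > 0 for h in range(n + 1))

if __name__ == "__main__":
    n = 1000
    for alpha in (-0.7, -0.6, -0.5, -0.4, 0, 1, 3):  # the priors listed in Table 1
        fon = fraction_of_negatives(n, alpha)
        print(f"alpha={alpha:>4}: FoN={fon:.3f}, I_rel>0: {i_rel_always_positive(n, alpha)}")
```

If the assumed forms are right, the printed FoN values should land close to the asymptotic column of Table 1. The residual differences from the numerical column would then reflect the gap between the asymptotic and exact expressions.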
Table 3. Comparison of three information gain measures.

| Type of information gain | Positivity | Robustness about $T_N$ |
|---|---|---|
| Differential | Strictly positive when $\alpha < \alpha_p$, where $\alpha_p \approx -0.68$. Asymptotically positive when $\alpha \le -0.5$. | Robust only for the beta distribution prior with $\alpha = -0.5$. |
| Relative | Strictly positive for all priors. | No significant differences in robustness among beta distribution priors. |
| Expected | Strictly positive for all priors. | No significant differences in robustness among beta distribution priors. |

Share and Cite

MDPI and ACS Style

Yu, Y.; Goyal, P. Principle of Information Increase: An Operational Perspective on Information Gain in the Foundations of Quantum Theory. Information 2024, 15, 287. https://doi.org/10.3390/info15050287