1. Introduction
Patterns of nature often follow probability distributions. Physical processes lead to an exponential distribution of energy levels among a collection of particles. Random fluctuations about mean values generate a Gaussian distribution. In biology, the age of cancer onset tends toward a gamma distribution. Economic patterns of income typically match variants of the Pareto distributions with power law tails.
Theories in those different disciplines attempt to fit observed patterns to an underlying generative process. If a generative model predicts the observed pattern, then the fit promotes the plausibility of the model. For example, the gamma distribution for the ages of cancer onset arises from a multistage process [1]. If cancer requires k different rate-limiting events to occur, then, by classical probability theory, the simplest model for the waiting time for the kth event to occur is a gamma distribution.
Many other aspects of cancer biology tell us that the process indeed depends on multiple events. However, how much do we really learn by this inverse problem, in which we start with an observed distribution of outcomes and then try to infer underlying process? How much does an observed distribution by itself constrain the range of underlying generative processes that could have led to that observed pattern?
The main difficulty of the inverse problem has to do with the key properties of commonly observed patterns. The common patterns are almost always those that arise by a wide array of different underlying processes [2,3]. We may say that a common pattern has a wide basin of attraction, in the sense that many different initial starting conditions and processes lead to that same common outcome. For example, the central limit theorem is, in essence, the statement that adding up all sorts of different independent processes often leads to a Gaussian distribution of fluctuations about the mean value.
In general, the commonly observed patterns are common because they are consistent with so many different underlying processes and initial conditions. The common patterns are therefore particularly difficult with regard to the inverse problem of going from observed distributions to inferences about underlying generative processes. However, an observed pattern does provide some information about the underlying generative process, because only certain generative processes lead to the observed outcome. How can we learn to read a mathematical expression of a probability pattern as a statement about the family of underlying processes that may generate it?
2. Overview
In this article, I will explain how to read continuous probability distributions as simple statements about underlying process. I presented the technical background in an earlier article [4], with additional details in other publications [3,5,6]. Here, I focus on developing the intuition that allows one to read probability distributions as simple sentences. I also emphasize key unsolved puzzles in the understanding of commonly observed probability patterns.
Section 3 introduces the four components of probability patterns: the dissipation of all information, except the preservation of average values, taken over the measurement scale that relates changes in observed values to changes in information, and the underlying scale on which information dissipates relative to alternative scales on which probability pattern may be expressed.
Section 4 develops an information theory perspective. A distribution can be read as a simple statement about the scaling of information with respect to the magnitude of the observations. Because measurement has a natural interpretation in terms of information, we can understand probability distributions as pure expressions of measurement scales.
Section 5 illustrates the scaling of information by the commonly observed log-linear pattern. Information in observations may change logarithmically at small magnitudes and linearly at large magnitudes. The classic gamma distribution is the pure expression of the log-linear scaling of information.
Section 6 presents the inverse linear-log scale. The Lomax and generalized Student’s distributions follow that scale. Those distributions include the classic exponential and Gaussian forms in their small-magnitude linear domain, but add power law tails in their large-magnitude logarithmic domain.
Section 7 shows that the commonly observed log-linear and linear-log scales form a dual pair through the Laplace transform. That transform changes addition of random variables into multiplication, and multiplication into addition. Those arithmetic changes explain the transformation between multiplicative log scaling and additive linear scaling. In general, integral transforms describe dualities between pairs of measurement scales, clarifying the relations between commonly observed probability patterns.
Section 8 considers cases in which information dissipates on one scale, but we observe probability pattern on a different scale. The log-normal distribution is a simple example, in which observations arise as products of perturbations. In that case, information dissipates on the additive log scale, leading to a Gaussian pattern on that log scale.
Section 8 continues with the more interesting case of extreme values, in which one analyzes the largest or smallest value of a sample. For extreme values, dissipation of information happens on the scale of cumulative probabilities, but we express probability pattern on the typical scale for the relative probability at each magnitude. Once one recognizes the change in scale for extreme value distributions, those distributions can easily be read in terms of my four basic components.
Section 9 returns to dual scales connected by integral transforms. In superstatistics, one evaluates a parameter of a distribution as a random variable rather than a fixed value. Averaging over the distribution of the parameter creates a special kind of integral transform that changes the measurement scale of a distribution, altering that original distribution into another form with a different scaling relation.
Section 10 considers alternative perspectives on generative process. We may observe pattern on one scale, but the processes that generated that pattern may have arisen on a dual scale. For example, we may observe the classic gamma probability pattern of log-linear scaling, in which we measure the time per event. However, the underlying generative process may have a more natural interpretation on the inverse linear-log scaling of the Lomax distribution. That inverse scale has dimensions of events per unit time, or frequency.
Section 11 reiterates how to read probability distributions. I then introduce the Lévy stable distributions, in which dual scales relate to each other by the Fourier integral transform. The Lévy case connects log scaling in the tails of distributions to constraints in the dual domain on the average of power law expressions. The average of power law expressions describes fractional moments, which associate with the common stretched exponential probability pattern.
Section 12 explains the relations between different probability patterns. Because a probability pattern is a pure expression of a measurement scale, the genesis of probability patterns and the relations between them reduce to understanding the origins of measurement scales. The key is that the dissipation of information and maximization of entropy set a particular invariance structure on measurement scales. That invariance strongly influences the commonly observed scales and thus the commonly observed patterns of nature.
Section 12 continues by showing that particular aspects of invariance lead to particular patterns. For example, shift invariance with respect to the information in underlying values and transformed measured values leads to exponential scaling of information. By contrast, affine invariance leads to linear scaling. The distinctions between broad families of probability distributions turn on this difference between shift and affine invariance for the information in observations.
Section 13 presents a broad classification of measurement scales and associated probability patterns. Essentially all commonly observed distributions arise within a simple hierarchically generated sequence of measurement scales. That hierarchy shows one way to consider the genesis of the common distributions and the relations between them. I present a table that illustrates how the commonly observed distributions fit within this scheme.
Section 14 considers the most interesting unsolved puzzle: Why do linear and logarithmic scaling dominate the base scales of the commonly observed patterns? One possibility is that linear and log scaling express absolute and relative incremental information, the two most common ways in which information may scale. Linear and log scaling also have a natural association with addition and multiplication, suggesting a connection between common arithmetic operations and common scaling relations.
Section 15 suggests one potential solution to the puzzle of why commonly observed measurement scales are simple. Underlying values may often be transformed by multiple processes before measurement. Each transformation may be complex, but the aggregate transformation may smooth into a simple relation between initial inputs and final measured outputs. The scaling that defines the associated probability pattern must provide invariant information with respect to underlying values or final measured outputs. If the ultimate transformation of underlying values to final measured outputs is simple, then the required invariance may often define a simple information scaling and associated probability pattern.
The Discussion summarizes key points and emphasizes the major unsolved problems.
3. The Four Components of Probability Patterns
To parse probability patterns, one must distinguish four properties. In this section, I begin by briefly describing each property. I then match the properties to the mathematical forms of different probability patterns, allowing one to read probability distributions in terms of the four basic components. Later sections develop the concepts and applications.
First, dissipation of information occurs because most observable phenomena arise by aggregation over many smaller scale processes. The multiple random, small scale fluctuations often erase the information in any particular lower level process, causing the aggregate observable probability pattern to be maximally random subject to constraints that preserve information [2,7,8].
Second, average values tend to be the only preserved information after aggregation has dissipated all else. Jaynes [2,7,8] developed dissipation of information and constraint by average values as the key principles of maximum entropy, a widely used approach to understanding probability patterns. I extended Jaynesian maximum entropy by the following components [4–6].
Third, average values may arise on different measurement scales. For example, in large scale fluctuations, one might only be able to obtain information about the logarithm of the underlying values. The constrained average would be the mean of the logarithmic values, or the geometric mean. The information in measurements may change with magnitude. In some cases, the scale may be linear for small fluctuations but logarithmic for large fluctuations, leading to an observed linear-log scale of observations.
Fourth, the measurement scale on which information dissipates may differ from the scale on which one observes pattern. For example, a multiplicative process causes information to dissipate on the additive logarithmic scale, but we may choose to analyze the observed multiplicative pattern. Alternatively, information may dissipate by the multiplication of the cumulative probabilities that individual fluctuations fall below some threshold, but we may choose to analyze the extreme values of aggregates on a transformed linear scale.
The measurement scaling defines the various commonly observed probability distributions. By learning to parse the scaling relations of measurement implicit in the mathematical expressions of probability patterns, one can read those expressions as simple statements about underlying process. The previously hidden familial relations between different kinds of probability distributions become apparent through their related forms of measurement scaling.
3.2. Constraint by Average Values
Suppose that we are studying the distribution of energy levels in a population of particles. We want to know the probability that any particle has a certain level of energy. The probability distribution over the population describes the probability of different levels of energy per particle.
Typically, there is a certain total amount of energy to be distributed among the particles in the population. The fixed total amount of energy constrains the average energy per particle.
To find the distribution of energy, we could reasonably assume that many different processes operate at a small scale, influencing each particle in multiple ways. Each small scale process often has a random component. In the aggregate of the entire population, those many small scale random fluctuations tend to increase the total entropy in the population, subject to the constraint that the mean is set extrinsically.
For any pattern influenced by small-scale random fluctuations, the only constraint on randomness may be a given value for the mean. If so, then pattern follows maximum entropy subject to a constraint on the mean [7,8].
3.2.1. Constraint on the Mean
When we maximize the entropy in Equation (1) to find the probability distribution consistent with the inevitable dissipation of information and increase in entropy, we must also account for the constraint on the average value of observable events. The technical approach to maximizing a quantity, such as entropy, subject to a constraint is the method of Lagrange multipliers. In particular, we must maximize the quantity in Equation (2), in which the constraint on the average value is written as ∫ y py dy = μ. The integral term of the constraint is the average value of y over the distribution py, and the term, μ, is the actual average value set by constraint. The method guarantees that we find a distribution, py, that satisfies the constraint, in particular that the average of the distribution that we find is indeed equal to the given constraint on the average, μ. We must also set the total probability to be one, expressed by the constraint ∫ py dy = 1.
We find the maximum of Equation (2) by solving ∂ε/∂py = 0 for the constants κ and λ that satisfy the constraint on total probability and the constraint on average value, yielding
py ∝ e−λy
in which λ = 1/μ, and ∝ means “is proportional to.” The total probability over a distribution must be one. If we use that constraint on total probability, we can find κ such that ψe−λy would be an equality rather than a proportionality for py for some constant, ψ. That is easy to do, but adds additional steps and a lot of notational complexity without adding any further insight. I therefore present distributions without the adjusting constants, and write the distributions as “py ∝” to express the absence of the constants and the proportionality of the expression.
The expression in Equation (3) is known as the exponential distribution, or sometimes the Gibbs or Boltzmann distribution. We can read the distribution as a simple statement. The exponential distribution is the probability pattern for a positive variable that is most random, or has least information, subject to a constraint on the mean. Put another way, the distribution contains information only about the mean, and nothing else.
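As a concrete illustration of the argument above, the following sketch discretizes a positive variable, numerically maximizes entropy subject to a fixed mean, and compares the result with the exponential form py ∝ e−λy with λ = 1/μ. The grid, the assumed mean μ = 2, and the optimizer settings are my own illustrative choices, not part of the article.

```python
# A minimal numerical sketch (not from the article): discretize a positive
# variable, maximize entropy subject to a fixed mean, and compare the result
# with the exponential form p_y ∝ exp(-y/mu).
import numpy as np
from scipy.optimize import minimize

y = np.linspace(0.05, 15.0, 120)
dy = y[1] - y[0]
mu = 2.0                                    # the constrained mean (assumed value)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)
    return np.sum(p * np.log(p)) * dy       # negative of the entropy integral

constraints = (
    {"type": "eq", "fun": lambda p: np.sum(p) * dy - 1.0},     # total probability = 1
    {"type": "eq", "fun": lambda p: np.sum(y * p) * dy - mu},  # mean = mu
)
p0 = np.full(y.shape, 1.0 / (y[-1] - y[0]))                    # uniform starting guess
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(0.0, None)] * y.size, options={"maxiter": 500})

p_exp = np.exp(-y / mu)
p_exp /= np.sum(p_exp) * dy                 # exponential with lambda = 1/mu
print(np.max(np.abs(res.x - p_exp)))        # small when the optimum matches the exponential
```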
3.2.2. Constraint on the Average Fluctuations from the Mean
Sometimes we are interested in fluctuations about a mean value or central location. For example, what is the distribution of errors in measurements? How do average values in samples vary around the true mean value? In these cases, we may describe the intrinsic variability by the variance. If we constrain the variance, we are constraining the average squared distance of fluctuations about the mean.
We can find the distribution that is most random subject to a constraint on the variance by using the variance as the constraint in Equation (2). In particular, let ∫ (y − μ)2 py dy = σ2, in which σ2 is the variance and μ is the mean. This expression constrains the squared distance of fluctuations, (y − μ)2, averaged over the probability distribution of fluctuations, py, to be the given constraint, σ2.
Without loss of generality, we can set μ = 0 and interpret y as a deviation from the mean, which simplifies the constraint to be ∫ y2 py dy = σ2. We can then write the constraint on the mean or the constraint on the variance as a single general expression
∫ fy py dy = f̄
in which fy is y or y2 for constraints on the mean or variance, respectively, and f̄ is the extrinsically set constraint on the mean or variance, respectively. Then the maximization of entropy subject to constraint takes the general form
py ∝ e−λfy
If we constrain the mean, then fy = y and λ = 1/μ, yielding the exponential form in Equation (3). If we constrain the variance, then fy = y2 and λ = 1/2σ2, yielding the Gaussian distribution.
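A quick numerical check of the variance-constrained claim, using scipy (my illustration, not from the article): among zero-mean distributions with the same variance, the Gaussian has the larger entropy. The Laplace distribution is an arbitrary comparison with matched variance.

```python
# Among zero-mean distributions with equal variance, the Gaussian maximizes entropy.
import numpy as np
from scipy import stats

sigma = 1.0
b = sigma / np.sqrt(2.0)                    # Laplace scale giving variance sigma^2
print(stats.norm(scale=sigma).entropy())    # ~1.419, the maximum for this variance
print(stats.laplace(scale=b).entropy())     # ~1.347, lower entropy, same variance
```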
3.3. The Measurement Scale for Average Values
The constraint on randomness may be transformed by the measurement scale [4,6]. We may write the transformation of the observable values, fy, as T(fy) = Tf. Here, fy is y or y2 depending on whether we are interested in the average value or in the average distance from a central location, and T is the measurement scale. Thus, the constraint in Equation (4) can be written as
∫ Tf py dy = T̄f
which generalizes the solution in Equation (5) to
py ∝ e−λTf
This form provides a simple way to express many different probability distributions, by simply choosing Tf to be a constraint that matches the form of a distribution. For example, the power law distribution, py ∝ y−λ, corresponds to the measurement scale Tf = log(y). In general, finding the measurement scale and the associated constraint that lead to a particular form for a distribution is useful, because the constraint concisely expresses the information in a probability pattern [4,6].
Simply matching probability patterns to their associated measurement scales and constraints leaves open the problem of why particular scalings and constraints arise. What sort of underlying generative processes lead to a particular scaling relation, Tf, and therefore attract to the same probability pattern? I address that crucial question in later sections. For now, it is sufficient to note that we have a simple way to connect the dissipation of information and constraint to probability patterns.
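The following sketch expresses the recipe in code terms: pick a measurement scale Tf, form the density py ∝ e−λTf, and confirm that Tf = log(y) reproduces a power law. The grid, λ, and the grid-based normalization are my own illustrative choices.

```python
# Sketch (illustrative): build p_y ∝ exp(-lambda * T_f) for a chosen scale T_f
# and confirm that T_f = log(y) gives a power law.
import numpy as np

def maxent_density(y, T, lam):
    """Unnormalized exp(-lam * T(y)), normalized on the uniform grid y."""
    p = np.exp(-lam * T(y))
    return p / (p.sum() * (y[1] - y[0]))

y = np.linspace(1.0, 100.0, 5000)
lam = 2.5
p_from_scale = maxent_density(y, np.log, lam)   # T_f = log(y)
p_power = y ** (-lam)
p_power /= p_power.sum() * (y[1] - y[0])        # the power law y^(-lambda), normalized
print(np.allclose(p_from_scale, p_power))       # True: identical shapes
```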
5. The Log-linear Scale
Cancer incidence illustrates how probability patterns may express simple scaling relations [1]. For many cancers, the probability py that an individual develops disease near the age y, among all those born at age zero, is approximately
py ∝ yk−1e−αy
which is the gamma probability pattern. A simple generative model that leads to a gamma pattern is the waiting time for the kth event to occur. For example, if cancer developed only after k independent rate-limiting barriers or stages have been passed, then the process of cancer progression would lead to a gamma probability pattern.
That match between a generative multistage model of process and the observed gamma pattern led many people to conclude that cancer develops by a multistage process of progression. By fitting the particular incidence data to a gamma pattern and estimating the parameter k, one could potentially estimate the number of rate-limiting stages required for cancer to develop. Although this simple model does not capture the full complexity of cancer, it does provide the basis for many attempts to connect observed patterns for the age of onset to the underlying generative processes that cause cancer [1].
Let us now read the gamma pattern as an expression about the scaling of probability in relation to magnitude. We can then compare the general scaling relation that defines the gamma pattern to the different kinds of processes that may generate a pattern matched to the gamma distribution.
The probability expression in Equation (14) can be divided into two terms. The first term is
yk−1
which matches our general expression for probability patterns in Equation (7) with Tf = log(y). This equivalence associates the power law component of the gamma distribution with a logarithmic measurement scale.
For the second term, e−αy, in Equation (14), we have Tf = y, which expresses linear scaling in y. Thus, the two terms in Equation (14) correspond to logarithmic and linear scaling
py ∝ e(k−1)log(y)−αy
which leads to an overall measurement function that has the general log-linear form Tf = log(y) − by. For the parameters in this example, b = α/(k − 1).
When y is small, Tf ≈ log(y), and the logarithmic term dominates changes in the information of the probability pattern, dSy, and the measurement scale, dTf. By contrast, when y is large, Tf ≈ − by, and the linear term dominates. Thus, the gamma probability pattern is simply the expression of logarithmic scaling at small magnitudes and linear scaling at large magnitudes. The value of b determines the magnitudes at which the different scales dominate.
Generative processes that create log-linear scaling typically correspond to a gamma probability pattern. Consider the classic generative process for the gamma, the waiting time for the kth independent event to occur. When the process begins, none of the events has occurred. For all k events to occur in the next time interval, all must happen essentially simultaneously.
The probability that multiple independent events occur essentially simultaneously is the product of the probabilities for each event to occur. Multiplication leads to power law expressions and logarithmic scaling. Thus, at small magnitudes, the change in information scales with the change in the logarithm of time.
By contrast, at large magnitudes, after much time has passed, either the kth event has already happened, and the waiting is already over, or k − 1 events have happened, and we are waiting only for the last event. Because we are waiting for a single event that occurs with equal probability in any time interval, the scaling of information with magnitude is linear. Thus, the classic waiting time problem is a generative model that has log-linear scaling.
The gamma pattern itself is a pure expression of log-linear scaling. That probability pattern matches any underlying generative process that converges to logarithmic scaling at small magnitudes and linear scaling at large magnitudes. Many processes may be essentially multiplicative at small scales and approximately linear at large scales. All such generative processes will also converge to the gamma probability distribution. In the general case, k is a continuous parameter that influences the magnitudes at which logarithmic or linear scaling dominate.
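A simulation sketch of the classic generative model just described: summing k independent exponential waiting times and fitting a gamma form. The rate, k, and sample size are my own illustrative values; the article does not prescribe them.

```python
# Waiting time for the k-th independent event: a sum of k exponential waiting
# times, whose distribution matches the gamma pattern (illustrative parameters).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, rate, reps = 4, 1.0, 100_000
waits = rng.exponential(1.0 / rate, size=(reps, k)).sum(axis=1)

shape, loc, scale = stats.gamma.fit(waits, floc=0)   # fit the y^(k-1) e^(-alpha*y) form
print(shape, scale)                                  # shape ~ k, scale ~ 1/rate
```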
Later, I will return to this important link between generative process and measurement scale. For now, let us continue to follow the consequences of various scaling relations.
The log-linear scale contains the purely linear and the purely logarithmic as special cases. In Equation (14), as k → 1, the probability pattern becomes the exponential distribution, the pure expression of linear scaling. Alternatively, as α → 0, the probability pattern approaches the power law form, the pure expression of logarithmic scaling.
6. The Linear-log Scale
Another commonly observed pattern follows a Lomax or Pareto Type II form
py ∝ (1 + y/α)−k
which is associated with the measurement scale Tf = log(1 + y/α). This distribution describes linear-log scaling. For small values of y relative to α, we have Tf → y/α, and the distribution becomes
py ∝ e−ky/α
which is the pure expression of linear scaling. For large values of y relative to α, we have Tf → log(y/α), and the distribution becomes
py ∝ (y/α)−k
which is the pure expression of logarithmic scaling.
In these examples, I have used fy = y in the scaling relation Tf = log(1 + fy/α). We can add to the forms of the linear-log scale by using fy = (y − μ)2, describing squared deviations from the mean. To simplify the notation, let μ = 0. Then Equation (17) becomes
py ∝ (1 + y2/α)−k
which is called the generalized Student’s or q-Gaussian distribution [12]. When the deviations from the mean are relatively small compared with α, linear scaling dominates, and the distribution is Gaussian, py ∝ e−ky2/α. When deviations from the mean are relatively large compared with α, logarithmic scaling dominates, causing power law tails, py ∝ y−2k.
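A small numerical check of those two limits (parameter values are my own illustrative choices): the form (1 + y2/α)−k tracks the Gaussian exp(−ky2/α) when y2 is small relative to α and the power law (y2/α)−k when y2 is large.

```python
# Two limits of the generalized Student's form (alpha, k, test points illustrative).
import numpy as np

alpha, k = 4.0, 3.0
student = lambda y: (1 + y**2 / alpha) ** (-k)
gauss   = lambda y: np.exp(-k * y**2 / alpha)     # small-deviation limit
power   = lambda y: (y**2 / alpha) ** (-k)        # large-deviation limit
print(student(0.1) / gauss(0.1))                  # ~1: Gaussian regime
print(student(50.0) / power(50.0))                # ~1: power law tail regime
```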
7. Relation between Linear-log and Log-linear Scales
The specific way in which these two scales relate to each other provides much insight into pattern and process.
7.1. Common Scales and Common Patterns
The log-linear and linear-log scales include most of the commonly observed probability patterns. The purely linear exponential and Gaussian distributions arise as special cases. Pure linearity is perhaps rare, because very large or very small values often scale logarithmically. For example, we measure distances in our immediate surroundings on a linear scale, but typically measure very large cosmological distances on a logarithmic scale, leading to a linear-log scaling of distance.
On the linear-log scale, positive variables often follow the Lomax distribution of Equation (17). The Lomax expresses an exponential distribution with a power law tail. Over a sufficiently wide range of magnitudes, many seemingly exponential distributions may in fact grade into a power law tail, because of the natural tendency for the information at extreme magnitudes to scale logarithmically. Alternatively, many distributions that appear to be power laws may in fact grade into an exponential shape at small magnitudes.
When studying deviations from the mean, the linear-log scale leads to the generalized Student’s form. That distribution has a primarily Gaussian shape but with power law tails. The tendency for the tails to grade into a power law may again be the rule when studying pattern over a sufficiently wide range of magnitudes [12].
In some cases, the logarithmic scaling regime occurs at small magnitudes rather than large magnitudes. Those cases of log-linear scaling typically lead to a gamma probability pattern. Many natural observations approximately follow the gamma pattern, which includes the chi-square pattern as a special case.
7.2. Relations between the Scales
The linear-log and log-linear scales seem to be natural inverses of each other. However, what does an inverse scaling mean? We obtain some clues by noting that the mathematical relation between the scales arises from
(1 + fy/α)−k ∝ ∫0∞ e−fyx xk−1e−αx dx
The right side is the Laplace transform of the log-linear gamma pattern in the variable x, here interpreted for real-valued fy. That transform inverts the scale to the linear-log form, which is the Lomax distribution for fy = y or the generalized Student’s distribution for fy = y2.
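The pairing can be checked numerically. The sketch below is my own illustration (α, k, and the evaluation points are arbitrary): it integrates the gamma form against the Laplace kernel and compares with the Lomax form; the ratio is the constant Γ(k)/αk.

```python
# The Laplace transform of x^(k-1) e^(-alpha*x) is proportional to (1 + s/alpha)^(-k).
import numpy as np
from scipy.integrate import quad

alpha, k = 2.0, 3.0

def laplace_of_gamma(s):
    val, _ = quad(lambda x: np.exp(-s * x) * x**(k - 1) * np.exp(-alpha * x), 0, np.inf)
    return val

s = np.array([0.5, 1.0, 5.0])
lhs = np.array([laplace_of_gamma(si) for si in s])
rhs = (1 + s / alpha) ** (-k)
print(lhs / rhs)     # constant ratio Gamma(k)/alpha^k = 0.25 for these values
```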
This relation between scales is easily understood with regard to mathematical operations [4,6]. The Laplace transform changes the addition of random variables into the multiplication of those variables, and it changes the multiplication of random variables into the addition of those variables [13]. Logarithmic scaling can be thought of as the expression of multiplicative processes, and linear scaling can be thought of as the expression of additive processes.
The Laplace transform, by changing multiplication into addition, transforms log scaling into linear scaling, and by changing addition into multiplication, transforms linear scaling into log scaling. Thus, log-linear scaling changes to linear-log scaling. The inverse Laplace transform works in the opposite direction, changing linear-log scaling into log-linear scaling.
The fact that the Laplace transform connects two of the most important scaling relations is interesting. However, what does it mean in terms of reading and understanding common probability patterns? The following sections suggest one possibility.
10. Alternative Descriptions of Generative Process
We often wish to associate an observed probability pattern with the underlying generative process. The generative process may dissipate information directly on the measurement scale associated with the observed probability pattern. Or, the generative process may dissipate information on a different scale, but we observe the pattern on a transformed scale.
Consider, as an example, the Laplace duality between the linear-log and log-linear scales in Equation (26). Suppose that we observe the gamma pattern of log-linear scaling. We wish to associate that observed gamma pattern with the underlying generative process.
The generative process may directly create a log-linear scaling pattern. The classic example concerns waiting time for the kth independent event. For small times, the k events must happen nearly simultaneously. As noted earlier, the probability that multiple independent events occur essentially simultaneously is the product of the probabilities for each event to occur. Multiplication leads to power law expressions and logarithmic scaling. Thus, at small magnitudes, the change in information scales with the change in the logarithm of time.
By contrast, at large magnitudes, after much time has passed, either the kth event has already happened, and the waiting is already over, or k − 1 events have happened, and we are waiting only for the last event. Because we are waiting for a single event that occurs with equal probability in any time interval, the scaling of information with magnitude is linear. Thus, the classic waiting time problem expresses a generative model that has log-linear scaling.
Any process that scales log-linearly tends to the gamma pattern by the dissipation of all other information. The only requirement is that, in the aggregate, small magnitude events associate with underlying multiplicative combinations of probabilities, and large event magnitudes associate with additive combinations.
In this case, we move from underlying process to observed pattern: a process tends to scale log-linearly, and dissipation of information on that scale shapes pattern into the gamma distribution form. However, often we are concerned with the inverse problem. We observe the log-linear gamma pattern, and we want to know what process caused that pattern.
The duality of the log-linear and linear-log scales in Equation (26) means that a generative process could occur on the linear-log scale, but we may observe the resulting pattern on the log-linear scale. For example, the number of events per unit time (frequency) may combine in a linear, additive way at small frequencies and in a multiplicative, logarithmic way at large frequencies. That linear-log process would often converge to a Lomax distribution of frequency pattern, or to a Student’s distribution if we measure squared deviations, fy = y2. If we observe the outcome of that process in terms of the inverted units of time per event, those inverted dimensions lead to log-linear scaling and a gamma pattern, or to a gamma pattern with a Gaussian tail if we measure squared deviations.
Is it meaningful to say that the generative process and dissipation of information arise on a linear-log scale of events per unit time, but we observe the pattern on the log-linear scale of time per event? That remains an open question.
On the one hand, the scaling relations and dissipation of information contain exactly the same information whether on the linear-log or log-linear scales. That equivalence suggests a single underlying generative process that may be thought of in alternative ways. In this case, we may consider constraints on average frequency or, equivalently, constraints on average time. More generally, constraints on either of a dual pair of scales with inverted dimensions would be equivalent.
On the other hand, the meaning of constraint by average value may make sense only on one of the scales. For example, it may be meaningful to consider only the average waiting time for an event to occur. That distinction suggests that we consider the underlying generative process strictly in terms of the log-linear scale. However, if our observations of pattern are confined to the inverse frequency scale, then the observed linear-log scaling would only be a reflection of the true underlying process on the dual log-linear scale.
All paired scales through integral transformation pose the same issues of duality and interpretation with regard to the connection between generative process and observed pattern.
11. Reading Probability Distributions
In this section, I recap the four components of probability patterns. A clear sense of those four components allows one to read the mathematical expressions of probability distributions as sentences about underlying process.
The four components are: the dissipation of all information; except the preservation of average values; taken over the measurement scale that relates changes in observed values to changes in information; and the transformation from the underlying scale on which information dissipates to alternative scales on which probability pattern may be expressed.
Common probability patterns arise from those four components, described in Equation (8) by
py ∝ mye−λTf
I show how to read probability distributions in terms of the four components and this general expression. To illustrate the approach, I parse several commonly observed probability patterns. This section mostly repeats earlier results, but does so in an alternative way to emphasize the simplicity of form in common probability expressions.
11.1. Linear Scale
The exponential and Gaussian are perhaps the most common of all distributions. They have the form
py ∝ e−λfy
The exponential case, fy = y, corresponds to the preservation of the average value, y. The Gaussian case, fy = (y − μ)2, preserves the average squared distance from the mean, which is the variance. For convenience, I often set μ = 0 and write fy = y2 for the squared distance. The exponential and Gaussian express the dissipation of information and preservation of average values on a linear scale. We use either the average value itself or the average squared distance from the mean.
11.2. Combinations of Linear and Log Scales
Purely linear scaling is likely to be rare over a sufficiently wide range of magnitudes. For example, one naturally plots geographic distances on a linear scale, but very large cosmological distances on a logarithmic scale.
On a geographic scale, an increment of an additional meter in distance can be measured directly anywhere on earth. The equivalent measurement information obtained at any geographic distance leads to a linear scale.
By contrast, the information that we can obtain about meter-scale increments tends to decrease with cosmological distance. The declining measurement information obtained at increasing cosmological distance leads to a logarithmic scale.
The measurement scaling of distances and other quantities may often grade from linear at small magnitudes to logarithmic at large magnitudes. The linear-log scale is given by Tf = log(1 + fy/α). Using that measurement scale in Equation (28), with my = 1 and λ = k, we obtain
py ∝ (1 + fy/α)−k
When fy is small relative to α, we get the standard exponential form of linear scaling in Equation (29), which corresponds to the exponential or Gaussian pattern. The tail of the distribution, with fy greater than α, is a power law in proportion to fy−k. An exponential pattern with a power law tail is the Lomax or Pareto type II distribution. A Gaussian with a power law tail is the generalized Student’s distribution.
If one measures observations over a sufficiently wide range of magnitudes, many apparently exponential or Gaussian distributions will likely turn out to have the power law tails of the Lomax or generalized Student’s forms. Similarly, observed power law patterns may often turn out to be exponential or Gaussian at small magnitudes, also leading to the Lomax or generalized Student’s forms.
Other processes lead to the inverse log-linear scale, which changes logarithmically at small magnitudes and linearly at large magnitudes. The log-linear scale is given by Tf = log(fy) − bfy, in which b determines the transition between log scaling at small magnitudes and linear scaling at large magnitudes. Using that measurement scale in Equation (28) with my = 1 and fy = y, and adjusting the parameters to match earlier notation, we obtain the gamma distribution
py ∝ yk−1e−αy
which is a power law with logarithmic scaling for small magnitudes and an exponential with linear scaling for large magnitudes. The gamma distribution includes as a special case the widely used chi-square distribution. Thus, the chi-square pattern is a particular instance of log-linear scaling.
If we use the log-linear scale for squared deviations from zero, fy = y2, then we obtain
py ∝ yk−1e−αy2
which is a gamma pattern with a Gaussian tail, expressing log-linear scaling with respect to squared deviations. For k = 2, this is the well-known Rayleigh distribution.
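As an illustrative generative route to that k = 2 case (my example, not the article's): the length of a two-dimensional Gaussian vector follows the Rayleigh form y e−αy2.

```python
# The norm of a 2-D standard Gaussian vector is Rayleigh distributed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
xy = rng.normal(0.0, 1.0, size=(100_000, 2))
r = np.linalg.norm(xy, axis=1)
print(stats.kstest(r, "rayleigh", args=(0.0, 1.0)).statistic)   # small: close fit
```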
In some cases, information scales logarithmically at both small and large magnitudes, with linearity dominating at intermediate magnitudes [20]. In a log-linear-log scale, precision at the extremes may depend more strongly on magnitude, or there may be a saturating tendency of process at extremes that causes relative scaling of information with magnitude. Relative scaling corresponds to logarithmic measures.
Commonly observed log-linear-log patterns often lead to the beta family of distributions [4]. For example, we can modify the basic linear-log scale, Tf = log(1 + y/α), by adding a logarithmic component at small magnitudes, yielding the scale Tf = b log(y) − log(1 + y/α), for b = γ/k, which leads to a variant of the beta-prime distribution. This distribution can be read as a linear-log Lomax distribution, (1 + y/α)−k, with an additional log scale power law component, yγ, that dominates at small magnitudes. Other forms of log-linear-log scaling often lead to variants from the beta family.
11.3. Direct Change of Scale
In many cases, process dissipates information and preserves average values on one scale, but we observe or analyze data on a different scale. When the scale change arises by simple substitution of one variable for another, the form of the probability distribution is easy to read if one directly recognizes the scale of change. Here, I repeat my earlier discussion for the way in which one reads the commonly observed log-normal distribution. Other direct scale changes follow this same approach.
If process causes information to dissipate on a scale x, preserving only the average squared distance from the mean (the variance), then x tends to follow the Gaussian pattern
px ∝ e−λ(x−μ)2
in which the mean of x is μ, and the variance is 1/2λ. If the scale, x, on which information dissipates is logarithmic, but we observe or analyze data on a linear scale, y, then x = log(y). The value of my in Equation (8) is the change in x with respect to y, yielding d log(y)/dy = y−1. Thus, the distribution on the y scale is
py ∝ y−1e−λ(log(y)−μ)2
which is simply the Gaussian pattern for log(y), corrected by my = y−1 to account for the fact that dissipation of information and constraint of average value are happening on the logarithmic scale, log(y), but we are analyzing pattern on the linear scale of y. Other direct changes of scale can be read in this way.
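A simulation sketch of this reading (my own illustrative numbers): multiplying many small random perturbations makes information dissipate on the log scale, so log(y) is approximately Gaussian and y itself follows the log-normal form above.

```python
# Products of many small multiplicative shocks: log(y) is approximately Gaussian.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 200, 50_000
perturb = rng.uniform(0.9, 1.1, size=(reps, n))   # small multiplicative perturbations
log_y = np.log(perturb.prod(axis=1))
print(stats.skew(log_y), stats.kurtosis(log_y))   # both near 0: roughly Gaussian on the log scale
```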
11.4. Extreme Values and Exponential Scaling
Extreme values arise from the probability of observing a magnitude beyond some threshold. Probabilities beyond a threshold depend on the cumulative probability of all values beyond the cutoff. For an initially linear scale with fx = x, cumulative tail probabilities typically follow the generic form e−λx or, simplifying by using λ = 1, the exponential form e−x. The cumulative tail probabilities above a threshold, y, define the scaling relation between x and y, as
x = e−y
Thus, extreme values that depend on tail probabilities tend to define an exponential scaling, x = e−y = Tf. Because we have changed the scale from the cumulative probabilities, x, to the probability of some threshold, y, that determines the extreme value observed, we must account for that change of scale by my = |T′f| = e−y, where the prime is the derivative with respect to y. Using Equation (8) for the generic method of direct change in scale, and using the form of my here for the change from the cumulative scale of tail probabilities to the direct scaling of threshold values, we obtain the general form of the extreme value distributions as
py ∝ |T′f| e−λTf
In this simple case, Tf = e−y, thus
py ∝ exp(−y − λe−y)
a form of the Gumbel extreme value distribution. Note that this form is just a direct change from linear to exponential scaling, x = e−y.
Alternatively, we can obtain the same Gumbel form by any process that leads to exponential-linear scaling of the form λT(y) = y + λe−y, in which the exponential term dominates for small values and the linear term dominates for large values. That scaling leads directly to the distribution
py ∝ exp(−y − λe−y)
The probability of a small value being the largest extreme value decreases exponentially in y, leading to the double exponential term exp(−λe−y) dominating the probability. By contrast, the probability of observing large extreme values decreases linearly in y, leading to the exponential term e−y dominating the probability.
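The Gumbel form can be checked by simulation (an illustration with arbitrary sample sizes, not from the article): maxima of large samples from an exponential variable, recentered by log(n), closely follow the standard Gumbel distribution.

```python
# Maxima of exponential samples, recentered by log(n), approach the Gumbel form.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 1_000, 10_000
maxima = rng.exponential(1.0, size=(reps, n)).max(axis=1)
print(stats.kstest(maxima - np.log(n), "gumbel_r").statistic)   # small deviation
```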
11.6. Lévy Stable Distributions
Another important family of common distributions arises by a similar scaling duality
(1 + y2/φ2)−1 ∝ ∫ e−xiye−φ|x| dx
Consider each part in relation to the Laplace pair in Equation (21). The left side is the Cauchy distribution, a special case of the linear-log generalized Student’s distribution with k = 1 and α = φ2. On the right, e−φ|x| is a symmetric exponential distribution, because e−φx is the classic exponential distribution for x > 0, and eφx for x < 0 is the same distribution reflected about the x = 0 axis. The two distributions together form a new distribution over all positive and negative values of x.
Each positive and negative part of the symmetric exponential, by itself, expresses linearity in x. However, the sharp switch in direction and the break in smoothness at x = 0 induce a quasi-logarithmic scaling at small magnitudes, which corresponds to the linearity at small magnitudes in the transformed domain of the Cauchy distribution.
In this case, the integral transform is Fourier rather than Laplace, using the transformation kernel e−xiy over all positive and negative values of x. For our purposes, we can consider the consequences of the Laplace and Fourier transforms as similar with regard to inverting the dimensions and scaling relations between a pair of measurement scales.
The Cauchy distribution is a particularly important probability pattern. In one simple generative model, the Cauchy arises by the same sort of summing up of random perturbations and dissipation of information that leads to the Gaussian distribution by the central limit theorem. The Cauchy differs from the Gaussian because the underlying random perturbations follow logarithmic scaling at large magnitudes.
Log scaling at large magnitudes causes power law tails, in which the distributions of the underlying random perturbations tend to have the form 1/|x|1+γ at large magnitudes of x. When the tail of a distribution has that form, then the total probability in the tail above magnitudes of |x| is approximately 1/|x|γ. The Cauchy is the particular distribution with γ = 1. Thus, one way to generate a Cauchy is to sum up random perturbations and constrain the average total probability in the tail to be 1/|x|.
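A simulation sketch of the stability behind this statement (sample sizes are my own illustrative choices): sums, or equivalently means, of Cauchy-distributed perturbations retain the Cauchy form rather than converging to a Gaussian.

```python
# Means of Cauchy perturbations stay Cauchy, reflecting the heavy 1/|x|^2 tails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 100, 50_000
means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
print(stats.kstest(means, "cauchy").statistic)   # small: still standard Cauchy
```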
Note that the constraint on the average tail probability of 1/|x| for the Cauchy distribution on the left side of Equation (30) corresponds, in the dual domain on the right side of that equation, to e−φ|x|, in which the measurement scale is Tf = |x|. The average of the scaling Tf corresponds to the preserved average constraint after the dissipation of information. In this case, the dual domain preserves only the average of |x|. Thus the dual scaling domains preserve the average of |x| in the symmetric exponential domain and the average total tail probability of 1/|x| in the dual Cauchy domain.
We can express a more general duality that includes the Cauchy as a special case by
py ∝ ∫ e−xiye−φ|x|γ dx
The only difference from Equation (30) is that in the symmetric exponential, I have written |x|γ. The parameter γ creates a power law scaling Tf = |x|γ, which corresponds to a distribution that is sometimes called a stretched exponential.
The distribution in the dual domain, py, is a form of the Lévy stable distribution. That distribution does not have a mathematical expression that can be written explicitly. The Lévy stable distribution, py, can be generated by dissipating all information by summation of random perturbations while constraining the average of the total tail probability to be 1/|x|γ for γ < 2. For γ = 1, we obtain the Cauchy distribution. When γ = 2, the distributions in both domains become Gaussian, which is the only case in which domains paired by Laplace or Fourier transform inversion have the same distribution.
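For numerical work, scipy exposes this family as levy_stable, whose stability index plays the role of γ here. The check below is my own illustration of the two named special cases: γ = 1 matches the Cauchy, and γ = 2 matches a Gaussian with variance 2.

```python
# Special cases of the Lévy stable family: index 1 is Cauchy, index 2 is Gaussian.
import numpy as np
from scipy import stats

x = 1.3
print(stats.levy_stable.pdf(x, 1.0, 0.0), stats.cauchy.pdf(x))
print(stats.levy_stable.pdf(x, 2.0, 0.0), stats.norm.pdf(x, scale=np.sqrt(2)))
```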
Note that the paired scales in Equation (31) match a constraint on the average of |x|γ with an inverse constraint on the average tail probability, 1/|x|γ. Here, γ is not necessarily an integer, so the average of |x|γ can be thought of as a fractional moment in the stretched exponential domain that pairs with the power law tail in the inverse Lévy domain [3].
12. Relations between Probability Patterns
I have shown how to read probability distributions as statements about the dissipation of information, the constraint on average values, and the scaling relations of information and measurement. Essentially all common distributions have the form given in Equation (8) as
py ∝ mye−λTf
Dissipation of information and constraint on average values set the e−λfy form. Scaling measures transform the observables, fy, to Tf = T(fy). The term my accounts for changes between dissipation of information on one scale and measurement of final pattern on a different scale.
The scaling measures, Tf, determine the differences between probability patterns. In this section, I discuss the scaling measures in more detail. What defines a scaling relation? Why are certain common scaling measures widely observed? How are the different scaling measures connected to each other to form families of related probability distributions?
12.1. Invariance and Common Scales
The form of the maximum entropy distributions influences the commonly observed scales and associated probability distributions [4,6]. In particular, we obtain the same distribution in Equation (32) for either the measurement function Tf or the affine transformed measurement function Tf ↦ a + bTf. An affine transformation shifts the variable by the constant a and multiplies it by the constant b.
The shift by a changes the constant of proportionality
e−λ(a+Tf) = ξe−λTf
in which ξ = e−λa. In maximum entropy, the final proportionality constant always adjusts to satisfy the constraint that the total probability is one, as in Equation (2). Thus, the final adjustment of total probability erases any prior multiplication of the distribution by a constant. A shift transformation of Tf does not change the associated probability pattern.
Multiplication by b also has no effect on probability pattern, because
e−λbTf = e−λ′Tf
for λ′ = λb. In maximum entropy, the final value of the constant multiplier for Tf always adjusts so that the average value of Tf satisfies an extrinsic constraint, as given in Equation (6).
Thus, maximum entropy distributions are invariant to affine transformations of the measurement scale. That affine invariance shapes the form of the common measurement scales. In particular, consider transformations of the observables, G(fy), such that
T[G(fy)] = a + bT(fy)
Any scale, T, that satisfies this relation causes the transformed scale T[G(fy)] to yield the same maximum entropy probability distribution as the original scale Tf = T(fy).
For example, suppose our only information about a probability distribution is that its form is invariant to a transformation of the observable values fy by a process that changes fy to G(fy). Then it must be that the scaling relation of the measurement function Tf satisfies the invariance in Equation (33). By evaluating how that invariance sets a constraint on Tf, we can find the form of the probability distribution.
The classic example concerns the invariance of logarithmic scaling to power law transformation [21]. Let T(y) = log(y) and G(y) = cyγ. Then by Equation (33), we have
T[G(y)] = log(cyγ) = log(c) + γ log(y) = a + bT(y)
with a = log(c) and b = γ, which demonstrates that logarithmic scaling is affine invariant to power law transformations of the form cyγ, in which affine invariance means that the scaling relation T and the associated transformation G satisfy Equation (33).
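A small numerical check of that example (constants are my own illustrative choices): composing T(y) = log(y) with the power law transformation G(y) = cyγ changes T only by an affine map, so after renormalization and rescaling of the multiplier the maximum entropy form is unchanged.

```python
# Affine invariance of the log scale under a power law transformation of observables.
import numpy as np

y = np.linspace(0.01, 10.0, 1000)
c, gamma, lam = 2.0, 3.0, 1.5

def normalize(p):
    return p / (p.sum() * (y[1] - y[0]))

T  = np.log(y)
TG = np.log(c * y**gamma)                        # = log(c) + gamma*log(y)
p_T  = normalize(np.exp(-lam * T))
p_TG = normalize(np.exp(-(lam / gamma) * TG))    # multiplier rescaled by 1/gamma
print(np.allclose(p_T, p_TG))                    # True: same distribution
```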
12.2. Affine Invariance of Measurement Scaling
Put another way, a scaling relation, T, is defined by the transformations, G, that leave unchanged the information in the observables with respect to probability patterns. In maximum entropy distributions, unchanged means affine invariance. This affine invariance of measurement scaling in probability distributions is so important that I like to write the key expression in Equation (33) in a more compact and memorable form
T ~ T ○ G
Here, the circle means composition of functions, such that T ○ G = T[G(fy)], and the symbol “~” for similarity means equivalent with respect to affine transformation. Thus, the right side of Equation (33) is similar to T with respect to affine transformation, and the left side of Equation (33) is equivalent to T ○ G. Reversing sides of Equation (33) and using “~” for affine similarity leads to Equation (35).
Note, from Equation (11) and Equation (33), that Sy = T ○ G, showing that the information in a probability distribution, Sy, is invariant to affine transformation of T. Thus, we can also write
Sy ~ T ~ T ○ G
which emphasizes the fundamental role of invariant information in defining the measurement scaling, T, and the associated form of probability patterns.
12.3. Base Scales and Notation
Earlier, I defined fy = f(y) as an arbitrary function of the variable of interest, y. I have used either y or y2 or (y − μ)2 for fy to match the classical maximum entropy interpretation of average values constraining either the mean or the variance.
To express other changes in the underlying variable, y, I introduced the measurement functions or scaling relations, Tf = T(fy). In this section, I use an expanded notation to reveal the structure of the invariances that set the forms of scaling relations and probability distributions [4]. In particular, let
w = w(fy)
be a function of fy. Then, for example, we can write an exponential scaling relation as T(fy) = eβw. We may choose a base scale, w, such as a linear base scale, w(fy) = fy, or a logarithmic base scale, w(fy) = log(fy), or a linear-log base scale, w(fy) = log(1 + fy/α), or any other base scale. Typically, simple combinations of linear and log scaling suffice. Why such simple combinations suffice is an essential unanswered question, which I discuss later.
Previously, I have referred to fy as the observable, in which we are interested in the distribution of y but only collect statistics on the function fy. Now, we will consider w = w(fy) as the observable. We may, for example, be limited to collecting data on w = log(fy) or on measurement functions T(fy) that can be expressed as functions of the base scale w. We can always revert to the simpler case in which w = fy or w = y.
In the following sections, the expanded notation reveals how affine invariance sets the structure of scaling relations and probability patterns.
12.4. Two Distinct Affine Relations
All maximum entropy distributions satisfy the affine relation in Equation (33), expressed compactly in Equation (35). In that general affine relation, any measurement function, T, could arise, associated with its dual transformation, G, to which T is affine invariant. That general affine relation does not set any constraints on which measurement functions T may occur, although the general affine relation may favor certain scaling relations to be relatively common.
By contrast with the general affine form T ~ T ○ G, for any T and its associated G, we may consider how specific forms of G determine the scaling, T. Put another way, if we require that a probability pattern be invariant to transformations of the observables by a particular G, what does that tell us about the form of the associated scaling relation, T, and the consequent probability pattern?
Here we must be careful about potential confusion. It turns out that an affine form of G is itself important, in which, for example, G(w) = δ + θw. That specific affine choice for G is distinct from the general affine form of Equation (35). With that in mind, the following sections explore the consequences of an affine transformation, G, or a shift transformation, which is a special case of an affine transformation.
12.5. Shift Invariance and Generalized Exponential Measurement Scales
Suppose we know only that the information in probability patterns does not change when the observables undergo shift transformation, such that G(w) = δ + w. In other words, the form of the measurement scale, T, must be affine invariant to adding a constant to the base values, w. A shift transformation is a special case of an affine transformation G(w) = δ + θw, in which the affine transform becomes strictly a shift transformation for the restricted case of θ = 1.
The exponential scale
Tf = eβw
maintains the affine invariance in Equation (33) to a shift transformation, G. If we apply a shift transformation to the observables, w ↦ δ + w, then the exponential scale becomes eβ(δ+w), which is equivalent to beβw for b = eβδ. We can ignore the constant multiplier, b; thus, the exponential scale is shift invariant with respect to Equation (33).
Using the shift invariant exponential form for Tf, the maximum entropy distributions in Equation (32) become
py ∝ my exp(−λeβw)
This exponential scaling has a simple interpretation. Consider the example in which w is a linear measure of time, y, and β is a rate of exponential growth (or decay). Then the measurement scale, Tf, transforms each underlying time value, y, into a final observable value after exponential growth, eβy. The random time values, y, become random values of final magnitudes, such as random population sizes after exponential growth for a random time period. In general, exponential growth or decay is shift invariant, because it expresses a constant rate of change independently of the starting point.
If the only information we have about a scaling relation is that the associated probability pattern is shift invariant to transformation of observables, then exponential scaling provides a likely measurement function, and the probability distribution may often take the form of Equation (37).
The Gumbel extreme value distribution in Equation (25) follows exponential scaling. In that case, the underlying observations, y, are transformed into cumulative exponential tail probabilities that, in aggregate, determine the probability that an observation is the extreme value of a sample. The exponential tail probabilities are shift invariant, in the sense that a shifted observation, δ + y, also yields an exponential tail probability. The magnitude of the cumulative tail probability changes with a shift, but the exponential form does not change.
12.6. Affine Duality and Linear Scaling
Suppose probability patterns do not change when observables undergo an affine transformation G(w) = δ + θw. Affine transformation of observables allows a broader range of changes than does shift transformation. The broader the range of allowable transformations of observables, G, the fewer the measurement functions, T, that will satisfy the affine invariance in Equation (33). Thus, affine transformation of observables leads to a narrower range of compatible measurement functions than does shift transformation.
When G is affine with θ ≠ 1, then the associated measurement function Tf must itself be affine. Because Tf is invariant to shift and multiplication, we can say that invariance to affine G means that Tf = w, and thus the maximum entropy probability distribution in Equation (32) becomes linear in the base measurement scale, w, as
py ∝ mye−λw
This form follows when the probability pattern is invariant to affine transformation of the observables, w. By contrast, invariance to a shift transformation of the observables leads to the broader class of distributions in Equation (37), of which Equation (38) is a special case for the more restrictive condition of invariance to affine transformation of observables.
To understand the relation between affine and shift transformations of observables, G, it is useful to write the expression for the measurement function in Equation (36) more generally as
T_f = (e^{βw} − 1)/β,    (39)
noting that we can make any affine transformation of a measurement function, T_f ↦ a + bT_f, without changing the associated probability distribution. With this new measurement function for shift invariance, as β → 0, T_f → w, and we recover the measurement function associated with affine G.
Suppose, for example, that we interpret β as a rate of exponential change in the underlying observable, w, before the final measurement. Then, as β → 0, the underlying observable and the final measurement become equivalent, T_f → w, because e^{βw} → 1 + βw as β → 0, and thus (e^{βw} − 1)/β → w.
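A quick numerical check of this limit, using arbitrary values of w and a decreasing sequence of β, is sketched below.

```python
# Numerical check that T(w) = (exp(beta*w) - 1)/beta approaches the linear scale w as
# beta -> 0; np.expm1 keeps the computation stable for small beta. Values are arbitrary.
import numpy as np

w = np.linspace(0.0, 3.0, 7)
for beta in [1.0, 0.1, 0.001]:
    T = np.expm1(beta * w) / beta
    print(beta, np.max(np.abs(T - w)))  # deviation from the linear scale shrinks with beta
```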
12.7. Exponential and Gaussian Distributions Arise from Affine Invariance
Suppose we know only that the information in probability patterns does not change when the observables undergo an affine transformation, w ↦ δ + θw. The invariance of probability pattern to affine transformation of observables leads to distributions of the form in Equation (38). Thus, if the observable is the underlying value, w = y, then the probability distribution is exponential; and if the observable is y², the squared distance of the underlying value from its mean, then the probability distribution is Gaussian.
By contrast, if the probability pattern is invariant to a shift of the observables, but not to an affine transformation of the observables, then the distribution falls into the broader class based on exponential measurement functions in Equation (37).
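The following sketch illustrates these two special cases numerically: normalizing exp(−lam·y) on y ≥ 0 recovers the exponential density, and normalizing exp(−lam·y²) recovers a Gaussian density. The constant lam stands in for the multiplier associated with the constraint on average value; its numerical value is an arbitrary illustration.

```python
# Illustrative check: normalizing exp(-lam*y) on y >= 0 recovers the exponential density
# with rate lam, and normalizing exp(-lam*y**2) on the real line recovers a Gaussian
# density with variance 1/(2*lam). The value of lam is arbitrary.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

lam = 1.3

y = np.linspace(0.0, 10.0, 2001)    # support of the exponential case
q = np.exp(-lam * y)
q /= trapezoid(q, y)                # numerical normalization
print(np.allclose(q, stats.expon(scale=1/lam).pdf(y), atol=1e-4))

x = np.linspace(-10.0, 10.0, 4001)  # support of the Gaussian case
g = np.exp(-lam * x**2)
g /= trapezoid(g, x)
print(np.allclose(g, stats.norm(scale=np.sqrt(1/(2*lam))).pdf(x), atol=1e-4))
```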
13. Hierarchical Families of Measurement Scales and Distributions
The general form for probability distributions in Equation (37) arises from a base measurement scale, w, and shift invariance of the probability pattern to the changes w ↦ δ + w. Each base scale, w, defines a family of related probability distributions, including the linear form of Equation (38) as a special case when the probability pattern is invariant to the affine changes w ↦ δ + θw, which corresponds to β → 0 in Equation (39).
We may consider a variety of base scales, w, creating a variety of distinct measurement scales and families of distributions. Ultimately, we must consider how the base scales arise. However, it is useful first to study the commonly observed base scales. The relations between these common base scales form a hierarchical pattern of measurement scales and probability distributions [4].
13.1. A Recursive Hierarchy for the Base Scale
The base scales associated with common distributions typically arise as combinations of linear and logarithmic scaling. For example, the linear-log scale can be defined by log(c + x). This scale changes linearly in x when x is much smaller than c and logarithmically in x when x is much larger than c. As c → 0, the scale becomes almost purely logarithmic, and for large c, the scale becomes almost purely linear.
We can generate a recursive hierarchy of linear-log scale deformations by
w^{(i)} = log(c_i + w^{(i−1)}).
The hierarchy begins with w^{(0)} = f_y, in which f_y denotes our underlying observable. Recursive expansion of the hierarchy yields: a linear scale, w^{(0)} = f_y; a linear-log deformation, w^{(1)} = log(c_1 + f_y); a linear-log deformation of the linear-log scale, w^{(2)} = log(c_2 + log(c_1 + f_y)); and so on. A log deformation of a log scale arises as a special case, in the limit c_i → 0, leading to a double log scale.
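The recursion is easy to express directly. The sketch below builds the first levels of the linear-log hierarchy for deformation constants c_1 and c_2 chosen purely for illustration.

```python
# Sketch of the recursive linear-log hierarchy w^(i) = log(c_i + w^(i-1)), starting from
# the linear base scale w^(0) = f_y. The constants c_1, c_2 are arbitrary illustrations.
import numpy as np

def linear_log_hierarchy(fy, cs):
    """Return [w^(0), w^(1), ..., w^(n)] for deformation constants cs = [c1, ..., cn]."""
    scales = [np.asarray(fy, dtype=float)]
    for c in cs:
        scales.append(np.log(c + scales[-1]))
    return scales

fy = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
w0, w1, w2 = linear_log_hierarchy(fy, cs=[5.0, 2.0])
# w1 = log(5 + f_y): roughly linear in f_y for f_y << 5, roughly log(f_y) for f_y >> 5.
print(w1)
print(w2)
```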
Other scales, such as the log-linear scale, can be expanded in a similarly recursive manner. We may also consider log-linear-log scales and linear-log-linear scales. We can abbreviate a scale, w, by its recursive deformation and by its level in a recursive hierarchy. For example, the abbreviation linear-log^{(2)} denotes the second recursive expansion of a linear-log deformation, w^{(2)} = log(c_2 + log(c_1 + f_y)). The initial value for any recursive hierarchy, with a superscript of i = 0, associates with the base observable w^{(0)} = f_y, which I will also write as “Linear,” because the base observable is always a linear expression of the underlying observable, f_y.
13.2. Examples of Common Probability Distributions
Table ?? shows that commonly observed probability distributions arise from combinations of linear and logarithmic scaling. For example, the simple linear-log scale expresses linear scaling at small magnitudes and logarithmic scaling at large magnitudes. The distributions that associate with linear-log scaling include very common patterns.
For direct observables, f_y = y, the linear-log scale includes the purely linear exponential distribution as a limiting case, the purely logarithmic power law (Pareto type I) distribution as a limiting case, and the Lomax (Pareto type II) distribution that is exponential at small magnitudes and has a power law tail at large magnitudes.
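The Lomax case can be read directly from the linear-log scale. The sketch below uses arbitrary illustrative constants c and lam to show that exp(−lam·log(c + y)) = (c + y)^(−lam) behaves approximately exponentially for y much smaller than c and approaches a power law with log-log slope −lam for y much larger than c.

```python
# Sketch: with the linear-log scale T(y) = log(c + y), the form exp(-lam*T(y)) equals
# (c + y)**(-lam), the Lomax (Pareto type II) shape. Parameter values are arbitrary.
import numpy as np

c, lam = 10.0, 3.0
def q(y):
    return (c + y) ** (-lam)

y_small = np.array([0.01, 0.1, 0.5])
print(q(y_small))                              # nearly exponential for y << c ...
print(c**(-lam) * np.exp(-lam * y_small / c))  # ... compare with c**(-lam) * exp(-lam*y/c)

y_large = np.array([1e3, 1e4, 1e5])
slopes = np.diff(np.log(q(y_large))) / np.diff(np.log(y_large))
print(slopes)                                  # log-log slopes approach -lam (power law tail)
```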
For observables that measure the squared distance of fluctuations from a central location, f_y = (y − μ)², or y² for simplicity, the linear-log scale includes the purely linear Gaussian (normal) distribution as a limiting case, and the generalized Student's distribution that is a Gaussian linear pattern for small deviations from the central location and grades into a logarithmic power law pattern in the tails at large deviations.
Most of the commonly observed distributions arise from other simple combinations of linear and logarithmic scaling. To mention just two further examples among the many described in Table ??, the log-linear scale leads to the gamma distribution, and the log-linear-log scale leads to the commonly observed beta distribution.
14. Why Do Linear and Logarithmic Scales Dominate?
Processes in the natural world often cause highly nonlinear transformations of inputs into outputs. Why do those complex nonlinear transformations typically lead in the aggregate to simple combinations of linear and logarithmic base scales? Several possibilities exist [20]. I mention a few in this section. However, I do not know of any general answer to this essential question. A clear answer would greatly enhance our understanding of the commonly observed patterns in nature.
14.2. Common Arithmetic Operations Lead to Common Scaling Relations
Perhaps linear and logarithmic scaling reflect aggregation by addition or multiplication of fluctuations. Adding fluctuations often tends in the limit to a smooth linear scaling relation. Multiplying fluctuations often tends in the limit to a smooth logarithmic scaling relation.
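A simple simulation, with an arbitrary pool of positive fluctuations, illustrates this contrast: sums of the fluctuations look symmetric on a linear scale, whereas products look symmetric only after a logarithmic transformation.

```python
# Illustration: additive aggregation of positive fluctuations is roughly symmetric on a
# linear scale, while multiplicative aggregation is roughly symmetric only on a log scale.
# The fluctuation pool (uniform on [0.5, 1.5]) and the seed are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
fluctuations = rng.uniform(0.5, 1.5, size=(10000, 100))

sums = fluctuations.sum(axis=1)        # aggregation by addition
products = fluctuations.prod(axis=1)   # aggregation by multiplication

print(stats.skew(sums))                # near zero: smooth on the linear scale
print(stats.skew(products))            # strongly positive: skewed on the linear scale
print(stats.skew(np.log(products)))    # near zero again after the log transformation
```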
Consider the basic log-linear scale that leads to the gamma distribution. A simple generative model for the gamma distribution arises from the waiting time for the kth event to occur. At time zero, no events have occurred.
At small magnitudes of time, the occurrence of all k events requires essentially simultaneous occurrence of all of those events. Nearly simultaneous occurrence happens roughly in proportion to the product of the probabilities that each single event occurs in a small time interval. Multiplication associates with logarithmic scaling.
At large magnitudes of time, either all k events have occurred, or in most cases k − 1 events have occurred and we wait only for the last event. The waiting time for a single event follows the exponential distribution associated with linear scaling. Thus, the waiting time for k events naturally follows a log-linear pattern.
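The waiting-time argument can be checked by simulation. The sketch below sums k exponential waiting times and compares the result with the corresponding gamma distribution; the values of k, the rate, and the seed are arbitrary illustrations.

```python
# Simulation sketch: the waiting time for the k-th of k rate-limiting events, each with an
# exponential waiting time, follows a gamma distribution. k, the rate, and the seed are
# arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
k, rate, reps = 4, 0.5, 50000
waits = rng.exponential(scale=1/rate, size=(reps, k)).sum(axis=1)

qs = [0.1, 0.5, 0.9]
print(np.quantile(waits, qs))                  # empirical quantiles of the waiting times
print(stats.gamma(a=k, scale=1/rate).ppf(qs))  # gamma quantiles with shape k
```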
Any process that requires simultaneity at extreme magnitudes leads to logarithmic scaling at those limits. Thus, a log-linear-log scale may be a very common underlying pattern. Special cases include log-linear, linear-log, purely log, and purely linear. For those variant patterns, the actual extreme tails may be logarithmic, although difficulty observing the extreme tail pattern may lead to many cases in which a linear tail is a good approximation over the range of observable magnitudes.
Other aspects of aggregation and limiting processes may also lead to the simple and commonly observed scaling relations. For example, fractal theory provides much insight into logarithmic scaling relations [27,28]. However, I do not know of any single approach that matches the simplicity of the commonly observed combinations of linear and logarithmic scaling patterns to a single, simple underlying theory.
The invariances associated with simple scaling patterns may provide some clues. As noted earlier, shift invariance associates with exponential scaling, and affine invariance associates with linear scaling. It is easy to show that power law invariance associates with logarithmic scaling. For example, in the measurement scale invariance expression given in Equation (33), the invariance holds for a log scale, T(y) = log(y), in relation to power law transformations of the observables, G(y) = cy^γ, as shown in Equation (34).
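The numerical check below confirms this relation for arbitrary values of c and γ: applying the log scale to a power law transformation of the observables yields an affine function of the log scale applied to the original observables.

```python
# Numerical check that the log scale satisfies the affine invariance under power law
# transformations: log(c * y**gamma) = log(c) + gamma*log(y). The values of c and gamma
# are arbitrary illustrations.
import numpy as np

c, gamma = 3.0, 2.5
y = np.linspace(1.0, 100.0, 50)

lhs = np.log(c * y**gamma)           # T(G(y)) with T = log and G(y) = c*y**gamma
rhs = np.log(c) + gamma * np.log(y)  # a + b*T(y) with a = log(c), b = gamma
print(np.allclose(lhs, rhs))         # True
```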
We may equivalently say that a scaling relation satisfies power law invariance or that a scaling relation is logarithmic. Noting the invariance does not explain why the scaling relation and the associated invariance are common, but it does provide an alternative and potentially useful way in which to study the problem of commonness.
15. Asymptotic Invariance
The measurement functions, T, that define maximum entropy distributions satisfy the affine invariance given in Equation (35), repeated here as
T(G(y)) = a + bT(y).    (43)
One can think of G as an input-output function that transforms observations in a way that does not change information with respect to probability pattern.
Most of the commonly observed probability patterns have a simple form, associated with a simple measurement function composed of linear, logarithmic, and exponential components. I have emphasized the open problem of why the measurement functions, T, tend to be confined to those simple forms. That simplicity of measurement implies an associated simplicity for the form of G under which information remains invariant. If we can figure out why G tends to be simple, then perhaps we may understand the simplicity of T.
15.2. Invariance in the Limit
Suppose that, for a simple measurement function, T, and a complex input-output process, G, the basic invariance in Equation (43) does not hold. However, it may be that multiple rounds of processing by G ultimately lead to a relatively simple transformation of the initial inputs to the final outputs. In other words, G may be complex, but for sufficiently large n, the form of G^n may be simple [20]. This aggregate simplicity may lead in the limit to asymptotic invariance
T(G^n(y)) → a + bT(y)    (45)
as n becomes sufficiently large. It is not necessary for every G to be identical. Instead, each G may be a sample from a pool of alternative transformations. Each individual transformation may be complicated. However, in the aggregate, the overall relation between the initial inputs and final outputs may smooth asymptotically into a simple form, such as a power law. If so, then the associated measurement scale smooths asymptotically into a simple logarithmic relation.
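As a deliberately simple caricature of this idea, the sketch below composes many transformations drawn from a pool of power law input-output functions. In this constructed case the composition remains a power law at every step, so the logarithmic scale satisfies the affine invariance exactly; the pool, the number of stages, and the numerical ranges are arbitrary assumptions chosen only for illustration.

```python
# Caricature of asymptotic invariance: each processing stage is a power law transformation
# drawn from an arbitrary pool, so the composition is itself a power law and the log scale
# satisfies T(G^n(y)) = a + b*T(y) exactly at every stage.
import numpy as np

rng = np.random.default_rng(3)
y = np.linspace(1.0, 50.0, 20)

z = y.copy()
for _ in range(20):                # twenty stages sampled from the pool
    c = rng.uniform(0.8, 1.25)
    g = rng.uniform(0.9, 1.1)
    z = c * z**g                   # one stage of input-output processing

# T(G^n(y)) should be an affine function of T(y) for T = log: fit and check.
b, a = np.polyfit(np.log(y), np.log(z), 1)
print(np.allclose(np.log(z), a + b * np.log(y)))  # True up to numerical error
```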
Other aggregates of input-output processing may smooth into affine or shift transformations, which associate with linear or exponential scales, respectively. When different invariances hold at different magnitudes of the initial inputs, the measurement scale will change with magnitude. For example, a log-linear scale may reflect asymptotic power law and affine invariances at small and large magnitudes, respectively.
16. Discussion
Aggregation smooths underlying complexity into simple patterns. The common probability patterns arise by the dissipation of information in aggregates. Each additional random perturbation increases entropy until the distribution of observations takes on the maximum entropy form. That form has lost all information except the constraints on simple average values.
For each particular probability distribution, the constraint on average value arises on a characteristic measurement scale. That scaling relation, T, defines the form of the maximum entropy probability distributions, as initially presented in Equation (8), for which T = T_f. In that form, the term m_y accounts for cases in which information dissipates on one scale, but we measure probability pattern on a different scale.
The common probability distributions tend to have simple forms for T that follow linear, logarithmic, or exponential scaling at different magnitudes. The way in which those three fundamental scalings grade into each other as magnitude changes sets the overall scaling relation.
A scaling relation defines the associated maximum entropy distribution. Thus, reading a probability distribution as a statement about process reduces to reading the embedded scaling relation and trying to understand the processes that cause such scaling. Similarly, understanding the familial relations between probability patterns reduces to understanding the familial relations between different measurement scales.
The greatest open puzzle concerns why a small number of simple measurement scales dominate the commonly observed patterns of nature. I suggested that the solution may follow from the basic invariance that defines a measurement scale. Equation (35) presented that invariance as T(G(y)) = a + bT(y). The measurement scale, T, is affine invariant to transformation of the observations by G. In other words, the information in measurements with regard to probability pattern does not change whether we use the directly measured observations or we measure the observations after transformation by G, when analyzed on the scale T.
In many cases, the small scale processes, G, that transform underlying values may have complex forms. If so, then the associated scaling relation, T, might also be complex, leaving open the puzzle of why observable forms of T tend to be simple. I suggested that underlying values may often be transformed by multiple processes before ultimate measurement. Those aggregate transformations may smooth into a simple form with regard to the relation between initial inputs and final measurable outputs. If we express a sequence of n transformations as G^n, then the asymptotic invariance of the aggregate processing may be simple, in the sense that T(G^n(y)) → a + bT(y), as given by Equation (45). Here, the measurement scaling, T, and the aggregate input-output processing, G^n, are relatively simple and consistent with commonly observed patterns.
The puzzle concerns how aggregate input-output processing smooths into simple forms [20]. In particular, how does a combination of transformations lead in the aggregate to a simple asymptotic invariance?
The scaling pattern for any aggregate input-output relation may have simple asymptotic properties. The application to probability patterns arises when we embed a simple asymptotic scaling relation into the maximum entropy process of dissipating information. The dissipation of information in maximum entropy occurs as measurements are made on the aggregation of individual outputs.
Two particularly simple forms of invariance by T to input-output processing by G^n may be important. If G^n is a shift transformation, w ↦ δ + w, for some base scaling, w, then the associated measurement scale has the form T_f = e^{βw}. This exponential scaling corresponds to the fact that exponential growth or decay is shift invariant. With exponential scaling, the general maximum entropy form is the double exponential form of Equation (37). The extreme value distributions and other common distributions derive from that double exponential form. The particular distribution depends on the base scaling, w, as illustrated in Table ??.
Shift transformation is a special case of the broader class of affine transformations, w ↦ δ + θw. If G^n causes affine changes, then the broader class of input-output relations leads to a narrower range of potential measurement scales that preserve invariance. In particular, an affine measurement scale is the only scale that preserves information about probability pattern in relation to affine transformations. For maximum entropy probability distributions, we may write T_f = w for the measurement scale that preserves invariance to affine G^n, leading to the simpler form for probability distributions in Equation (38), which includes most of the very common probability distributions. Thus, the distinction between asymptotic shift and affine changes of initial base scales before potential measurement may influence the general form of probability patterns.
In summary, the common patterns of nature follow a few generic forms. Those forms arise by the dissipation of information and the scaling relations of measurement. The measurement scales arise from the particular way in which the information in a probability pattern is invariant to transformation. Information invariance apparently limits the common measurement scales to simple combinations of linear, logarithmic, and exponential components. Common probability distributions express how those component scales grade into one another as magnitude changes.