1. Introduction
In his popular expository article on universality examples in mathematics and physics, Tao [1] mentions Zipf’s empirical power law of word frequencies as lacking any “convincing explanation for how the law comes about and why it is universal”. By universality he means the idea that systems follow certain macroscopic laws that are largely independent of their microscopic details. In this article we drop down many levels from natural language to discuss two universality properties associated with the toy monkey model of Zipf’s law. Both of these properties were previously considered in Perline [2], but the presentation there was incompletely developed, and here we clarify and greatly expand upon these two ideas by showing: (1) how the model displays a nearly universal tendency towards a $-1$ exponent in its approximate power law behavior; and (2) the significance of the central limit theorem (CLT) for a random number of i.i.d. variables in the model with a finite word length cutoff. We will sometimes refer to the case with a finite word length cutoff as Monkey Twitter, as explained in Section 3. Our analysis regarding the very general tendency towards a $-1$ exponent when letter probabilities are drawn as a sample from any of an enormous class of probability distributions is a new result in the literature on this topic. Our application of the CLT with a random number of summands was first introduced with respect to the monkey model in [2], and we want to emphasize here that this is a significant aspect of the structure of the model.
Our paper is laid out as follows. In Section 2 we prove the strong tendency towards an approximate $-1$ exponent under very broad conditions by means of a limit theorem for the logarithms of random spacings due to Shao and Hahn [3]. An equivalent but relatively longer proof of this, still based on [3], has been given in [4]. In Section 3, we explain the underlying lognormal structure of the central part of the distribution of word probabilities in the case of a finite word length cutoff and show how it leads to a hybrid Zipf-lognormal distribution (called a lognormal-Pareto distribution in [2]). These are universality properties in exactly Tao’s sense because the distribution of the word probabilities (the macro behavior) is within very broad bounds independent of the details of how the letter probabilities for the monkey’s typewriter are selected (the micro behavior). In Section 4, we discuss the multiple connections between these results and well-known research from other areas where Pareto–Zipf type distributions have been investigated. In the remainder of this section we sketch some historical background.
It has been known for many years that the monkey-at-the-typewriter scheme for generating random “words” conforms approximately to an inverse power law of word frequencies and therefore roughly mimics the approximate inverse power form of Zipf’s [
5] statistical regularity for many natural languages [
6].
However, Zipf’s empirical word frequency law not only approximates a power distribution, but as he emphasized, it also very frequently exhibits an exponent in the neighborhood of $-1$. That is, letting $f_1 \ge f_2 \ge f_3 \ge \cdots$ represent the observed ranked word frequencies, Zipf found $f_r \approx A r^{-1}$, with $A$ a constant depending on the sample size. Many qualifications have been raised regarding this approximation (including the issue of the divergence of the harmonic series); yet as I.J. Good [7] commented: “The Zipf law is unreliable but it is often a good enough approximation to demand an explanation”.
Figure 1 illustrates the fascinating universal character of this word frequency law using the text from four different authors writing in four different European languages in four different centuries. The plots are shown in the log-log form that Zipf employed, and they resemble countless other examples that researchers have studied since his time. The raw text files for the four books used in the graphs were downloaded from the Project Gutenberg databank [8].
The monkey model has a convoluted history. The model can be thought of as actually embedded within the combinatoric logic of Mandelbrot’s early work providing an information-theoretic explanation of Zipf’s law. For example, in Mandelbrot [9] there is an appendix entitled “Combinatorial Derivation of the Law” (stated in his notation for the rank–frequency relation) that effectively specifies a monkey model, including a Markov version, but is not explicitly stated as such. However, his derivation there is informal, and the first completely rigorous analysis of the general monkey model with independently combined letters was given only surprisingly recently by Conrad and Mitzenmacher [10]. Their analysis utilizes analytic number theory and is, as a result, somewhat complicated—a point they note at the end of their article. A simpler analysis using only elementary methods based on the Pascal pyramid has now been given by Bochkarev and Lerner [11]. They have also analyzed the more general Markov problem [12] and hidden Markov models [13]. Edwards, Foxall and Perkins [14] have provided a directly relevant analysis in the context of scaling properties for paths on graphs, explaining how the Markov variation can generate either an approximate power law or a weaker scaling law, depending on the nature of the transition matrix.
A clarifying and important milestone in the history of this topic came from the cognitive psychologist G.A. Miller [15,16]. Miller used the simple model with equal letter probabilities and an independence assumption to give a heuristic analysis showing that this version generates a distribution of word probabilities that yields a step function approximation to an inverse power law, such that the probability $P_r$ of the $r$-th largest word probability is (in a sense that needs to be made precise) roughly of the form $P_r \approx A r^{\alpha}$, for a constant $A > 0$ and an exponent $\alpha < -1$. In addition, he made the very interesting empirical observation that by using a keyboard of $K = 26$ letters and one space character, with the space character having a probability of $s = 0.18$ (about what is seen in empirical studies of English text), the value of $\alpha$ in this case is approximately −1.06, very close to the nearly −1 value in samples of natural language text as illustrated in Figure 1. In his simplified model, it turns out that $\alpha = -\ln\bigl(K/(1-s)\bigr)/\ln K$, where $s$ is the space probability, so that $\alpha$ approaches −1 from below as K increases. (Natural logarithms are denoted $\ln$, and logarithms with any other radix will be explicitly indicated, as in $\log_K$. Note that Miller’s $\alpha$ involves a ratio of logarithms, so that it is invariant with respect to the radix used in the numerator and denominator.)
In this article we give a broad generalization of Miller’s observation to the case of unequal letter probabilities. In light of our present results given in
Section 2, the application of sample spacings from a uniform distribution in [
2] to study the approximate $-1$ exponent behavior using an asymptotic regression line should now be viewed as just a step in a direction that ultimately led us to the Shao and Hahn limit used here.
2. Proof that β Tends Towards 1 from the Asymptotics of Log-Spacings
For our analysis we specify a keyboard with an alphabet of $K$ letters $L_1, L_2, \ldots, L_K$ and a space character S. The letter characters have non-zero probabilities $p_1, p_2, \ldots, p_K$. The space character has probability s, so that $p_1 + p_2 + \cdots + p_K + s = 1$. A word is defined as any sequence of non-space letters terminating with a space. A word W of exactly n letters is a string such as $W = L_{i_1}L_{i_2}\cdots L_{i_n}S$ and has a probability of the form $P(W) = p_{i_1}p_{i_2}\cdots p_{i_n}s$ because letters are struck independently. The space character with no preceding letter character will be considered a word of length zero. The rank ordered sequence of descending word probabilities in the ensemble of all possible words is written $P_1 \ge P_2 \ge P_3 \ge \cdots$ ($P_1 = s$ is always the first and largest word probability). We break ties for words with equal probabilities by alphabetical ordering, so that each word probability has a unique rank r.
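To make this setup concrete, the following minimal Python sketch enumerates all monkey words up to a small length cutoff, computes their probabilities, and ranks them. The two-letter keyboard and the particular probability values are purely illustrative assumptions, not parameters used elsewhere in this article.

```python
import itertools

# Minimal sketch of the monkey keyboard described above.
# The letter probabilities below are illustrative; any positive values
# with sum(p) + s == 1 define a valid keyboard.
s = 0.2                      # probability of the space character
p = [0.5, 0.3]               # K = 2 letter probabilities
n_max = 12                   # enumerate all words with 0..n_max letters

word_probs = []
for n in range(n_max + 1):
    for combo in itertools.product(p, repeat=n):
        prob = s             # every word ends with the space character
        for q in combo:
            prob *= q        # multiply the letter probabilities of the word
        word_probs.append(prob)

word_probs.sort(reverse=True)    # ranked probabilities P_1 >= P_2 >= ...
print(word_probs[:5])            # P_1 equals s, the zero-length word
```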
Conrad and Mitzenmacher [10] give a carefully constructed definition of power law behavior in the context of the monkey model as the situation where there exist constants $c_1, c_2 > 0$ such that the inequality
$$c_1 r^{-1/\beta} \le P_r \le c_2 r^{-1/\beta} \qquad (1)$$
holds for sufficiently large r. It turns out that $0 < \beta < 1$, so that the exponent $-1/\beta$ lies below $-1$. A few comments about this definition are in order. First, note that there will always be many words with tied probability values. For example, the words $L_1 L_2 S$ and $L_2 L_1 S$ have the same probability $p_1 p_2 s$. Because of this, the word probabilities in this model could never exactly conform to a perfect inverse power law $P_r = C r^{-1/\beta}$. The asymptotic behavior for $P_r$ is more subtle than this and highlights the significance of Conrad and Mitzenmacher’s analysis. They are able to give a stronger result when at least two letter probabilities have an irrational log-ratio $\ln p_i/\ln p_j$. For this case, they show that there exists a constant $C$ such that $P_r \sim C r^{-1/\beta}$, where ∼ denotes asymptotic equivalence, i.e., the ratio of the two sides tends to 1 as $r \to \infty$. For our purposes here, we will simply refer to the Conrad and Mitzenmacher results as generally implying that $P_r$ has approximate power law behavior. Note that inequality (1) implies the weaker asymptotic condition $\ln P_r/\ln r \to -1/\beta$ as $r \to \infty$. We’ll call this asymptotic log-log linear behavior and observe that on a log-log scale, $c_1$ and $c_2$ are asymptotically negligible so that only the exponent $-1/\beta$ becomes significant. As can be seen in Figure 2b–d, to be explained shortly, monkey model word probabilities generated from unequal letter probabilities in the manner described below produce graphs distinctly log-log linear.
Conrad and Mitzenmacher derive the exponent value $1/\beta$ but do not express it in what we consider its simplest form. Following Bochkarev and Lerner [11], the parameter β can be represented as the solution to
$$\sum_{i=1}^{K} p_i^{\beta} = 1 \qquad (2)$$
with $0 < \beta < 1$. Bochkarev and Lerner [11] state the defining condition in the inequality of their Theorem 1 with what is evidently a small misprint, but the intended condition is Equation (2); their exponent parameter is equal to our β. In the Miller model with equal letter probabilities, $p_i = (1-s)/K$, so from Equation (2), β in this case is found to be $\beta = \ln K/\ln\bigl(K/(1-s)\bigr)$. In the Fibonacci example given by Conrad and Mitzenmacher [10] and in Mitzenmacher [17] they use $K = 2$ letters with probabilities $p_1 = 1/2$ and $p_2 = 1/4$, so that $s = 1/4$. Then β is the solution to $(1/2)^{\beta} + (1/4)^{\beta} = 1$, giving $\beta = \log_2\phi \approx 0.694$, where $\phi = (1+\sqrt{5})/2$ is the golden ratio.
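The defining condition (2) is easy to check numerically. The sketch below uses a simple bisection solver (introduced here only for illustration) to reproduce the two values of β just quoted.

```python
import math

def beta_from_letter_probs(p):
    """Solve sum_i p_i**beta = 1 for beta in (0, 1) by bisection (Equation (2))."""
    lo, hi = 1e-9, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(q**mid for q in p) > 1.0:
            lo = mid          # sum still too large: beta must be bigger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Miller's equal-probability case: K = 26 letters, space probability s = 0.18
K, s = 26, 0.18
print(beta_from_letter_probs([(1 - s) / K] * K))   # ~0.943
print(math.log(K) / math.log(K / (1 - s)))         # closed form ln K / ln(K/(1-s))

# Fibonacci example: letter probabilities 1/2 and 1/4, space probability 1/4
print(beta_from_letter_probs([0.5, 0.25]))         # ~0.694
print(math.log((1 + math.sqrt(5)) / 2, 2))         # log_2 of the golden ratio
```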
Conditions leading to $\beta \approx 1$ were not explored in [10,11]. To understand how this behavior can emerge, we define spacings through a random division of the unit interval and then state the Shao–Hahn limit law. Let $X_1, X_2, \ldots, X_{K-1}$ be a sample of $K-1$ i.i.d. random variables drawn from a distribution on $[0,1]$ with a bounded density function $f$. Shao and Hahn present the conditions for this limit in a more general way that reduces to this simpler statement when a density function exists. Write the order statistics of the sample as $X_{(1)} \le X_{(2)} \le \cdots \le X_{(K-1)}$. The K spacings are defined as the differences between the successive order statistics: $D_i = X_{(i)} - X_{(i-1)}$, for $i = 1, \ldots, K$ and with the conventions $X_{(0)} = 0$ and $X_{(K)} = 1$, so that $\sum_{i=1}^{K} D_i = 1$. We’ll refer to this as a generalized broken stick process. By Shao and Hahn [3] Corollary 3.6, we have
$$\frac{1}{K}\sum_{i=1}^{K}\ln(K D_i) \xrightarrow{\;a.s.\;} H(f) - \gamma$$
as $K \to \infty$, where a.s. signifies almost sure convergence, $\gamma \approx 0.5772$ is the Euler constant and $H(f) = -\int_0^1 f(x)\ln f(x)\,dx$ is the differential entropy of $f$. Clearly, $\frac{1}{K}\sum_{i=1}^{K}\ln(K D_i) = \ln K + \frac{1}{K}\sum_{i=1}^{K}\ln D_i$, so dividing through by $\ln K$ gives
$$\frac{1}{K}\sum_{i=1}^{K}\frac{\ln D_i}{\ln K} + 1 \;\to\; \frac{H(f) - \gamma}{\ln K} \qquad (6)$$
as $K \to \infty$. The right side of the limit in (6) goes to 0 because $(H(f)-\gamma)/\ln K$ goes to 0, the entropy $H(f)$ being finite by the boundedness of the density $f$. Expressing logarithms with the radix K then leads to the limit
$$\frac{1}{K}\sum_{i=1}^{K}\log_K D_i \xrightarrow{\;a.s.\;} -1. \qquad (7)$$
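The limit (7) can also be observed directly in simulation. The following sketch computes the average of $\log_K D_i$ over the K spacings of a generalized broken stick for increasing K; the uniform and beta(2,2) samplers are arbitrary examples of bounded densities on $[0,1]$, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_logK_spacing(K, sampler):
    """Average of log_K(D_i) over the K spacings of a generalized broken stick."""
    x = np.sort(sampler(K - 1))                       # K-1 points in (0, 1)
    d = np.diff(np.concatenate(([0.0], x, [1.0])))    # K spacings summing to 1
    return np.mean(np.log(d)) / np.log(K)

for K in (10, 100, 1000, 10000):
    uniform = mean_logK_spacing(K, lambda m: rng.uniform(size=m))
    beta22 = mean_logK_spacing(K, lambda m: rng.beta(2.0, 2.0, size=m))
    print(K, round(uniform, 3), round(beta22, 3))     # both columns drift toward -1
```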
Our universality property for β will now follow almost immediately from this. We use sample spacings to populate the K letter probabilities for the monkey keyboard. Since s is the probability of the space character, define $p_i = (1-s)D_i$ so that $\sum_{i=1}^{K} p_i = 1-s$. Let $G_K = \bigl(\prod_{i=1}^{K} p_i\bigr)^{1/K}$ denote the geometric mean of the letter probabilities. Then from the limit (7), and since $\log_K p_i = \log_K D_i + \log_K(1-s)$ with $\log_K(1-s) \to 0$, we have
$$\log_K G_K = \frac{1}{K}\sum_{i=1}^{K}\log_K p_i \xrightarrow{\;a.s.\;} -1 \qquad (8)$$
as $K \to \infty$. Observe that $0 < \beta < 1$ always holds, because $\sum_{i=1}^{K} p_i^{\beta} = 1 > 1-s = \sum_{i=1}^{K} p_i$ and each $p_i < 1$. Since $\log_K G_K \to -1$ from (8) and since $\beta < 1$, showing that $\beta \ge -1/\log_K G_K$ will prove the argument that $\beta \approx 1$ for finite but sufficiently large K. To see that this is the case, note:
$$1 = \sum_{i=1}^{K} p_i^{\beta} \qquad (9)$$
$$\;\;\ge K\Bigl(\prod_{i=1}^{K} p_i^{\beta}\Bigr)^{1/K} = K\,G_K^{\beta}, \qquad (10)$$
where the inequality in Equation (10) follows from the geometric-arithmetic mean inequality. Therefore, from $K G_K^{\beta} \le 1$ we get $1 + \beta\log_K G_K \le 0$, i.e., $\beta \ge -1/\log_K G_K$, and it has now been established that $\beta \to 1$. In the special case of Miller’s model using all equal letter probabilities, $\beta = \ln K/\ln\bigl(K/(1-s)\bigr) \to 1$ directly.
We want to complement this analytic result for β using a natural and insightful approximation Prof. David Mumford kindly brought to our attention after looking at a preprint of this article. Consider the function $h(\beta) = x^{\beta}$ for fixed $0 < x < 1$. Then $h(\beta) = e^{\beta\ln x}$, and a first order approximation from a Taylor expansion about $\beta = 1$ gives $x^{\beta} \approx x + x\ln x\,(\beta - 1)$. Since β satisfies $\sum_{i=1}^{K} p_i^{\beta} = 1$, it follows that $1 \approx (1-s) + (\beta - 1)\sum_{i=1}^{K} p_i\ln p_i$, giving $\beta \approx 1 - s\big/\sum_{i=1}^{K} p_i\ln(1/p_i)$. If the term $\sum_{i=1}^{K} p_i\ln(1/p_i)$ is sufficiently large, then $\beta \approx 1$. For the class of random variables that we considered in partitioning $[0,1]$ to populate the letter probabilities, this will be the case for large K.
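Both the exact β defined by Equation (2) and the first-order approximation above can be computed for spacings-generated letter probabilities. The sketch below uses uniform spacings and $s = 0.18$ (an assumption chosen to match the examples in this article) and shows both quantities approaching 1 as K grows.

```python
import numpy as np

rng = np.random.default_rng(1)
s = 0.18                                             # space probability, as in the text

def beta_exact(p):
    # bisection for sum_i p_i**beta = 1 on (0, 1), i.e., Equation (2)
    lo, hi = 1e-9, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.sum(p**mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

for K in (10, 100, 1000):
    x = np.sort(rng.uniform(size=K - 1))
    d = np.diff(np.concatenate(([0.0], x, [1.0])))   # K uniform spacings
    p = (1 - s) * d                                  # letter probabilities summing to 1-s
    first_order = 1 - s / np.sum(p * np.log(1 / p))  # first-order approximation above
    print(K, round(beta_exact(p), 4), round(first_order, 4))   # both approach 1
```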
Figure 2 presents graphical results using simulations of sample spacings from several distributions to illustrate our result for β. In Figure 2a we plot the log of the word probabilities against the log of their ranks for the Miller model with exactly the parameter values he used, i.e., $K = 26$ letters, a space probability $s = 0.18$ and equal letter probabilities $p_i = 0.82/26$. The graph is based on the ranks $r = 1, \ldots, 475{,}255$, corresponding to all words of length 0 to 4 non-space letters. Figure 2b through Figure 2d are based on generating $K = 26$ letter probabilities from three different continuous distributions with bounded densities $f$ on $[0,1]$ using a generalized broken stick method to obtain the sample spacings. Again $s = 0.18$ was used for the probability of the space character and the letter probabilities were populated with values from the spacings as $p_i = (1-s)D_i$, $i = 1, \ldots, 26$, so that in each case, $\sum_{i=1}^{26} p_i = 0.82$, as in Miller’s example. For each graph in Figure 2b–d, we generated the largest 475,255 word probability values in order to match the Miller example of Figure 2a. The three continuous distributions, all defined on $[0,1]$, are:
- a uniform distribution with density $f(x) = 1$;
- a beta distribution; and
- a triangular distribution.
The graphs in
Figure 2b–d illustrate our theoretical derivation, but we also note that the very linear plots indicate what appears to be an almost “immediate” convergence to approximate power law behavior exceeding what might be expected from the asymptotics of Conrad–Mitzenmacher and Bochkarev–Lerner.
3. Monkey Twitter: Anscombe’s CLT for the Model with a Finite Word Length Cutoff
The discussion of the finite word length version of the monkey model in [
2] was incomplete because it did not explicitly provide the normalizing constants for the application of Anscombe’s CLT for a random number of i.i.d. variables, as we do here. We also want to explain more clearly the nature of the hybrid Zipf-lognormal distribution that results from a finite length cutoff when letter probabilities are not identical.
The focus in Section 2 on the infinite ensemble of word probabilities ranked to give $P_1 \ge P_2 \ge P_3 \ge \cdots$ has actually obscured the significant hierarchical structure of the monkey model—what Mandelbrot (p. 345, [18]) called a lexicographic tree—which only becomes evident when word length is considered. To see this, write $\Omega_n$ for the multiset of all word probabilities for monkey words of exactly length n. There are $K^n$ probabilities in $\Omega_n$ having a total sum of $s(1-s)^n$ and an average value of $s\bigl((1-s)/K\bigr)^n$, declining geometrically with n. Now taking word length into account, Mandelbrot’s lexicographic tree structure becomes clear with the simple case of $K = 2$ letters: the root node of the tree has a probability of s (for the ”space” word of length 0); it has two branches to the next level consisting of nodes for words of length 1 with probabilities $p_1 s$ and $p_2 s$; these each branch out to the next level with four nodes having probability values $p_1^2 s$, $p_1 p_2 s$, $p_2 p_1 s$, $p_2^2 s$, and so on.
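The counts and sums just described are easily verified by brute force, as in the following sketch; the two-letter probabilities are illustrative.

```python
import itertools
import math

# Levels of the lexicographic tree for an illustrative two-letter keyboard.
p = [0.5, 0.3]        # letter probabilities (any values with sum(p) + s == 1 work)
s = 0.2               # space probability
K = len(p)

for n in range(5):
    level = [s * math.prod(combo) for combo in itertools.product(p, repeat=n)]
    assert len(level) == K**n                               # K^n words of length n
    assert abs(sum(level) - s * (1 - s)**n) < 1e-12         # total mass s(1 - s)^n
    print(n, len(level), round(sum(level), 6), sum(level) / len(level))
```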
The formation of word probabilities as the product of letter probabilities immediately suggests lognormal structure (normal on a log scale). A simple way to exhibit the underlying approximate lognormal character of the finite multisets of word probabilities for all words of exactly length n and for all words of length at most n is to construct their natural product probability spaces using the basic counting measure for letter probability values. For this analysis it makes no difference how the letter probabilities have been specified. We only require that they are positive values and at least two of them are different from each other.
The relevance of the CLT is first seen by representing any probability $P$ in $\Omega_n$ as a product of i.i.d. random variables times the constant s: $P = s\,Y_1 Y_2\cdots Y_n$, where each $Y_j$ takes on one of the letter probability values $p_1, \ldots, p_K$ with probability $1/K$ (i.e., we use the natural counting measure to construct a probability space on $\Omega_n$). Let $\mu = E[\ln Y_j]$ and $\sigma^2 = \mathrm{Var}[\ln Y_j]$. Assume that $\sigma^2 > 0$, i.e., the letter probabilities are not all equal. Then $\mu$ is the mean and $\sigma^2$ is the variance of $\ln Y_j$ for $j = 1, \ldots, n$, and it follows that $n\mu$ and $n\sigma^2$ are the respective mean and variance of $\sum_{j=1}^{n}\ln Y_j$ for $P$ in $\Omega_n$. Therefore, $(\ln P - n\mu)/(\sigma\sqrt{n})$ is asymptotically normally distributed $N(0,1)$ (the term $\ln s$ is rendered negligible in the asymptotics) so that for sufficiently large n, P itself will be approximately lognormal. This approximate normality for the log-probabilities of words of fixed length is quite obvious and was noted in passing by Mandelbrot (p. 210, [19]). However, he missed a stronger observation: that the word probabilities for all words of length at most n—the multiset we will write as $\Omega_{\le n}$—have a distribution that behaves in its central part away from the tails very much like that of $\Omega_n$ itself. That is, a version of the CLT can be applied to words of length at most n, not just to words of length exactly n, and the two distributions, in one sense, are very close to each other.
To show this, we make use of CLT results that pertain to a random number of i.i.d. random variables. In the preceding paragraph, we explained how for $P$ in $\Omega_n$, $\ln P$ could be regarded as the sum of a fixed number n of i.i.d. random variables (plus a constant). Now we will apply an extended version of this same logic to the multiset $\Omega_{\le n}$ of all words of length at most n. For $j = 0, 1, \ldots, n$, there are $K^j$ words of length j. Think of $\ln P$ for $P$ in $\Omega_{\le n}$ as represented by a sum $\ln P = \ln s + \sum_{j=1}^{N}\ln Y_j$, where N is the number of letters in the word. As before, we regard the $Y_j$ as random variables taking on the values $p_1, \ldots, p_K$, each with probability $1/K$. Now we will also let N be a random variable assuming the values $0, 1, \ldots, n$. Define the distribution of N in terms of the natural counting measure for word length: since there are $K^j$ words of length j and a total of $\sum_{j=0}^{n} K^j = (K^{n+1}-1)/(K-1)$ words in all, then $\Pr(N = j) = K^j (K-1)/(K^{n+1}-1)$, $j = 0, 1, \ldots, n$.
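The counting-measure distribution of N is heavily concentrated on the longest word lengths, which is the key fact behind the application of Anscombe's theorem below. A small numerical sketch (using $K = 26$ and $n = 4$, the values matching the figures) makes this concrete.

```python
# Counting-measure distribution of the word length N over all words of
# length <= n, using K = 26 and n = 4 to match the figures.
K, n = 26, 4
total = sum(K**j for j in range(n + 1))                 # 475,255 words in all
pmf = [K**j / total for j in range(n + 1)]              # Pr(N = j) proportional to K^j
print(pmf)
print(sum(pmf[n - 1:]))   # nearly all mass sits on the two longest word lengths
```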
We first state Anscombe’s generalization of the CLT (Theorem 3.1, p. 15, [20]) before applying it. Let $S_n = Z_1 + \cdots + Z_n$ be the sum of n i.i.d. random variables with mean 0 and variance $\sigma^2$, so that $S_n/(\sigma\sqrt{n})$ converges in distribution to $N(0,1)$. Let $N_n$ be an integer random variable such that $N_n/n \to \theta$ in probability as $n \to \infty$ ($\theta$ a positive constant). Then by Anscombe the sum $S_{N_n}$ normalized as $S_{N_n}/(\sigma\sqrt{\theta n})$ also converges in distribution to $N(0,1)$, just as $S_n/(\sigma\sqrt{n})$ does. Note the presence of θ in the denominator in the case for $S_{N_n}$.
It is now clear how to apply this theorem to $\Omega_{\le n}$ as we have constructed it. We will not repeat the calculations of [2] here, but it is easily shown that $N/n \to 1$ in probability as $n \to \infty$. Because this limiting constant θ is 1, Anscombe’s generalization can be applied with the identical normalizing constants $n\mu$ and $\sigma\sqrt{n}$ as used above with $\ln P$ for $P$ in $\Omega_n$. Consequently, it is also true that for $P$ in $\Omega_{\le n}$, the normalized sum $(\ln P - n\mu)/(\sigma\sqrt{n})$ has an asymptotic $N(0,1)$ distribution. In other words, the two random sums—one with a fixed number n of summands, the other with the random number N—behave so similarly in their centers that the word probabilities in $\Omega_n$ and $\Omega_{\le n}$ are both approximately lognormal. (Again, the constant $\ln s$ can be ignored in dealing with the asymptotics for both $\Omega_n$ and $\Omega_{\le n}$.) However, it is essential to remark that the two distributions have very different upper tail behavior. In this regard, Le Cam’s [21] comment that French mathematicians use the term “central” referring to the CLT “because it refers to the center of the distribution as opposed to its tails” is particularly relevant.
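This closeness of the two centers can be checked numerically. The sketch below standardizes the log word probabilities of $\Omega_n$ and $\Omega_{\le n}$ with the same constants $n\mu$ and $\sigma\sqrt{n}$ under the counting measure and compares central quantiles; the small keyboard is an illustrative assumption, not the one used for Figure 3.

```python
import itertools
import numpy as np

# Standardize log word probabilities of Omega_n (length exactly n) and
# Omega_{<=n} (length at most n) with the SAME constants n*mu and sigma*sqrt(n),
# then compare quantiles under the counting measure.
p = np.array([0.40, 0.25, 0.20, 0.05])     # K = 4 unequal letter probabilities
s = 0.10                                   # space probability, sum(p) + s == 1
n = 8
logp = np.log(p)
mu, sigma = logp.mean(), logp.std()

def standardized_log_probs(length):
    vals = np.array([np.log(s) + sum(c)
                     for c in itertools.product(logp, repeat=length)])
    return (vals - n * mu) / (sigma * np.sqrt(n))

omega_n = standardized_log_probs(n)
omega_le_n = np.concatenate([standardized_log_probs(j) for j in range(n + 1)])

qs = [0.25, 0.50, 0.75]
print(np.quantile(omega_n, qs), omega_n.max())
print(np.quantile(omega_le_n, qs), omega_le_n.max())
# The central quantiles differ only modestly, while the largest standardized
# values (the upper tail) differ dramatically, as discussed in the text.
```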
The behavior of $\Omega_{\le n}$ is illustrated in the graphs of Figure 3a,b. The plot in Figure 3a was generated just as in Figure 2b except that a finite word length cutoff of $n = 4$ letters has been applied. To make this clear, in Figure 2b the plot shows the top 475,255 word probabilities in the full infinite ensemble generated using $K = 26$ letter probabilities derived from uniform spacings. In contrast, using the same letter probabilities, the plot in Figure 3a shows the 475,255 word probabilities generated with word length at most $n = 4$ letters, i.e., all the values of $\Omega_{\le 4}$. Word length in the case of unequal letter probabilities is certainly correlated, but not perfectly, with word probability: words of shorter length will, on average, have a higher probability than those of longer length, but except in Miller’s degenerate case, there will always be reversals where a longer word will have a higher probability than a shorter word. In natural languages, Zipf [5] underscored that “the length of a word tends to bear an inverse relationship to its relative frequency”, which he called the Law of Abbreviation. In many respects, this is the starting point for his Principle of Least Effort. However, writing $Q_1 \ge Q_2 \ge Q_3 \ge \cdots$ for the ranked values of $\Omega_{\le n}$, it should be evident that for any rank r, $Q_r \le P_r$ and that $Q_r = P_r$ when n is sufficiently large. In short, $\Omega_{\le n}$ inherits its upper tail approximate power law behavior from the full ensemble, which is illustrated by comparing the top part of the curve in Figure 3a with the corresponding part of Figure 2b.
Figure 3b uses the same data points from Figure 3a graphed as a normal quantile plot, and its roughly linear appearance for the logarithm of word probabilities for $\Omega_{\le 4}$ conforms to an approximate Gaussian fit, although certainly the bending in the upper half of the distribution departs a bit from the linear trend of the lower half. The fact that a distribution can have a power law tail and lognormal central part and still look lognormal over essentially its entire range may seem surprising. The discussion in [22] about power law mimicry in the upper tail of lognormal distributions helps to explain why this can happen. We will call the distribution for $\Omega_{\le n}$ a Zipf-lognormal distribution.
The monkey model with a fixed word length cutoff can prove useful as a motivating idea. In the next section, we will discuss how models with the same branching tree structure have been proposed many times in the past, typically in settings where something like a finite word length is natural to consider. For the moment, think of the social networking service, Twitter, which allows members to exchange messages limited to at most 140 characters. Define Monkey Twitter with a finite limit of $n+1$ characters per message. For convenience, in any implementation of a Monkey Twitter random experiment we will assume that monkeys would always fill up their allotted message space of $n+1$ characters. Monkey words still require a terminating space character, so it is possible for a monkey to type (at most) one “non-word”, which can vary in length from 1 to $n+1$ characters, and will always be the last part of a message string. Non-words are discarded, and the probabilities for the legitimate monkey words, varying in length from 0 to n non-space characters plus a terminating space character, will correspond to the values in $\Omega_{\le n}$.
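A Monkey Twitter run is simple to simulate. The sketch below uses Miller-style equal letter probabilities and a 140-character message length, both illustrative assumptions; it keeps the complete words of each message and discards the trailing non-word.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
chars = list("abcdefghijklmnopqrstuvwxyz") + [" "]
char_p = [0.82 / 26] * 26 + [0.18]        # Miller-style keyboard (illustrative)

def monkey_message(length):
    return "".join(rng.choice(chars, size=length, p=char_p))

def complete_words(message):
    head, sep, _tail = message.rpartition(" ")
    if not sep:
        return []                 # no space at all: the whole message is a non-word
    return head.split(" ")        # "" entries are the zero-length words

counts = Counter()
for _ in range(1000):
    counts.update(complete_words(monkey_message(140)))
print(counts.most_common(5))      # the zero-length word "" dominates, as expected
```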
4. Connections to Other Work
The subject of power law distributions is a vast topic of research reaching across diverse scientific disciplines [
17,
22,
23,
24]. Here we provide a brief sketch of how our results connect to some other work and how they can be considered in a much more general light.
The monkey model viewed in terms of its hierarchical tree structure is the starting point for understanding its general nature. Though Miller did not explicitly present his model as a branching tree, several researchers at almost the same time in the mid-to-late 1950s were highlighting this form using the equivalent of equal letter probabilities to motivate the occurrence of empirical power law distributions in other areas. For example, Beckman [25] introduced this idea in connection with the power law of city populations within a country (“Auerbach’s law” [26]). Using essentially the same logic as Miller, but in a completely different setting, Beckman assumed that a given community will have K satellite communities, each with a constant decreasing fraction of the population at the next higher level. That is, if A is the population of the largest city and $0 < p < 1$ is the decreasing fraction, there will be K nearby satellite communities with population $pA$, $K^2$ smaller communities nearby to these with population $p^2 A$, and so on. Mandelbrot (p. 226, [27]) had an apt expression for this pattern: he described it as “compensation between two exponentials” because the number of observations at each level increases geometrically while the mean value decreases geometrically. Beckman then went on to note that if instead of using a constant decreasing fraction p, one used a random variable X on $(0,1)$, the population at the k-th level down would be a random variable of the form $A X_1 X_2\cdots X_k$, leading to an approximate lognormal distribution of populations at this level. This corresponds to our discussion of monkey word probabilities in $\Omega_n$, although Beckman was not aware of the still stronger statement of lognormality for the probabilities of words of length at most n in $\Omega_{\le n}$ that we demonstrated using Anscombe’s CLT.
Many more examples of a branching tree structure essentially equivalent to Miller’s monkey model with equiprobable letters have been proposed over the years to motivate the occurrence of a huge variety of empirical power law distributions, including such size distributions as lake areas, island areas (“Korcak’s law”), river lengths, etc. Indeed, Mandelbrot’s [
18] classic book on fractals is a rich source of these. This equiprobability case is so simple to analyze that it has been invoked over and over again to illustrate how power law behavior can arise. However, the more complicated case using unequal letter probabilities (or proportions)
and a finite word length cutoff is what really underscores the close analogy between the monkey Zipf-lognormal distribution and other mixture models exhibiting hybrid power law tails and approximately lognormal central ranges.
Empirical distributions of this form have appeared with great frequency. In fact there is a long history of researchers, including Pareto, first discovering what appears to be an empirical power law over an entire distribution because they start out by looking at only the largest observations (cities, corporations, islands,
etc.). However, when they extend their measurements to the part of the distribution below the upper tail,
which is always more difficult to observe, the power law behavior typically breaks down. Several examples of this are chronicled in [
22]. The power law for city sizes noted above is a perfect illustration. Auerbach in 1913 [
26] looked at the 94 largest German cities from a 1910 census and showed a good power law fit, but because he did not include the thousands of smaller communities, he may not have been aware that this relationship breaks down. Modern studies of the
full distribution of communities, such as Eeckhout’s [
28] analysis of the populations of 25,359 places from U.S. Census 2000, prove with high confidence that the
bulk of the community populations fit an approximate lognormal distribution and that the power law behavior is confined to the upper tail.
Montroll and Shlesinger [
29,
30] explained how to generate a Pareto-lognormal hybrid for income distributions by using a hierarchical mixture model of lognormal distributions. This is motivated from several angles, including the notion that higher classes amplify their income by organizing in such a way as to benefit from the efforts of others. Reed and Hughes [
31] have provided a far-ranging framework for understanding these hybrid distributions across the entire spectrum of disciplines where they have been discovered: “physics, biology, geography, economics, insurance, lexicography, internet ecology,
etc.” In [
31] he and Hughes give a condensed summary showing that “if stochastic processes with exponential growth in expectation are killed (or observed) randomly, the distribution of the killed or observed state exhibits power law behavior in one or both tails”. This work encompasses: (1) geometric Brownian motion (GBM); (2) discrete multiplicative processes; (3) homogeneous birth and death processes; and (4) Galton–Watson branching processes. In one of many variations on this theme, in his GBM model [
32] Reed specifies a stochastic process that uses lognormal distributions varying continuously in time with an exponential mixing distribution and shows that this generates a “Double Pareto-Lognormal Distribution”. This has an asymptotic Pareto upper tail and a central part approximately lognormal; in addition, it exhibits interesting asymptotic behavior in the lower tail, where it is characterized by a direct (rather than an inverse) power law. Reed has gone to great effort to demonstrate the high quality of the fit of this distribution for all ranges of values (low, middle, high) across a wide variety of size distributions such as particle and oil field sizes [32], incomes [33], internet file sizes and the sizes of biological genera [31].
In today’s nomenclature, the term
Zipf’s law has come to mean any empirical distribution with an approximate power law form and an exponent close to the value $-1$, not just the word frequency law. (To be clear here, when we refer to power law distributions, we also mean the hybrids with power law tails that we have been considering.) The subset of power laws with this restricted exponent value is surprisingly large and includes the distribution of firm sizes [
34], city sizes [
35], the famous Gutenberg–Richter law for earthquake magnitudes [
36] and many others.
Gabaix [
35] pointed out that while it has long been known that stochastic growth processes could generate power laws, “these studies stopped short of explaining why the Zipf exponent should be 1.” To address this question, he has given a theoretical explanation of the genesis of a $-1$ exponent for the Zipf law for city populations, but in fact, his approach can be regarded more generally. His key idea is a variation of Gibrat’s [37] Law of Proportionate Effect: using a fixed number of normalized city population quantities that sum to 1 and assuming growth processes (expressed as percentages) randomly distributed as i.i.d. variables with a common mean and variance, he proves that a steady state will satisfy Zipf’s law. This proof requires an additional strong assumption in order to reach a steady state, namely, a lower bound for the (normalized) population size. Gabaix goes on to discuss relaxed versions of his model and how it fits into the larger context of similar work done by others.
In their monograph on Zipf’s law Saichev
et al. [
38] modify and extend the core Gabaix idea in numerous ways that render it more realistic. For concreteness, they present their work using the terminology of financial markets and corporate asset values, but they are very clear on how their models are relevant to a broad range of “physical, biological, sociological and other applications”. As with Reed, GBM plays an important role in their investigations, which incorporate birth and death processes, acquisitions and mergers, the subtleties of finite-size effects and other features. Notably, they focus on the sets of conditions that lead to an approximate
$-1$ exponent.
In both Harremoës and Topsøe [39] and Corominas-Murtra and Solé [40], a Zipf law with a $-1$ exponent emerges as a consequence of certain information theoretic ideas pertaining to the entropy function under growth assumptions. In [
39] the focus is on language and the expansion of vocabulary size while simultaneously maintaining finite entropy. The perspective in [
40] is broader, but still based on the idea of systems with an expanding number of possible states “characterized by special features on the behavior of the entropy”.
There is another topic area of statistical research that impacts on our work. Natural language word frequency distributions have been characterized as
Large Number of Rare Events (LNRE) distributions because when collections of text are examined, no matter how big, there are always a large number of words that occur very infrequently in the sample [
41]. LNRE behavior indicates a tip-of-the-iceberg phenomenon resulting from a sampling bias that captures the most common words, but necessarily misses a vast quantity of rarely used words. This is a classic and much studied problem encountered in species abundance surveys in ecological research, where a key question becomes estimating the size of the
zero abundance class—the typically great number of species not observed in a sample because of their rarity [
42].
To get an idea of the significance of this issue, consider that
Figure 3a graphs the parent distribution of word probabilities for the Zipf-lognormal distribution, and not the
sample frequencies as would be obtained from actually carrying out the Monkey Twitter experiment. Simple visual inspection of
Figure 3a indicates that the linear part of the graph (
i.e., the power law tail) holds for about 5 logarithmic decades or about the first 100,000 word probabilities out of the total of 475,255 probabilities plotted there. However, these 100,000 probabilities comprise 97.4% of the total probability mass of all 475,255 values. Intuitively, it should be clear that sampling from this Zipf-lognormal population distribution will be dominated by the Zipf part, not the lognormal part. Unless the sample size was astronomically large, so that the large number of low probability words showed up, the underlying structure of the parent distribution would not reveal itself. To carry this idea still further, imagine the situation if the word length cutoff was on the order of
140 characters, such as with real Twitter. No experiment could be run within any realistic time frame to ever hope to obtain a sample sufficiently large to uncover the true hybrid character of the population distribution—the sample would always appear as a Zipf law, not a Zipf-lognormal law. This kind of
visibility bias has been a constant and recurring theme in all areas where Pareto-Zipf type distributions have been studied [
2,
22]. We believe its significance has been poorly appreciated in relation to this topic.
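The visibility bias described above can be made concrete with a small simulation: build the $\Omega_{\le 4}$ parent distribution from uniform spacings (parameters assumed to mirror Figure 3a), draw a large sample of words from it, and count how many distinct words are ever observed.

```python
import numpy as np

# Build the Omega_{<=4} parent distribution (K = 26 letter probabilities from
# uniform spacings, s = 0.18, assumed to mirror Figure 3a), then sample words
# from it and count how many of the 475,255 distinct words are ever observed.
rng = np.random.default_rng(3)
K, n, s = 26, 4, 0.18
x = np.sort(rng.uniform(size=K - 1))
p = (1 - s) * np.diff(np.concatenate(([0.0], x, [1.0])))

levels = [np.array([s])]
for _ in range(n):
    levels.append(np.outer(levels[-1], p).ravel())
probs = np.concatenate(levels)            # all 475,255 word probabilities
probs = probs / probs.sum()               # normalize to a proper distribution

sample = rng.choice(probs.size, size=1_000_000, p=probs)
seen = np.unique(sample).size
print(seen, "distinct words observed out of", probs.size)
# Even a million draws leave most of the lognormal bulk unobserved.
```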
Indeed, there are many challenging questions in connection with the sampling properties of variations of the monkey model and how they compare to natural language text. Ferrer-i-Cancho and Gavaldà [
43] show that even using the simple Miller model, there is a need for careful analysis to properly evaluate the deviations between actual frequency histograms from finite samples and their theoretical expected values. The analysis in [
44] demonstrates that the monkey model of independently combined characters, even when the number of characters and their probabilities are populated with values corresponding to specific real language target text samples, yields sample distributions that depart in systematic ways from that of the target text. Bernhardsson
et al. [
45] and Yan and Minnhagen [
46] have highlighted how real language corpora have frequency distributions whose characteristics vary with the size of the sample and do not correspond to sampling properties of the monkey model. As just one example, in [
46] it is argued that an approximate power law form such as Zipf’s law cannot account for the fact that the scaling exponent typically increases with increasing sample sizes in real language. Their
Figure 2d shows the clear fluctuations in the exponent as a function of text length.
We conclude by emphasizing how the monkey model fits into a much larger conceptual scheme than appears at first glance. For example, by viewing the model as a branching tree—Mandelbrot’s lexicographic tree—the self-similar character of its construction becomes clear. In fact, the model can be construed as a fractal object with a fractal dimension of β, where β is the exponent parameter studied in Section 2 ([18,47]). Our application of random spacings is
also very much in the spirit of the enormously fruitful study of
random matrix theory and a class of stochastic systems referred to as the
KPZ universality class. Along these lines Borodin and Gorin [
48] have discussed a variety of probabilistic systems “that can be analyzed by essentially algebraic method”, yet are applicable to a broad array of topics. Miller’s model, analyzable with simple algebra, appears to us as an elementary example perhaps falling into this category.