1. Introduction
The study of natural languages is extremely important not only for the human and social sciences, but also for the sciences that study the development patterns of complex systems (synergetics, cybernetics, etc.). An important branch of language science is quantitative linguistics, which uses mathematical methods to establish language laws (note that the objectives and methods of quantitative linguistics go beyond the mere study of linguistic laws; see, e.g., [1,2]). At present, several such laws are recognized, the most famous among them being Zipf’s law, Herdan’s law, the Brevity law, and the Menzerath–Altmann law [3,4,5,6,7]. Such laws, found mostly by statistical methods, indicate existing regularities between various elements of language (phonemes, words, etc.).
The most important element of language is the sentence—the object of this study. According to the Cambridge Dictionary, a sentence is a group of words, usually containing a verb, that expresses a thought in the form of a statement, question, instruction, or exclamation. Sentences have semantic completeness; they express a particular thought of a person and serve to communicate it to other people. Given these qualities, the study of these structural units is essential for cognitive science and is of great interest. A metaphor from atomic physics is appropriate to illustrate this, especially for representatives of the natural sciences. Many properties of an atom are determined from the radiation (spontaneous and stimulated) that it emits and/or absorbs. A person (the human brain) also “emits” and perceives elementary flows of thought in the form of sentences, and the characteristics of this “human radiation” can reveal a lot about both the person and their environment.
An important quantitative characteristic of a sentence is its length, which can be measured in various ways (the number of letters, words, etc.). The study of sentence lengths does not require special linguistic training and can easily be processed by computer. As a result, this quantity has been studied for a long time and is used to determine the authorship of a work, the genre of a text, the cognitive development of the author or reader (listener), the level of language proficiency, etc. [8,9,10,11,12,13,14]. Two regularities have been noticed regarding sentence length.
The first regularity is a decrease in the average sentence length over time. The decrease may vary depending on the genre and language of the text [15,16,17,18]. In particular, according to an analysis of English texts [15]: “fiction sentences are approximately (on average) 6.5 words shorter now than they were in the beginning of the nineteenth century”. The second regularity is the asymmetry of the sentence length distribution in a text (the distribution functions are not normal). Various laws have been proposed to describe the sentence length distribution; the log-normal is the most common, but others have also been suggested, in particular the gamma and hyperpascal distributions [4,6,19,20,21,22,23,24]. These regularities are associated with various factors; in particular, attempts have been made to connect the log-normal distribution law with certain stochastic multiplicative processes of sentence formation and the central limit theorem in logarithmic space [22]. To date, there is no single general explanation of the noted regularities.
At the same time, the so-called principle of least effort [25] has existed for a long time in cognitive linguistics. According to this principle, language changes because speakers simplify their speech in various ways. This principle was suggested by G. Zipf. In 1949 he wrote: “a person, in solving his immediate problems, will view these against the background of his future problems, as estimated by himself. Moreover, he will strive to solve his problems in such a way as to minimize the total work that he must expend in solving both his immediate problems and his probable future problems. That in turn means that the person will strive to minimize the probable average rate of his work-expenditure (over time). And in so doing he will be minimizing his effort. Least effort, therefore, is a variant of least work.” [25]. Note that G. Zipf was not the first to consider this kind of principle. In discussing the close connection between thinking and language, it is necessary to mention E. Mach and his principle of the economy of thought (1864): “when the human mind, with its limited powers, attempts to mirror in itself the rich life of the world, of which it itself is only a small part, and which it can never hope to exhaust, it has every reason for proceeding economically” [26].
Starting with G. Zipf, the discussed principle of least effort has been used to explain the different frequencies of words of various lengths, the origins of scaling in human language, etc. (see, e.g., [27,28]). However, even at the sentence level, this principle allows us to explain the two above-mentioned regularities from a single standpoint. Indeed, languages have evolved so that language users can communicate using sentences that are relatively easy to produce and comprehend. It is worth quoting a fragment from Ref. [29]: “Various models of human sentence production and comprehension predict that long dependencies are difficult or inefficient to process; minimizing dependency length thus enables effective communication without incurring processing difficulty”. Thus, over long-term observation of a language, sentence length will decrease. Let us consider the application of this principle on a significantly smaller timescale—the time of creation of a text by its author. The author strives to express each of his thoughts in the most economical, shortest way. As a result, the author consciously and unconsciously tends to use the sentence of minimum length L among the variety of sentences {L1, L2, …, Ln} that are similar in content, i.e., L = min{L1, L2, …, Ln}. It is well known from mathematical statistics [30,31] that the distribution of the minima of a random variable corresponds to the Weibull distribution (strictly: if L = min{L1, L2, …, Ln}, n → ∞, and L1, L2, …, Ln are identically distributed non-negative random variables, then L obeys the Weibull distribution function [30,31]). Thus, the principle of least effort unambiguously indicates that sentence lengths, for a sufficiently large sample, should be described by the Weibull distribution and not by any other distribution. It is interesting to note that the Weibull distribution is a two-parameter asymmetric distribution that generalizes the well-known one-parameter Rayleigh distribution and can be reduced to a gamma distribution by a change of variable.
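The convergence of minima to a Weibull law is easy to illustrate numerically. The following sketch is our illustration only and is not part of the analysis below; the parent distribution, the number of candidates n, and the sample size are arbitrary assumptions. It simulates minima of i.i.d. non-negative random variables and fits a two-parameter Weibull distribution to them with SciPy:

```python
# Illustrative simulation: minima of i.i.d. non-negative random variables
# are well described by a Weibull distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 50          # hypothetical number of candidate sentences per thought
samples = 5000  # number of simulated minima

# Candidate "lengths" drawn from a non-negative parent distribution
# (a gamma distribution is used here purely for illustration).
candidates = rng.gamma(2.0, scale=10.0, size=(samples, n))
minima = candidates.min(axis=1)

# Fit a two-parameter Weibull (location fixed at zero) and check the fit.
k, loc, lam = stats.weibull_min.fit(minima, floc=0)
ks_stat, p_level = stats.kstest(minima, "weibull_min", args=(k, loc, lam))

print(f"fitted shape k = {k:.2f}, scale lambda = {lam:.2f}")
print(f"Kolmogorov-Smirnov statistic = {ks_stat:.3f}, p-level = {p_level:.3f}")
```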
The purpose of this work is to check the applicability of the Weibull distribution to the distribution of sentence lengths and to determine the law governing the decrease in sentence length over time. The results can provide additional justification for the applicability of the principle of least effort to the elementary units of human speech that carry a particular thought.
2. Data for Analysis
The object of this research is the public speeches of politicians. Such studies have been carried out several times before (see, e.g., [32,33,34]). However, the objectives of those studies were different from the objective of this work (readability and sentiment analysis, letter frequency distribution, etc.). Political speeches are a convenient object of research, since this is a form of oral speech that is well documented over sufficiently long time spans. Political speeches are positioned between spoken and written ways of expressing thoughts. Unlike everyday spoken speech, the speech under consideration is more considered, prepared ahead of time, and less spontaneous from the speaker’s point of view. At the same time, in comparison with written speech, political speeches are more focused on the listener and, therefore, have a greater emotional component and a tendency to be easily understood. As a result, political speeches are extremely valuable research material. Such speeches are usually aimed at some “average” citizen—the voter—and, therefore, the processing of such data reflects temporal changes in the majority of native speakers.
We analyzed the text transcripts of the 59 inaugural speeches of US presidents from 1789 to 2021 and 224 texts of speeches of UK party leaders from 1895 to 2018 (available in [35] and [36], respectively). The studied speeches of US presidents are uniformly distributed every four years. The time distribution of the speeches of UK party leaders is not as uniform (due to copyright, the appearance of a new large party in parliament in 1977, etc.), but the dataset is much more extensive. Note that no UK speeches were processed for 1898, 1914–1917, 1931, 1938–1940, 1944, 1952–1954, or 1959. The list of analyzed speeches is presented in the Appendices.
Sentence length was counted from one period (full stop) to the next; the unit of measurement was words separated by spaces (prior to analysis, we replaced all question marks, exclamation marks, and ellipses with periods, and also removed all dots used in decimal numbers). Note that the selected unit of measure for sentence length is not the only possible one. Words were selected as the unit of measure primarily because of the simplicity and wide prevalence of this approach. It should be noted that, according to [37], sentence length is robust with respect to the selection of the unit of measurement. Thus, the choice of the word (and not, e.g., the letter) will not change the results of further analysis. The calculation was carried out automatically using a developed and tested computer program (see the example in Figure 1).
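A minimal sketch of this counting procedure, assuming only the preprocessing rules described above (the paper’s own tested program is not reproduced here), could look as follows:

```python
# Simplified sketch of the sentence-length computation described in the text.
import re

def sentence_lengths(text: str) -> list[int]:
    """Return the number of space-separated words in each sentence of `text`."""
    # Remove dots used in decimal numbers (e.g., "3.14").
    text = re.sub(r"(?<=\d)\.(?=\d)", "", text)
    # Replace question marks, exclamation marks, and ellipses with a period.
    text = re.sub(r"\.{3}|…|[!?]", ".", text)
    # Split into sentences at periods and count words separated by spaces.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [len(s.split()) for s in sentences]

print(sentence_lengths("We will do it. Will we? Yes!"))  # -> [4, 2, 1]
```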
Although the studied speeches span a long period of time, the total number of words in the speeches shows no reliable change over time (Figure 2). The average length of a speech in words was 2331 ± 355 for the US and 5434 ± 774 for the UK.
Statistical analysis was performed using the well-known and widespread commercial package Statistica 12.0 (TIBCO Software). The data (year of the speech and the processed sentence lengths of each speech) are openly available [38].
4. Analysis of the Sentence Length Distribution Law
To analyze the sentence length distribution law, a number of speeches of the US presidents were excluded from the initial data. First, small speech texts containing fewer than 40 sentences were excluded (the speeches of 1789, 1793, 1797, 1813, 1829, 1833, 1849, 1865, 1869, 1905, and 1945). Second, since the texts of public oral speeches are analyzed, the texts of 1953, 1961, 1973, and 1981 were excluded because these speeches were not spoken but only written. Third, the speeches of 1801, 1805, 1837, 1877, 1881, 1893, 1941, 1965, and 1969 were not processed, since these texts have a multimodal distribution (the reasons for this and the analysis of these distributions could be the subject of a separate work). Thus, the analysis of the distribution law was carried out on 31 inaugural speeches of US presidents. All 224 speeches of the UK party leaders were analyzed; however, 31 texts were excluded due to the low significance level (<0.05) of the results obtained for all tested distribution laws. Single outliers were excluded from the datasets before analysis.
Six distributions with no more than two parameters (log-normal, Weibull, folded normal, half-normal (normal), generalized Pareto, and Rayleigh) were analyzed in order to find the theoretical distribution that best describes the studied empirical distributions. The ranking of these distributions by the quality of data description was carried out according to the Kolmogorov–Smirnov criterion: the larger the p-level, the better the distribution describes the empirical data and, accordingly, the higher its place in the comparison.
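The following sketch illustrates how such a ranking could be reproduced. It is our approximation only: SciPy’s parameterizations of the six distributions stand in for Statistica’s, and the synthetic data stand in for the sentence lengths of one speech.

```python
# Illustrative ranking of six candidate distributions by the KS p-level.
import numpy as np
from scipy import stats

candidates = {
    "log-normal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "folded normal": stats.foldnorm,
    "half-normal": stats.halfnorm,
    "generalized Pareto": stats.genpareto,
    "Rayleigh": stats.rayleigh,
}

def rank_distributions(lengths: np.ndarray) -> list[tuple[str, float]]:
    """Fit each candidate and rank by the Kolmogorov-Smirnov p-level."""
    results = []
    for name, dist in candidates.items():
        try:
            params = dist.fit(lengths, floc=0)  # keep the support starting at zero
        except Exception:
            params = dist.fit(lengths)
        # Note: parameters are estimated from the tested data, so the p-levels
        # serve only as a relative ranking measure, as in the text above.
        _, p_level = stats.kstest(lengths, dist.name, args=params)
        results.append((name, p_level))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Example with synthetic data standing in for one speech:
rng = np.random.default_rng(1)
lengths = rng.weibull(1.9, size=300) * 25
for name, p in rank_distributions(lengths):
    print(f"{name:>20s}  p-level = {p:.3f}")
```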
Table 2 and Table 3 show the number of times each of the six listed sentence length distributions was among the top three (see Appendix A for more information).
Table 2 shows that, in 14 inaugural speeches of US presidents, the Weibull distribution took first place in terms of significance; in another 14 it took second place; and in 3 it took third place. Thus, the Weibull distribution is the only distribution that adequately describes all speeches, always ranking in the top three. The average significance level for speeches where the Weibull distribution was in first place is 0.73, and for the second and third places it is 0.5 and 0.3, respectively (see Appendix A). The log-normal distribution, ranked in the top three for 19 speeches, describes the data somewhat worse than the Weibull one. The average significance levels for the log-normal distribution are 0.67 for the first place (13 speeches), 0.37 for the second place (5 speeches), and 0.08 for the third place (1 speech). Similar results can be seen for the UK speeches (see Table 3 and Appendix A).
Thus, according to the performed statistical analysis, the Weibull distribution is the most preferable for describing the studied speeches. The Weibull cumulative distribution function has the form 1 − exp(−(x/λ)^k), where λ and k are the scale and shape parameters, respectively. Examples of the description of the experimental data using the Weibull distribution are presented in Figure 6. Note that the one-parameter Rayleigh distribution, ranked third in description quality according to the analysis results, is a special case of the Weibull distribution with the shape parameter equal to two (see Table 2 and Table 3).
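As a quick illustration of the last point (our check, with an arbitrary scale value), the Weibull CDF with k = 2 coincides with the Rayleigh CDF when the Rayleigh scale is taken as λ/√2:

```python
# Numerical check that a Weibull distribution with shape k = 2 is a Rayleigh
# distribution: with Weibull scale lambda, the Rayleigh scale is lambda/sqrt(2).
import numpy as np
from scipy import stats

lam = 20.0
x = np.linspace(0.0, 80.0, 9)
weibull_cdf = stats.weibull_min(c=2.0, scale=lam).cdf(x)
rayleigh_cdf = stats.rayleigh(scale=lam / np.sqrt(2.0)).cdf(x)
print(np.allclose(weibull_cdf, rayleigh_cdf))  # True
```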
The behavior of the parameters of the Weibull distribution over time is shown in Figure 7 and Figure 8 (see Appendix B for numerical details). It follows from Figure 7 that the scale parameter reliably decreases over time. Since this parameter is known to be directly proportional to the mean, median, and mode of the Weibull distribution, this once again confirms the above statement about the decrease in the average (and the most probable) sentence length. The shape parameter for US speeches does not reliably change over time and is equal to 1.9 ± 0.1. At the same time, the shape parameter for UK speeches is slightly increasing, changing from 1.5 to 1.8 over the past 100 years. This is the only difference found when comparing sentence lengths for US and UK speeches. Since the change over time is not large (the slope of the line is 0.002 ± 0.001), an analysis of this result using additional data is required for reliability.
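For reference, the standard expressions for the central characteristics of the Weibull distribution make this proportionality explicit: at a fixed shape parameter k, all three quantities scale linearly with λ.

```latex
\mathrm{mean} = \lambda\,\Gamma\!\left(1+\tfrac{1}{k}\right), \qquad
\mathrm{median} = \lambda\,(\ln 2)^{1/k}, \qquad
\mathrm{mode} = \lambda\left(\tfrac{k-1}{k}\right)^{1/k} \quad (k > 1).
```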
Table 4 summarizes the results on the behavior of the parameters of the Weibull distribution over time.
Thus, the time behavior of the parameters of the Weibull distribution allows us to conclude that, over the past two hundred years, the sentence length distribution has become narrower: the width of the peak decreases and its abscissa shifts slightly to the left. This is demonstrated in Figure 9. As a result, over time, speeches come to be composed of shorter sentences of similar length; the spread in sentence length decreases. In terms of sentence lengths, the text becomes more ordered. This can be seen in Figure 10, where the information entropy (Shannon entropy) is presented as a function of time. The calculation of this value was based on sentence length distribution histograms containing the probabilities of finding, in a speech, a sentence whose length falls within a certain interval.
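A minimal sketch of such an entropy calculation (our reconstruction; the bin width and the logarithm base are assumptions, since they are not specified here) is given below. A narrower sentence-length distribution yields a lower entropy, in line with the trend described above.

```python
# Shannon entropy of a binned sentence-length distribution (illustrative sketch).
import numpy as np

def sentence_length_entropy(lengths: np.ndarray, bin_width: int = 5) -> float:
    """Shannon entropy (in bits) of the binned sentence-length histogram."""
    bins = np.arange(0, lengths.max() + bin_width, bin_width)
    counts, _ = np.histogram(lengths, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # ignore empty bins
    return float(-(p * np.log2(p)).sum())

# Example with synthetic data (assumed parameter values):
rng = np.random.default_rng(2)
broad = rng.weibull(1.5, 300) * 35
narrow = rng.weibull(1.9, 300) * 15
print(sentence_length_entropy(broad), sentence_length_entropy(narrow))
```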
5. Conclusions
Based on the calculation of sentence lengths in the text transcripts of the inaugural speeches of the US presidents for 228 years and the annual speeches of the UK party leaders for 123 years, two main results were obtained:
1. The average sentence length for both US and UK speeches decreases linearly with time, with a slope of 0.13 ± 0.03 words/year; on average, from 1900 to 2000, sentence length decreased from 30 to 16 words.
2. The sentence length distribution for both US and UK speeches is better described by the Weibull distribution (in particular, in comparison with the log-normal). The scale parameter of this distribution reliably decreases over time from 35 to 15. The shape parameter for US speeches does not change over time and is equal to 1.9 ± 0.1, and the shape parameter for UK speeches slightly changes over time from 1.5 to 1.8.
These two results are in agreement with the principle of least effort: the speaker, attempting to minimize both their own effort and that of the listeners, tends to choose the shortest possible sentence from a potential set of sentences of approximately the same content. As a result, on the one hand, the sentence length distribution begins to correspond to the distribution of minimum values—the Weibull distribution—and on the other hand, at time intervals significantly longer than the speech preparation time, the average sentence length decreases. The detected change over time in the scale parameter of the Weibull distribution and in the information entropy indicates that sentence length in public speeches is gradually becoming less diverse; it is being unified and standardized.
Here we highlight the following idea. When establishing the distribution type for empirical data, what matters most is not statistical tests but the theoretical justification. If we accept the principle of least effort, then the Weibull distribution clearly follows from it. If one assumes that the principle of least effort is not suitable here, then one must obviously propose some other theoretical justification—one’s own principle—and theoretically derive, for example, the gamma or log-normal distribution from it. Currently, we do not see such attempts. G. Zipf’s principle, in our opinion, is very profound and productive, and many interesting consequences can be obtained from it. It has great potential, which has not yet been fully embraced by modern linguists. Our work and a number of other works (see, e.g., [27,28,39,40,41,42,43]) show how useful it can be.
An interesting continuation of this work could be the verification of the obtained results using breath groups [6]. Detecting correlations and differences in such a joint analysis of breath groups (largely related to human physiology) and sentence lengths (largely related to cognitive processes) is a very interesting task. One of the problems in this direction will be the significantly smaller statistical database for breath groups in comparison with the almost limitless database for sentence lengths. Another interesting development of this work, within the currently well-established directions connected to language complex networks (see, e.g., [39,40,41,42,43]), could be an analysis of the data obtained here from the standpoint of the principle of compression, which appeared as a development of the ideas of G. Zipf [43]. It seems to us that the results of this work, combined with the principle of compression and with the use of Kolmogorov complexity ideas (from algorithmic information theory), could be very promising. The patterns found for English also require validation for other languages, including artificial ones, as well as with other methods and linguistic units (letters, initial characters, words, etc.). In this regard, works [44,45,46] may be useful.
In conclusion, we return to the metaphor from physics given in the introduction. The analysis of atomic emission spectra and the Planck formula for the wavelength distribution revolutionized the understanding of atomic properties, leading to the formulation of the laws of the quantum world—quantum mechanics. The distribution law of the lengths of utterances (sentences) “emitted” by the brain, corresponding to the Weibull distribution, may likewise stimulate the development of the brain sciences. One possible direction, related to brain biophysics, may be the study of the energetic basis of the origin and development of thought and language. There are a number of works in this direction, in particular [4,6,47,48,49,50,51,52,53]. Considering thought as a complex non-equilibrium process, it can be concluded that its development matches the well-known principle of maximum entropy production. According to this principle, causes (stimuli) generate responses that maximize the thermodynamic entropy production [47,54,55]. One of these responses in the course of the evolution of human thinking was the origin of language. This revolutionary bifurcation process led to an abrupt increase in energy consumption and, as a result, an increase in entropy production in a nonequilibrium system, i.e., in the neural networks of humans who became users of language. Naturally, a spontaneously emerged structure (network) could not be optimal at inception: only a certain basic structure (framework) of language was formed, which had some imperfections. Subsequently, already at the high level of energy consumption and entropy production achieved after the bifurcation, the nonequilibrium system began to evolve over a rather long time, trying to minimize energy consumption [54,55]. This minimization will no longer return the system to its previous, pre-bifurcation values of entropy production; however, due to the optimization of the neural network processes responsible for language, a small decrease is possible. According to nonequilibrium thermodynamics, this optimization process proceeds in accordance with Prigogine’s principle of minimum entropy production [47,54,55]. Its linguistic analogue can be considered the principle of least effort (least effort implies less energy spent on communication and, consequently, less energy dissipation). The information on the “simplification” of language (a decrease in its entropy) discovered in this work can be considered a confirmation that language is currently going through a second (minimizing) stage of development.