1. Introduction
The words produced in conversational speech often differ substantially from the acoustic forms assumed by canonical dictionary entries [1,2]. The extent to which articulated forms deviate from dictionary models correlates with average word frequency, such that there is a general tendency for shorter and faster articulation in more probable words. This property of speech codes is often taken to suggest that human speech is shaped by the competing requirements of maximizing the success of message transmission while minimizing production effort, in ways similar to those described by information coding solutions for electronic message transmission. There are, however, some critical differences between speech and the communication model described by information theory [3]: whereas information theory is concerned with defining the properties of variable-length codes optimized for efficient communication in discrete, memoryless systems, human communication codes, at first blush at least, appear neither systematic [3], nor systematically discrete [2,4], nor memoryless [5].
In regard to the first point, systematicity, humans learn to communicate by gradually discriminating functional (task-relevant) speech dimensions from the samples to which they are exposed. Yet because lexical diversity in language samples increases nonlinearly over space and time, the divergence between the samples individuals are exposed to increases as their experience of the linguistic environment grows [5]. A system defined by a probabilistic structure would appear to require that events be distributed in a way that allows the relationships between event probabilities to remain stable independent of sample size, yet the way that words are distributed across language samples suggests that human languages do not satisfy this requirement.
Considering the second point, discreteness: although writing conventions lead to some systematic agreement about what linguistic units are, such that words are often thought of as standard discrete linguistic units, speech appears to be different. Human intuitions about boundaries in speech diverge as exposure increases. When literate adults, nonliterate adults, and children are asked to divide a speech sequence into units, their intuitions about where any given sequence should be split into multiple units exhibit a systematic lack of agreement [6]; similar effects have been observed when people are asked to discriminate phonetic contrasts [7].
As for memorylessness, which supposes a distribution of events such that an event’s probability is independent of the way it is sampled, it has been shown that increased exposure to language leads to a decrease in the informativeness of high-frequency tokens relative to the words they co-occur with, such that the informativity relationships between words appear to be unstable across cohorts [5]. For instance, the information that blue provides changes systematically as people successively hear about blue skies, blue eyes, blue berries, etc. at different rates, an effect that increases nonlinearly with the number of blue covariates that speakers encounter.
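To make this point concrete, the toy sketch below (Python, with hypothetical counts that are not drawn from any corpus) tracks how the uncertainty about what follows blue grows as a learner accumulates more covariates at different rates; the rising conditional entropy is one simple way of quantifying the decline in the informativeness of a high-frequency cue.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a frequency table."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical exposure: continuations of "blue" accumulate at different rates.
exposure_stages = [
    ["skies"] * 8 + ["eyes"] * 2,                          # early experience
    ["skies"] * 20 + ["eyes"] * 10 + ["berries"] * 5,      # more covariates
    ["skies"] * 40 + ["eyes"] * 25 + ["berries"] * 15
    + ["jays"] * 8 + ["moods"] * 5 + ["whales"] * 2,       # still more
]

seen = Counter()
for i, stage in enumerate(exposure_stages, 1):
    seen.update(stage)
    print(f"stage {i}: {len(seen)} covariates, "
          f"H(next word | 'blue') = {entropy(seen):.2f} bits")
# The entropy of the continuation distribution increases with exposure,
# so the cue "blue" becomes less informative about what follows it.
```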
To summarize these points, it is clear that adult expectations about events and their probabilities vary with experience. This in turn seems to suggest that the increasing divergence between individual speakers’ models will lead to an increase in communication problems between speakers. Nevertheless, sufficiently successful communication between speakers of different experience levels is not only possible but also relatively common. How?
Recent work by Ramscar [3] addresses these apparent communication problems from the perspective of discriminative learning and suggests that, unlike the predefined source codes of artificial communication systems, human communicative codes are subcategorized by systematic patterns of variation in the way words and arguments are employed. The empirical distributions discriminated by these patterns of variation serve both to minimize communicative uncertainty and to make unattested word forms predictable in context, thereby overcoming some of the problems that arise from the way that linguistic codes are sampled. In support of this argument, Ramscar presents evidence that the empirical distributions shaped by communicative contexts are geometric and suggests that the power laws that commonly characterize word token distributions are not in themselves functional but rather result from the aggregation of multiple functionally distinct communicative distributions [8]. Importantly, unlike power laws, the geometric distribution is sampling invariant and thus directly satisfies many of the constraints defined by information theory [9,10]. Perhaps even more importantly, geometric distributions also appear to maximize the likelihood that, independent of exposure, learning will lead speakers to acquire similar models of the distribution of communicative contrasts in context, thereby enabling a high degree of mutual predictability and helping to explain why human communicative codes work as well as they do.
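A minimal simulation (Python/NumPy, with arbitrary parameters chosen purely for illustration) can make both claims tangible: subsampling a geometric rank-frequency distribution leaves its semi-log slope essentially unchanged, while aggregating several geometric distributions with different decay rates produces a curve whose slope varies with rank, i.e., a heavier, more power-law-like tail.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_ranks(p, n_tokens, n_types=200):
    """Sample token type indices from a (truncated) geometric type distribution."""
    probs = p * (1 - p) ** np.arange(n_types)
    probs /= probs.sum()
    return rng.choice(n_types, size=n_tokens, p=probs)

def semilog_slope(tokens):
    """Slope of log(frequency) against frequency rank; for a geometric
    distribution this is roughly constant, i.e., the plot is linear."""
    freqs = np.sort(np.bincount(tokens))[::-1]
    freqs = freqs[freqs > 0]
    ranks = np.arange(1, len(freqs) + 1)
    return np.polyfit(ranks, np.log(freqs), 1)[0]

full = geometric_ranks(p=0.05, n_tokens=100_000)
sub = rng.choice(full, size=5_000, replace=False)       # a much smaller sample
print("slope, full sample:", round(semilog_slope(full), 4))
print("slope, subsample:  ", round(semilog_slope(sub), 4))  # close to the full-sample slope

# Aggregating geometric distributions over distinct type inventories and with
# different decay rates yields a heavier tail: the semi-log slope is no longer
# constant across ranks, unlike a single geometric distribution.
mixture = np.concatenate([
    geometric_ranks(p, 30_000) + i * 200   # offset so inventories stay distinct
    for i, p in enumerate((0.02, 0.08, 0.3))
])
freqs = np.sort(np.bincount(mixture))[::-1]
freqs = freqs[freqs > 0]
head, tail = freqs[:30], freqs[30:]
print("mixture slope, head:", round(np.polyfit(np.arange(1, 31), np.log(head), 1)[0], 4))
print("mixture slope, tail:", round(np.polyfit(np.arange(31, 31 + len(tail)), np.log(tail), 1)[0], 4))
```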
A notable finding in this regard comes from an analysis of names (a universal feature of communicative codes that is almost universally ignored by grammatical theories), and in particular the distributions of English and Sinosphere first names [3,11]. This analysis shows that, historically, prior to the imposition of name laws that distorted their initial distributions, first names across a range of cultures had near-identical geometric distributions. Names are a unique aspect of language in that their composition is highly regulated in virtually all modern nation states. Functionally, name sequences serve to discriminate between individuals, and it follows that fixing distributions of name tokens by law in turn fixes the discriminatory power of those distributions. The 20th century was characterized by large global increases in population size; that is, the number of individuals that name distributions must serve to discriminate between has increased. In Western cultures, this has had two consequences: first, the fixing of last names has caused the information in the distribution of first names to increase in direct proportion to increases in population [3,11]. Second, it has led to an increase in the diversity of regional first name distributions across very large states such as the United States. An interesting consequence of this is that, although the first name distribution in the US as a whole follows a power law, the distributions of names in the individual states still show close fits to the geometric, indicating that the shape of the former distribution may reflect the aggregation of the latter [3,8].
These results suggest that, across space and time, discriminative codes somehow respond to the various communicative pressures imposed by the environment in ways that sustain the sampling invariance that seems to be crucial to efficient, systematic communication. Name distributions underline this point in particular, since individual contributions to the name pool appear, at least at first blush, to be somewhat random. These findings offer a new perspective on the apparent similarities and differences between communication in the human and the information theoretic sense and raise some interesting questions in regard to speech. To what extent are speech codes shaped by the competing pressures of providing sufficient contrast to communicate the required distinctions while retaining a sufficiently stable structure to allow mutual predictability over the course of learning? Is the variance in the forms people actually articulate a consequence of the uncertainty in the structure of the immediate context in which they are learned and used, and does this variance have a communicative function?
The following sections briefly review the theoretical background to the present analysis. Section 1.1 reviews some key findings about linguistic distributions that appear to support their communicative function. Section 1.2 describes some of the implications of these findings for speech, and finally, Section 1.3 lays out a set of explicit predictions derived from this theoretical analysis. These are then examined in the rest of the article.
1.1. Grammar as Context—Convention Shapes Learning, Learning Shapes Context
It seems clear that human communication codes are not shared in the predefined way that information theory supposes [3]. Natural languages are learned from exposure to specific, incomplete samples, and these can diverge considerably across cohorts. This in turn suggests that any communicative system operating on global word token probabilities will be inefficient and unsystematic, because the bursty, uneven distributions of low-frequency tokens observable in large language samples indicate that a large portion of types will be either over- or underrepresented across the communicative contexts any individual speaker is exposed to. At the same time, the fact that regularities in human languages can be consistently captured and shared through linguistic abstractions at many different levels of description suggests that speech provides speakers (and learners) with probabilistic structures that are sufficiently stable to ensure that the most important linguistic conventions will be learnable from the samples all speakers are exposed to. For example, Blevins et al. [12] suggest that the existence of grammatical regularities in the distribution of inflectional forms serves to offset many of the problems that arise from the highly skewed distribution of communicative codes, since the neighborhood support provided by morphological distributions makes forms that many speakers are otherwise unlikely to encounter inferable from a partial sample of the code.
The fact that pseudowords can be interpreted in context [13] (for example, He drank the dord in one gulp.) offers another illustration of this point. Here, the lexical context provides sufficient support for the inference that dord is likely a drink of some sort, regardless of whether it is familiar to the speaker or correlated with a real-life experience. (In the former case, if dord were to occur more regularly and in correlation with an actual bottled or cupped substance in the world, it would become a part of the vocabulary, losing its non-word status.) These kinds of context effects appear to rely on the fact that, in the sequences drink milk, drink water, and drink beer, drink systematically correlates with words that in turn covary with the consumption of fluids, unlike eat in eat apple, eat banana, and eat chicken.
Given the discriminative nature of learning, it follows that exposure to samples containing this kind of systematic covariance structure will lead to the extraction of clusters (subcategories) of items that are less discriminated from other items occurring in the same covarying contexts than from unrelated items [3]. Further, there is an abundance of evidence that patterns of systematic covariance of this kind provide a great deal of information, not only at the lexical level (where semantically similar words typically share covariance patterns) but also at the grammatical level [3]. For example, in English, different subcategories of verbs can be discriminated by the extent to which they share argument structures with other verbs. The way that verbs co-occur with their arguments provides a level of systematic covariance that nouns appear to lack [14]. For instance, the following sentences would be considered grammatical:
- 1. John murdered Mary’s husband.
- 2. John ate Mary’s husband.
- 3. John chewed Mary’s husband.
However, the following sentence would not be considered grammatical:
- 4. John ran Mary’s husband. (*)
One reason for this difference is that chew, eat, and murder share a similar pattern of argument structures (they covary systematically) in a way that run does not. In contrast, the kind of grammatical context that predicts a noun (a noun phrase) appears to allow any noun, leaving the sentence grammatical irrespective of the noun’s likelihood (although, obviously, these likelihoods will vary widely according to context).
- 5. John ate.
- 6. John ate cheese.
- 7. John ate cheese slowly with a toothbrush.
In other words, the systematic covariance of verbs in their argument structures appears to constrain their distribution in context far more than is the case for nouns.
- 8. Mary loved. (*)
- 9. Mary loved cheese.
- 10. Mary loved cheese slowly with a toothbrush. (*)
Accordingly, the distributional patterning of verbs appears to reduce uncertainty not only about the lexical properties of upcoming parts of a message but also about its structure. In other words, because verbs take arguments, there ought to be less variance in their patterns of covariation, and this ought to lead to less overall uncertainty in the context of verb arguments. Consistent with this, Seifart et al. [15] report that slower articulation and more disfluencies precede nouns than verbs across languages, raising further questions about the kind of information that is communicated by variational patterns in speech and, in particular, whether and to what degree this kind of sublexical variance actually serves a communicative function.
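One way to make this intuition concrete is to compare the uncertainty about what follows a verb with the uncertainty about what follows a noun in a tagged sample. The sketch below (Python) uses toy bigram counts invented purely for illustration; with real data, the counts would come from a part-of-speech-tagged corpus.

```python
from collections import Counter
from math import log2

# Toy tagged bigrams (current POS, next POS); counts are invented for illustration.
bigram_counts = {
    ("VERB", "DET"): 40, ("VERB", "NOUN"): 35, ("VERB", "PREP"): 15,
    ("VERB", "ADV"): 10,
    ("NOUN", "VERB"): 25, ("NOUN", "PREP"): 20, ("NOUN", "CONJ"): 15,
    ("NOUN", "ADV"): 10, ("NOUN", "DET"): 10, ("NOUN", "ADJ"): 10,
    ("NOUN", "PRON"): 10,
}

def conditional_entropy(counts, given):
    """H(next category | current category == given), in bits."""
    cont = Counter({nxt: c for (cur, nxt), c in counts.items() if cur == given})
    total = sum(cont.values())
    return -sum((c / total) * log2(c / total) for c in cont.values())

for pos in ("VERB", "NOUN"):
    print(f"H(next | {pos}) = {conditional_entropy(bigram_counts, pos):.2f} bits")
# With these toy counts, the context following a verb is more constrained
# (lower entropy) than the context following a noun.
```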
In the next section, we review some evidence that suggests the interactions observed between uncertainty and articulatory variation may indeed be functional.
1.2. Sublexical Variation in Context
It is well established that isolated word snippets extracted from connected speech tend to be surprisingly unintelligible outside of their context. By contrast, when reduced variants are presented to listeners in context, they are able to identify the word without difficulty and report hearing the full form [16]. Consistent with this, the effect of frequency on speech production has been shown to be inconsistent across registers, speakers, lexical classes, and utterance positions, and there are opaque interactions between context, lexical class, and frequency range.
At first blush, these inconsistencies would appear to limit the scope of functional accounts of speech sound variance, and to date, the effects that are stable enough to be taken as evidence for functional theories are mostly to be found in preselected content words from the mid-frequency range, such that the effects reported rarely align with effects observed in the remaining (by token count, significantly larger) parts of the distribution.
For example, while function words, high frequency discourse markers, and words at utterance boundaries account for the largest portion of variance in speech, their exclusion from the analysis of speech sound variance is such a common practice that it might be considered a de facto standard [17]. Against this background, it is noteworthy that Bell et al. [18] report that the extent to which articulation is affected by frequency and by the conditional probability of collocates diverges between function and content words across frequency ranges. While duration in content words is well predicted by the information provided by the following word but not the preceding word, the effect decreases as frequency increases and shows the reverse pattern in function words. Similarly, van Son and Pols [19] report a reversal in the correlation between reduction and segmental information in low-information segments and in segments at utterance boundaries. The effect of information content is reported to be limited by a hard floor in high-frequency segments; that is, the most frequent segments fail to support the hypothesis. Standardizing the exclusion of such misfits is controversial, especially given that they outnumber the tokens that are typically taken to confirm a hypothesis and account for the largest part of variance in speech [20,21].
The seemingly random and noisy variance in the speech signal appears systematically correlated with uncertainty about the upcoming part of the message. For example, vowel duration in low- and mid-frequency content words is correlated with the information provided by the upcoming word [18]. Words in less predictable grammatical contexts are on average longer and more disfluent [22]. These fluctuations in duration and sequence structure have been shown to inform listeners’ responses. For instance, the duration of common segments in word stems differs between singular and plural forms [23]. Listeners appear to use acoustic differences in word stems as a cue to grammatical context (the plural suffix), and incongruence between segmental and durational cues leads to delayed responses in both grammatical number and lexical decision tasks [24]. Similar effects occur at many other levels of description; for example, disfluent instructions (the ... uhm ... camel) lead to more fixations to objects not predicted by the discourse context [25] and facilitate prediction of unfamiliar objects [26].
The occurrence of silent and filled pauses has been shown to contribute to the perception of fluency [27] and intelligibility [28] as well as to improved recall [29]. Importantly, however, neither artificially slowed-down speech samples nor samples modified by the insertion of pauses are perceived as more fluent or intelligible; indeed, in both cases, these manipulations have been shown to impair performance [30]. Accordingly, the fact that listeners easily interpret reduced sequences from context yet reject speech artificially altered to mimic completeness and fluency indicates that hearers are highly sensitive to violations of their expectations about how natural speech should sound, not that they have a preference for complete, slow, and extreme articulation. However, despite the evidence that sublexical variation shapes listeners’ expectations about upcoming content, its contribution to successful communication as an informative part of the signal has remained relatively unexplored to date.
However, it is clear that any quantification of the communicative contributions of sublexical variations in context will depend on a consistent definition of context. That is, in order to address the extent to which the quality of articulation and the observed variance in the signal interact with the remaining uncertainty about the message in general terms, it is necessary to first formalize a consistent subset of higher-level abstractions that systematically covary in the degree to which they contribute to uncertainty reduction. The contrast between these subsets can then allow these effects to be analyzed independent of the specific context of any given utterance.
1.3. The Present Study
In comparison to written language, speech often appears to be messy. Instead of the well-formed word sequences that characterize text, spontaneous speech sequences are typically interrupted by silent and filled pauses, left unfinished, depart from word-order conventions, frequently miss word segments or whole words, and rely on clarifying feedback which tends to be short and grammatically incomplete. In consequence, the token distributions that underlie the information structure of written and spoken language differ substantially.
For instance, nouns are less lexically diverse in spoken English than in writing (based on measures derived from the Corpus of Contemporary American English (COCA)), whereas English adjectives tend to be more lexically diverse in speech. While reading and writing are self-paced, speech gives both speakers and hearers less control over timing. This suggests that the moment-to-moment uncertainty experienced in communication may differ in speech as compared to written language, and it may be that more effort is invested in uncertainty reduction in spoken than in written language. From this perspective, the increase in the lexical variety of prenominal adjectives, which in English reduce uncertainty about upcoming nouns [31], might be functional in that it may help manage the extra uncertainty of spoken communication. This raises the question of the degree to which these and other variational changes in spoken English are indeed informative and systematic.
These considerations also suggest that the results of previous analyses of the distributional structure of lexical variety in communicative contexts conducted on text corpora can offer only indirect support when it comes to answering questions about the communicative properties of speech. To address this shortcoming, we conducted a corpus analysis of conversational English [32] to explore the extent to which the distribution and the underlying structure of the grammatical contexts in which words are embedded interact with the speech signal variation observed across lexical categories. The goal of this analysis was to explore the structural properties of grammatical regularities in speech and their effect on the distributions of the lexical and sublexical contrasts that they discriminate between.
The analysis was conducted in two stages. Part one, presented in Section 3, addresses the distribution of grammatical and lexical contrast in speech and aims to answer the following questions:
Are distributions of grammatical regularities in speech sampling invariant?
How do recurrence patterns of grammatical categories and speech sequences inform learning?
Are the distributions of subcategorization frames and of the types they distinguish between geometric?
Part two of our analysis, presented in Section 4, assesses the concrete consequences of the sublexical variation observed in the speech signal and relates these to the results presented in Section 3, addressing the following questions:
Are the inconsistent effects of frequency on speech sound variation across categories correlated with structural and distributional aspects of the grammatical and lexical contexts they populate?
Finally and perhaps most importantly, is the resulting sublexical variance systematic?
4. From Information Structure to Speech
The results we have described so far suggest that the structure of speech serves to facilitate efficient message transmission over multiple nested levels of description. The distribution of lexical and grammatical contrasts indicates that information structure depth increases over message sequences, supporting gradual increases in the degree to which low-level sublexical contrasts contribute to resolving uncertainty about a message. Consistent with this, it has been shown experimentally that speech rates are perceived as being faster and target words as being longer when cognitive load is increased [47], a response pattern that suggests that speakers adapt their responses to the relative uncertainty resulting from utterance context.
The notion that the timescale variance captured in speaker and listener performances reflects adaptation to uncertainty is further supported by evidence showing that sublexical variation in speech sequences increases with sequence length, a phenomenon characterized by the strengthening of word initial consonants and the lengthening of final vowels. While both effects increase cumulatively as a function of utterance length [48], the interaction between lengthening and strengthening is weak, indicating that hyperarticulation and vowel space expansion are not equally affected by context. Moreover, while low-probability and word initial segments are more likely to be stressed, and while segment deletion is more likely in high-frequency phonemes and in later positions, the frequency effects actually observed in very frequent segments depart from this pattern. In addition, the correlations between duration and extreme articulation and between duration and frequency decline as a function of utterance position [49].
The analyses presented in Section 3.4 indicate that average grammatical uncertainty peaks in words that are more likely to occur in utterance initial positions and that average lexical uncertainty peaks in categories that are more likely to occur in utterance final positions. It has also been shown that slow-downs in articulation are associated with uncertainty and that uncertainty leads to an increase in articulatory variance. These effects have been observed both within [23] and across word boundaries [18], arise as a consequence of syntactic irregularities [22], and appear functional in lexical decision [24] and discourse [26]. Since our analyses show substantial differences across parts of speech in both the extent to which words are predicted by the previous context and the extent to which they serve to predict the upcoming part of the message across the frequency range, this seems to imply that the apparently inconsistent effects of frequency observed previously are both predictable and systematic with respect to the structure of the grammatical context.
This in turn can be taken to suggest that sublexical variance follows from an increase in the lexical and grammatical variety in which words are embedded and that the variants we observe serve to increase the efficiency of transmitting informative contrast at multiple levels of description. In the next section, we conduct a statistical analysis of the effects of variation in the collocations of words on the number of distinct forms found in the speech corpus.
4.1. Effects of Frequency and Collocate Diversity on Variation
4.1.1. The Distinct Effects of Collocate Diversity and Frequency
Wedel and colleagues have shown that the number of competing minimal pairs in a lexical context predicts the likelihood of vowel merger [50] and voice onset time duration [17], suggesting that what drives the loss of speech contrasts is the extent to which minimal pair competition is resolved in context. In line with this, Piantadosi et al. [51] observe that the relative probability of a word in a lexical context (defined as word sequences of two to four words) is a far better predictor of word length than word frequency.
This raises further questions: Does this hold for variance too? Is the diversity of collocate contexts across which a word appears a better predictor of the extent to which a type will vary across a speech sample than frequency?
The probability of a known word appearing in a previously unattested context increases with the average word count, so that word frequency and collocate diversity are strongly correlated (, ). High-frequency words are more likely to be preceded by a larger number of different words and thus tend to appear across a larger number of communicative contexts that vary in size. By implication, there is more variance in the conditional probability between high-frequency words and their collocates. In contrast, words from the mid-frequency range appear in a smaller number of distinct communicative contexts, leading to less variance in the conditional probability between mid-frequency words and their collocates. In line with this, an analysis by Arnon and Priva [52] shows that, in contrast to the results reported by Bell et al. [18], duration in content words is affected by both word and multiword frequency as well as by the transitional probability of both the preceding and following collocates when high- and low-frequency trigrams, sequences interrupted by pauses, and word final sequences are excluded from the analysis. Finally, the increase in lexical diversity over utterance length (Section 3.3) suggests that low-frequency words tend to appear in a larger number of distinct message contexts, again leading to more variance in the conditional probabilities of low-frequency words at different positions within the sequence with respect to the likelihood of the message.
The discriminative nature of learning predicts that this variance will increase within-context competition over exposure time and that this will minimize the informativeness of contextual cues that predict a large number of lexical contrasts. This in turn predicts more sublexical variation in words that serve as cues to a larger number of collocates, reflecting the uncertainty of the relative context. Taken together, these factors predict distinct patterns of variance across frequency ranges.
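The predictor set used below can be assembled from a table of aligned tokens. The sketch that follows (Python) assumes a hypothetical record format of (word, preceding word, observed pronunciation); the field names are illustrative, and the normalization applied in the reported analysis (log-normalized counts per corpus size) is omitted for brevity.

```python
from collections import defaultdict
import math

# Hypothetical token records: (word, preceding_word, observed_pronunciation).
tokens = [
    ("and", "cats", "ae n d"), ("and", "go", "en"), ("and", "dogs", "n"),
    ("cheese", "the", "ch iy z"), ("cheese", "ate", "ch iy z"),
]

freq = defaultdict(int)
preceding = defaultdict(set)
variants = defaultdict(set)

for word, prev, pron in tokens:
    freq[word] += 1
    preceding[word].add(prev)   # collocate diversity: distinct preceding words
    variants[word].add(pron)    # distinct articulated forms observed

rows = [
    {
        "word": w,
        "log_freq": math.log(freq[w]),
        "log_collocate_diversity": math.log(len(preceding[w])),
        "n_variants": len(variants[w]),
    }
    for w in freq
]
print(rows)
```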
4.1.2. Results
To explore the nonlinear effects of frequency and collocate diversity on the observed variance, we fitted generalized additive mixed models (GAMMs) [35] using the mgcv package for R. In baseline model 1, we model the normalized number of observed corpus variants as a function of a smooth over log frequency. In baseline model 2, we model the number of variants as a function of a smooth over collocate diversity, the log normalized number of distinct preceding words observed in the corpus.
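The reported models were fitted with mgcv in R and include structure (e.g., random effects) not reproduced here. As a rough, hypothetical analogue only, the Python sketch below uses the pygam package and synthetic data to show how the three model specifications relate; it is not the analysis pipeline used in this study.

```python
import numpy as np
from pygam import LinearGAM, s

# Synthetic stand-in data (illustration only): correlated frequency and
# collocate-diversity predictors, plus a variant-count-like response.
rng = np.random.default_rng(1)
log_freq = rng.normal(5.0, 2.0, 500)
log_div = 0.8 * log_freq + rng.normal(0.0, 0.5, 500)
n_variants = 2.0 + 0.5 * log_div + rng.normal(0.0, 0.3, 500)
X = np.column_stack([log_freq, log_div])

m1 = LinearGAM(s(0)).fit(X[:, [0]], n_variants)   # baseline 1: smooth over frequency
m2 = LinearGAM(s(0)).fit(X[:, [1]], n_variants)   # baseline 2: smooth over diversity
m3 = LinearGAM(s(0) + s(1)).fit(X, n_variants)    # combined model

for name, m in (("model 1", m1), ("model 2", m2), ("model 3", m3)):
    print(name, "AIC:", round(m.statistics_["AIC"], 1))
m3.summary()  # per-term significance and fit statistics
```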
Model 1 shows a strong, nonlinear effect of frequency (). It yields an R² of 0.435 and explains % of the deviance in the data (, ). Model 2 shows a strong, nonlinear effect of the diversity of collocates in the preceding position (), explaining 74.6% of the variance in the data (, , ).
We assessed the goodness of fit of both models by the Akaike Information Criterion (AIC). Model 2 improved the score by . To contrast the contributions of both predictors, we modeled word variance as a function of smooths over log normalized word frequency and log normalized collocate diversity in a combined model 3. Model 3 (, ) reduced the AIC by 228. Both predictors are highly significant ().
Interestingly, the plots show that the frequency effects predicted by baseline model 1 and the combined model diverge substantially across frequency ranges (see Figure 7a,c), suggesting that the effect of frequency is largely overestimated in the low- and mid-frequency ranges by the baseline model. It further appears that a large part of the frequency effect is confounded by the correlation between word frequency and the number of collocate contexts a word appears within. There remains, however, a strong effect of frequency observable in high-frequency words. The high-frequency part of the data behind this effect comprises 82 function words, 57 nouns, and 47 verbs, representing 69%, 1%, and 2% of unique types, respectively.
Word frequency appears to influence the extent to which a word varies in form only in high-frequency words and thus holds for type variation across lexical categories only to the degree to which the category is represented in the high-frequency tail of the word distribution. We further observe a stronger correlation between collocate diversity and word frequency in function words (, ) than in verbs (, ) and nouns (, ).
Finally, we fitted a set of combined models, adding the log number of distinct parts of speech following each word for all words (model 4) and adding lexical category as a covariate factor (model 5). In model 4, we observe a fairly weak effect of frequency () (see Figure 8a), while the effects of the context predictors were strong. The AIC score is reduced by 531.
In model 5, the introduction of lexical category as a covariate further reduces the AIC score by 254 points and explains () of the deviance observed in the data. The effect of frequency is not significant in verbs () or function words () and is statistically significant but weak in nouns (). Again, all contextual predictors are highly significant in function words, nouns, and verbs. The same pattern was observed for all of the other analyzed categories, with the following exceptions: filled pauses and numbers show an effect of frequency (); contractions are unaffected by collocate diversity (); and there is no interaction between modal verbs and the upcoming collocate context (). Modals, numbers, and contractions comprise of the analyzed data set. We observe differences in the effect of preceding collocate diversity between verbs and nouns in that the effect and the confidence interval both increase linearly in nouns, while the effect levels off in high-frequency verbs, showing an increase in variance.
A closer examination of the data reveals that the relationship between word frequency and collocate diversity differs significantly across frequency ranges for verbs and nouns. Collocate diversity is much higher in high-frequency verbs and function words than it is in high-frequency nouns. There is also far more variance in the effect in high-frequency nouns.
4.1.3. Discussion
The results of this analysis align with the finding that word counts outside of their communicative context contribute little when it comes to explaining variation in articulated forms. Rather, we observe that the largest part of this variance is explained by the diversity of the lexical contexts in which words appear. The remaining effects of frequency are limited to a relatively small number of high-frequency nouns and words from closed categories (numbers, contractions, and filled pauses).
These results are thus consistent with the differences we find in the distributional patterns of lexical categories in that, unlike high-frequency nouns, high-frequency verbs appear far less likely to be encountered outside of their argument frames (supporting the idea that verbs are encountered as parts of argument frames rather than as lexical items per se).
Given that our results show that the variance in the observed forms is largely explained by the covariance in the collocate structure and that patterns of covariance are systematic, this finally leads us to the question of the systematicity in the sublexical variance: Is the distribution of the observed contrast geometric?
4.2. Distribution of Word Initial Contrast
4.2.1. Why Word Initial Contrast?
Previous work on sublexical variation shows that the structure of speech sound sequences is such that the probabilities of speech segments at segment transitions are not independent [49]. Gating paradigm studies have shown that the informativeness of word medial contrast is mediated by the extent to which both the preceding sentence context and word initial phonetic contrasts have already minimized uncertainty about the word [53,54]. Accordingly, the entropy of sublexical contrast peaks at word initial boundaries [49]. This suggests that word initial speech contrasts may serve a distinct communicative function in context.
An initial analysis of word initial phonetic label distributions over both observed and citation forms in the corpus revealed poor fits to both power law and exponential distributions, suggesting that the aggregated distribution of the phonetic labels observed in our corpus may result from mixing the underlying communicative distributions. To examine this, we used parts-of-speech classes to provide a simple, objective method for contextually disaggregating individual communicative distributions from the mixed distribution of phonetic labels in our corpus.
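A simple way to compare the two candidate distributions, in the spirit of the analysis reported below, is to regress log frequency on rank (linear for a geometric distribution) and on log rank (linear for a power law) within each category and compare the fits. The sketch below (Python) uses hypothetical phone label counts per part-of-speech category; the category names and counts are placeholders, not corpus values.

```python
import numpy as np

def fit_r2(x, y):
    """R^2 of an ordinary least-squares line y ~ x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

def compare_fits(counts):
    """Geometric predicts log(freq) linear in rank; a power law predicts
    log(freq) linear in log(rank). Return both R^2 values."""
    freqs = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    logf = np.log(freqs)
    return fit_r2(ranks, logf), fit_r2(np.log(ranks), logf)

# Hypothetical word initial phone label counts per POS category.
phone_counts_by_pos = {
    "NOUN": [900, 610, 420, 280, 190, 130, 90, 60, 40, 25],
    "VERB": [850, 560, 380, 250, 170, 110, 75, 50, 35, 20],
}
for pos, counts in phone_counts_by_pos.items():
    r2_geom, r2_pow = compare_fits(counts)
    print(f"{pos}: R^2 geometric = {r2_geom:.3f}, R^2 power law = {r2_pow:.3f}")
```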
4.2.2. Results
The frequency distributions of word initial phone labels were analyzed by parts-of-speech category, considering the observed forms as the empirical distribution and the citation forms as its model counterpart. Overall, both the empirical and the model distributions of phone labels show a better fit to a geometric than to a power-law distribution (Figure 9). However, while the fits to the geometric in the model distribution show a larger departure from linearity and large differences in slope between different parts of speech, the observed phones across categories converge on nearly identical distributions with close fits to the geometric (Table 1).
In both function and content words, the empirical distributions significantly improve the fit to a geometric. Importantly, despite substantial differences in the type/token ratios of the lexical classes analyzed, all of the categories have nearly identical empirical distributions with minimal differences in slopes. The exception is plural proper nouns, where the data are extremely sparse (this category comprises a mere 50 tokens). Further, while we find that initial phones from several small categories (particles, modals, and filled pauses) have poor fits to either a geometric or a power law, it is debatable whether these small sets of items constitute separate categories in terms of the covariate structures they populate.
Finally, we extracted time bins of initial phone duration, centered by phone category, to simulate an artificial set of discrete contrasts, such that the simulation assumes a low-level subcategorization of phonetic contrast by duration. Again, across the parts-of-speech categories, the cumulative probability distributions of the time bins show close fits to the geometric () and poor fits to the power law ().
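A hypothetical version of this binning step might look as follows (Python): durations are centered within each phone category, discretized into fixed-width bins (10 ms here, an arbitrary choice), and the resulting bin frequencies form a discrete contrast set whose distribution can then be tested in the same way as the phone labels above. The simulated durations and labels are placeholders, not corpus measurements.

```python
import numpy as np
from collections import Counter, defaultdict

# Hypothetical (phone_label, duration_in_seconds) pairs for word initial phones.
rng = np.random.default_rng(2)
tokens = ([("dh", d) for d in rng.gamma(2.0, 0.02, 400)]
          + [("ae", d) for d in rng.gamma(2.5, 0.03, 300)])

by_phone = defaultdict(list)
for phone, dur in tokens:
    by_phone[phone].append(dur)

bin_counts = Counter()
for phone, durs in by_phone.items():
    centered = np.asarray(durs) - np.mean(durs)   # center within phone category
    bins = np.round(centered / 0.01).astype(int)  # 10 ms duration bins
    bin_counts.update(bins.tolist())

# Rank the bins by frequency; their cumulative probabilities can then be
# compared against geometric and power-law expectations as above.
freqs = sorted(bin_counts.values(), reverse=True)
cum = np.cumsum(freqs) / sum(freqs)
print(freqs[:10])
print(np.round(cum[:10], 3))
```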
4.2.3. Discussion
Our analysis of word initial phonetic labels across different parts-of-speech categories confirms that they are geometrically distributed. The distribution of duration time bins is also geometric. These results thus suggest that what might appear to be random variance in the production of speech sounds may actually reflect a highly systematic distribution of sublexical contrasts.
While word initial variance is observable in all parts-of-speech categories, we find that the extent to which tokens vary is closely correlated with uncertainty, which is modulated by the underlying structure of the category. Importantly, despite large differences in the extent to which initial tokens deviate from the citation form, the probability distributions of tokens arising from this variance converge on nearly identical distributional properties across parts of speech.
Finally, we observe that the distribution of word initial phones assumed by the dictionary models shows poor fits to both the geometric and the power law, illustrating that, unlike aggregated lexical contrasts, mixtures over closed sets of structurally similar items do not result in power laws. Instead, the distributions we observe are characterized by fast growth in the mid-frequency range.
5. General Discussion
We analyzed the distributions of grammatical, lexical, and sublexical variety in spontaneous conversational speech produced by 40 speakers of American English [32] to assess the effects of the statistical structure of speech on the sublexical variance observed in the signal. Our results show that the distributions of regularities in co-occurrence patterns, of the lexical contrasts they discriminate between, and of the sublexical variety observed in the articulated forms are consistent with previous, similar analyses of written English and satisfy many of the communicative constraints described by information theory [3].
Accordingly, these results also provide further evidence that the power law distributions seen in aggregate word frequency distributions are the product of mixing functionally relevant distributions that are in themselves geometric [3,8].
The distributions in the analyzed sample suggest that, unlike the codes of artificial communication systems, human speech is a highly structured system of nested communicative distributions shaped by learning. In line with the predictions of learning theory, this suggests that speech variation at positions of high uncertainty is driven by the interaction of regular structures at multiple levels of description and that this variance serves to increase the efficiency of communication by increasing the amount of contrast in signals.
Taken together, our results indicate that the variance in the pronounced forms systematically structures the uncertainty discriminated by communicative contexts, supporting the suggestion that empirical distributions of phonetic contrasts in speech are components of a larger, highly structured communication system.