Intelligibility describes how well a listener can understand a speaker's intended message. There are many contributors to intelligibility that have been described in the literature (see Hustad & Borrie, 2021; Kent, 1993). However, in the simplest sense, there are three clear sources of variation in intelligibility: speakers, listeners, and words (message). Focusing on children, our group and others have investigated how speaker-level characteristics affect intelligibility with age (Hustad et al., 2021) and in different clinical populations (Mahr et al., 2020; Wild et al., 2018). In hearing research, speech-in-noise studies and research on the cocktail party effect have examined intelligibility in different listening or masking conditions (Bronkhorst, 2000; Van Engen et al., 2014). The purpose of this study was to examine the third component: How do word-level or lexical characteristics affect intelligibility?
It is a cornerstone of speech perception research that similar-sounding words compete with one another during spoken word recognition. Different kinds of word similarities have been investigated over the decades (Vitevitch & Luce, 2016). Marslen-Wilson and Welsh (1978) postulated that the first 100–200 ms of a word activates a cohort (words that share the same initial one to two phonemes, such as whistle : wizard), and the cohort is whittled down as more information arrives. The TRACE model (McClelland & Elman, 1986) allowed any words that shared overlapping phonemes to compete for activation. Notably, the model predicted activation of cohorts and rhymes (e.g., lizard : wizard), a finding that was later confirmed in eye-tracking studies (Allopenna et al., 1998; McMurray et al., 2010).
Luce and Pisoni (1998) looked at similarity more broadly by studying phonological neighborhoods (also called similarity neighborhoods). Indeed, their neighborhood activation model boils down to an equation of the form:

p(identify target) = [frequency(target) × similarity(target)] / Σ over neighborhood [frequency(word) × similarity(word)]

where "similarity" is some measure of the phonetic similarity of a candidate word to the spoken target and where the similarity neighborhood is the set of words that are similar to the target (which includes the target word itself). The number of words in this neighborhood is called the
neighborhood density. This equation implies, holding all other things equal, that higher frequency words should be more recognizable than lower frequency ones, words with more neighbors (denser neighborhoods) should be less recognizable than words with fewer neighbors (sparser neighborhoods), and words with high-frequency neighbors should be less recognizable than words with low-frequency neighbors. Because both frequency and neighborhood density contribute to the denominator in the equation, this number is best thought of as the overall “competitiveness” of a neighborhood.
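To make the arithmetic of this equation concrete, here is a minimal Python sketch of the neighborhood activation model's recognition probability. The words, similarity scores, and frequencies below are invented for illustration; they are not values from the study or from Luce and Pisoni (1998).

```python
# Sketch of the neighborhood activation model's equation (hypothetical values).

def nam_probability(target, neighborhood):
    """p(target) = sim(target) * freq(target) divided by the sum of
    sim(w) * freq(w) over every word w in the similarity neighborhood
    (the target is a member of its own neighborhood)."""
    denominator = sum(sim * freq for sim, freq in neighborhood.values())
    sim_t, freq_t = neighborhood[target]
    return (sim_t * freq_t) / denominator

# Hypothetical neighborhood for "sting": (similarity to input, frequency).
hood = {
    "sting": (1.00, 12.0),   # the target itself
    "string": (0.60, 40.0),
    "swing": (0.55, 25.0),
    "sing": (0.50, 80.0),
}
p_sting = nam_probability("sting", hood)
```

Because the low-frequency target competes against frequent, similar neighbors, its predicted recognition probability is small even though it matches the input perfectly, which is exactly the "competitiveness" intuition described above.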
What defines a similarity neighborhood? Luce and Pisoni (1998) based their similarities on phoneme confusion probabilities from an earlier experiment. A simpler and more common approach is to define a word's neighbors as the words that differ by the addition, substitution, or deletion of a single phoneme (more formally, a Levenshtein distance or string-edit distance of 1). For example, the 20-plus-word neighborhood for sting would include additions such as "st(r)ing" and "s(i)tting," substitutions such as "s(w)ing" and "st(u)ng," and deletions such as "s()ing" or "()ting."
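The one-phoneme definition can be sketched directly as code. The toy lexicon below uses made-up phoneme tuples rather than a real pronouncing dictionary, but the distance computation is the standard dynamic-programming edit distance.

```python
# Sketch: neighbors are words at Levenshtein (string-edit) distance 1,
# i.e., one phoneme added, deleted, or substituted.

def edit_distance(a, b):
    """Classic dynamic-programming edit distance over phoneme sequences."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # addition
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1]

def neighbors(target, lexicon):
    return [w for w in lexicon if w != target and edit_distance(target, w) == 1]

sting = ("s", "t", "ih", "ng")
lexicon = [
    ("s", "t", "r", "ih", "ng"),  # "string": one addition
    ("s", "w", "ih", "ng"),       # "swing": one substitution
    ("s", "ih", "ng"),            # "sing": one deletion
    ("t", "ay", "p"),             # "type": not a neighbor
]
sting_neighbors = neighbors(sting, lexicon)
```

This is the same distance-1 criterion that tools such as the LexFindR package implement at scale over a full pronouncing dictionary.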
What empirical support is there that these lexical dimensions are relevant for speech intelligibility? Luce and Pisoni (1998) measured the intelligibility of words under different amounts of noise. Their neighborhood activation model predicted the highest intelligibility for high-frequency words in less competitive (sparser and lower frequency) neighborhoods, lowest intelligibility for more competitive (denser and higher frequency) neighborhoods, and intermediate intelligibility for the other two cases (low-frequency target with low competition and high-frequency target with high competition). Their intelligibility results confirmed these predictions.
Word frequency and neighborhood effects have been demonstrated in adults with speech motor disorders.
Chiu and Forrest (2018) compared word frequency and neighborhood density in 12 speakers with Parkinson's disease and in healthy controls. High-frequency words were consistently more intelligible than low-frequency words, for both speaking groups and both listening conditions. The advantage for low-density neighborhoods over high-density neighborhoods held only for high-frequency words when presented in noise.
Lehner and Ziegler (2021) examined whether several lexical features predicted the intelligibility of German single words in dysarthric speech. Examining a large sample of words (2,700 tokens of 2,165 words) from a sample of 100 speakers, they found a positive effect of frequency such that more frequent words had a higher expected intelligibility. They also found a negative effect of neighborhood density such that targets with more lexical competitors had a lower expected intelligibility. Finally, their results revealed a negative effect of neighborhood frequency such that targets with higher frequency neighbors had a lower expected intelligibility. Lehner and Ziegler's study found a strong interaction between the target word frequency and neighborhood frequency where the target frequency effect was diminished for words with low-frequency neighbors. That is, the benefit of being a high-frequency word is absent when the average neighborhood competitor is a low-frequency word. As neighborhood frequency increases, target frequency matters more and has a larger effect on intelligibility. This interaction follows from the predictions of the neighborhood activation model.
In addition to the two lexical dimensions (word frequency and neighborhood competition) discussed thus far, we consider two other lexical features in this study. The first is phonotactic probability, or the notion that some sequences of phonemes in a language are more probable than other sequences. For example, "dwell" is a much less probable word than "drill" because the /dw/ biphone is much less common than /dr/ and because /w/ is less common in the second phoneme position than /r/. Phonotactic probability is especially important for studies with nonwords or novel words, where a processing advantage is found for more typical/probable-seeming nonwords. For real words, increased phonotactic probability confers a processing disadvantage because phonotactic probability is correlated with neighborhood density (a highly probable word will be phonologically similar to many other words and thus face higher lexical competition). Vitevitch et al. (1999) explain these contradictory effects of phonotactic probability by noting that nonwords are not words—they do not compete for activation and are spared from lateral lexical inhibition.
The final lexical feature we consider is articulatory complexity. Research with speakers with dysarthria has found that the motor demands of a word can affect its intelligibility. In these studies, words are assigned a score or rating based on the articulatory demands of the word. In a sample of seven adults with cerebral palsy, Kim et al. (2010) found that higher complexity consonants were more likely to be incorrectly articulated than lower complexity consonants. Allison and Hustad (2014) found that for eight 5-year-old children with cerebral palsy and dysarthria, sentences with lower consonant complexity were more intelligible than sentences with greater consonant complexity. Kuruvilla-Dugdale et al. (2018) compared the intelligibility scores for eight adults with amyotrophic lateral sclerosis and eight healthy controls. They found that speakers with mild dysarthria (n = 4) and healthy controls did not have significantly different intelligibility scores on low complexity words, but the groups did significantly differ on the high complexity words.
For Lehner and Ziegler (2021), articulatory complexity was the most important predictor of intelligibility. In this German-language study, the complexity effect worked in the opposite direction compared with English-language studies: Words with higher complexity had a higher expected intelligibility score. They attribute this finding to two factors: (a) A high complexity word would still have acoustically salient features when produced by a speaker with dysarthria, and (b) high complexity words have fewer lexical neighbors, which would make them more intelligible for listeners.
In this study, we examined whether the lexical features outlined above—word frequency, neighborhood competition, phonotactic probability, and articulatory complexity—predicted intelligibility in 30- to 47-month-old typically developing children. Specific research questions were as follows.

1. How much variability in intelligibility exists among children and among the words?

2. How well do the lexical features based on the neighborhood activation model predict average intelligibility? Do motor complexity and phonotactic probability also predict intelligibility?

3. What percentage of incorrect listener responses were phonological neighbors or words with a higher frequency than the intended target word?
The first research question examines the main assumption of our statistical analysis approach. We applied an item response analysis, and this model assumes and quantifies two sources of variation in intelligibility scores: children (who vary in ability) and words (which vary in difficulty/easiness). Given these two sources of variation, we can then ask: Are children more variable in ability than words are in easiness? Are these two sources of variation similar? The first part of our analysis compares these two sources of variation.
The second research question examines how the lexical features of the words influence intelligibility. We predicted a facilitative effect of word frequency and a negative effect of neighborhood competition; these hypotheses follow the neighborhood activation model. We also expected a negative effect of articulatory complexity: Children in the 30- to 47-month age range, on average, are expected to have trouble articulating the later-developing (higher complexity) sounds, which should make those words less intelligible. We did not have any directional predictions regarding phonotactic probability. Prior work suggests that phonotactic probability is more relevant for nonwords than real words, so its inclusion here is exploratory, and we were interested in whether it produced any detectable effects.
The third research question checks the claims of word recognition models by examining incorrect responses. Because word recognition is driven by the competition among similar-sounding words, there is a probability that the target word will lose this competition to a neighbor. The third research question, therefore, examines the incorrect responses to see if these winning competitors are indeed phonological neighbors or have a frequency advantage over the target word. We expected that incorrect responses would be more likely to be phonological neighbors than nonneighbors.
Method
Relationship With Prior Work
The participants, speech samples, and listeners described below were previously reported in detail in Hustad et al. (2020), so we provide a somewhat abbreviated description of them here. The prior study examined the development of aggregated intelligibility scores during the age range of 30–47 months. Specifically, we estimated age percentiles for intelligibility scores, examined whether boys and girls had different growth patterns (no), and assessed whether and at what age mean intelligibility scores differed for single-word versus multiword utterances (yes, a multiword advantage emerges around 45 months). In contrast, this study examines disaggregated intelligibility scores, specifically whether lexical features of individual words influence intelligibility.
This study was reviewed and approved by the University of Wisconsin–Madison Institutional Review Board (Social and Behavioral Sciences). A parent or legal guardian provided informed consent for all child participants.
Participants
Typically Developing Children
Participants included 165 children (72 boys and 93 girls) between 30 and 47 months in age. Children were evenly distributed across the overall age range: 57 between 30 and 35 months, 51 between 36 and 41 months, and 57 between 42 and 47 months. We chose this overall age range because children show substantial variability in intelligibility around 4 years of age, and these three age bins provide a descriptive device to illustrate coarse 6-month changes in age. Children had no history of speech or language concerns and scored within normal limits on a speech articulation assessment.
Unfamiliar Adult Listeners
Each child had their speech productions transcribed by two adult listeners. These listeners included 92 men and 238 women, and they were predominantly students from the university community, with a mean age of 20;6 (years;months; SD = 3;8). The listeners passed a hearing screening; were native speakers of American English; and did not have self-reported language, learning, or cognitive disabilities.
Materials and Procedure
Speech Samples and Intelligibility Responses
The speech samples were collected in a structured repetition task based on the TOCS+ (Hodge & Daniels, 2007). Children were presented with an image and a prerecorded prompt, and they were instructed to repeat what they had heard. Prompts included 40 single words; two words were reserved for practice trials, so we analyzed the 38 test items.
Unfamiliar listeners transcribed the children's productions in a sound-attenuated booth with an external speaker calibrated to 75 dB SPL underneath the computer screen; they were played samples and instructed to type the words the child said. Every child had transcriptions by two listeners; each listener only heard productions from one child. Each trial of the listening experiment consisted of a single presentation of one speech sample, after which the listener transcribed the production. During the listening experiment, some trials (speech samples) were randomly repeated, but we used only the first transcription of an item. A production was intelligible if a listener correctly transcribed the word or a homophone. Interrater reliability was high for pairs of listeners on children's average intelligibility (proportion of items correctly transcribed). We calculated interrater reliability using the intraclass correlation coefficient (ICC) estimated with the irr R package (Version 0.84.1; Gamer et al., 2019). We used an average-score, absolute agreement, one-way random-effects model, and we found substantial agreement among average ratings, ICC(2) = .931, 95% confidence interval [.906, .949].
A complete data set would yield 165 children × 38 items × 2 listeners = 12,540 trials (i.e., intelligibility responses). However, some speech samples proved to be unusable due to noise or data collection problems. We also excluded trials where the listener did not provide a transcription for an item, and we excluded samples where the child had spoken a word other than the prompted word (e.g., saying "sheep" for the prompt sheet). The number of remaining trials was 12,309. The average number of items per child was 37.5 (SD = 2.3, range = 10–38), and the average number of children per item was 162.8 (SD = 1.6, range = 158–165).
Lexical Features
Table 1 lists the 38 single words, sorted by average intelligibility, along with some of the lexical features described below. We computed orthographic word frequencies based on the same corpus of film and television subtitles that was used by the SubtlexUS word frequency database (Brysbaert & New, 2009). We chose not to use the published frequencies from that database and instead computed word frequencies anew. SubtlexUS seems to treat contractions such as can't as two separate words (can + t), so that the frequency for can is the sum of the frequencies for can (/kɛn/ and /kæn/) and can't (/kænt/); similar cases include won/won't and don/don't. We computed frequencies independently to ensure that these high-frequency, monosyllabic words were properly handled.
After combining these subtitle-based frequencies with pronunciations from the CMU Pronouncing Dictionary (cmusphinx, 2022), we computed the frequency of the phonological word forms for the items and their phonological neighbors. Recall that we treated transcriptions of the target item or any homophone as a correct response, so that "be," for instance, was an appropriate transcription of bee. Thus, the frequency we used for the item bee was the sum of the frequencies for all of the words with the phonological word form /bi/. For each target word, we used the LexFindR R package (Version 1.0.2; Li et al., 2022) to find the phonological neighbors of the items. We applied the same process of combining frequencies of homophonic words: For example, the item bad has the phonological neighbor /æd/, so the frequency of this neighbor is the sum of the frequencies for add and ad. For all stimulus prompts and neighbors with more than one pronunciation, we selected a single citation pronunciation to use for this analysis.
For the final frequency calculations, we took the base-10 logarithm of each phonological word-form frequency. The target frequency was the log frequency of the target word form, and following Magnuson et al. (2013), the "neighborhood competition" for an item was the sum of the log frequencies of each word form in the item's phonological neighborhood. (Each item was a member of its own neighborhood.) Word frequency and neighborhood competition are this study's analogues to the numerator and denominator, respectively, in the neighborhood activation model equation. Note that we did not analyze neighborhood density (the number of words in a neighborhood) as a stand-alone lexical variable in our analyses. The phonological neighborhoods produced from the large corpus often include infrequent or unfamiliar words. For example, the neighborhood for type includes tithe, tine, and stipe. Intuitively, we would like some way to treat these uncommon neighbors as less relevant than more common neighbors (e.g., time, top, and tape for type), but neighborhood density assigns all words in a neighborhood equal weight. The neighborhood competition measure described above weights each word by its frequency so that infrequent words are indeed treated as less relevant for lexical processing.
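The competition measure reduces to a simple sum of log frequencies. The sketch below illustrates it with invented corpus counts for "type" and a few of its neighbors (these are not the study's actual frequencies); note how the rare neighbors tithe and stipe contribute far less than the frequent neighbor time.

```python
import math

def neighborhood_competition(word_form_freqs):
    """Sum the base-10 log frequencies of every word form in a phonological
    neighborhood (target included), so frequent neighbors carry more weight."""
    return sum(math.log10(f) for f in word_form_freqs)

# Hypothetical corpus counts for "type" and some neighbors.
freqs = {"type": 2500, "time": 90000, "tape": 800, "tithe": 12, "stipe": 2}
competition = neighborhood_competition(freqs.values())
```

In contrast, a raw neighborhood density count would treat stipe and time as equally important competitors.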
We computed each word's total motor complexity score using the scale by Kuruvilla-Dugdale et al. (2018), based on Kent (1992). This system assigns a score to each syllable part (onset, nucleus, and coda) based on its articulatory motor demands, ranging from 1 (/ə, ɑ/) to 8 (a cluster of three consonants). Finally, phonotactic probability was computed with the IPhOD (the Irvine Phonotactic Online Dictionary) database (Vaden et al., 2009), using the base-10 logarithm of each word's average biphone probability.
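As a sketch of the biphone measure: average the probability of each adjacent phoneme pair, then take the base-10 log. The biphone probabilities below are invented placeholders, not IPhOD values.

```python
import math

def log_mean_biphone_probability(phonemes, biphone_probs):
    """Average the probabilities of each adjacent phoneme pair in the word,
    then take the base-10 logarithm of that mean."""
    pairs = list(zip(phonemes, phonemes[1:]))
    mean_p = sum(biphone_probs[pair] for pair in pairs) / len(pairs)
    return math.log10(mean_p)

# Invented biphone probabilities for "drill" /d r ih l/.
probs = {("d", "r"): 0.004, ("r", "ih"): 0.006, ("ih", "l"): 0.010}
drill_score = log_mean_biphone_probability(("d", "r", "ih", "l"), probs)
```

Because biphone probabilities are small fractions, the log score is negative, and less probable words (e.g., "dwell" with its rare /dw/ biphone) would receive more negative scores.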
Statistical Analyses
We performed a Bayesian item response analysis (Bürkner, 2021) using mixed-effects logistic regression. We used logistic regression because the dependent variable is the probability of a word being correctly transcribed. This regression model estimates the expected intelligibility of an average item for an average participant (overall intercept) and adjusts this average using each participant's ability (via by-child random intercepts) and each item's easiness (via by-item random intercepts). We included covariates to examine whether age and item-level lexical features predicted intelligibility.
Our baseline regression model included fixed effects of age (in years, centered so x = 0 at age 3 years), target frequency (on a base-10 log scale, mean-centered), neighborhood competition (sum of log-scale units, mean-centered), and a Target Frequency × Neighborhood Competition interaction. We then augmented this model to include a fixed effect of motor complexity (in complexity points, centered so x = 0 at 10 points) and an Age × Motor Complexity interaction. Last, we augmented the baseline model to include a fixed effect of average biphone probability (on a base-10 log scale, mean-centered).
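The structure of the baseline model's linear predictor can be sketched as follows. Every coefficient value here is invented for illustration; the actual models were fit with brms/Stan, not hand-picked numbers.

```python
import math

def p_intelligible(intercept, child_ability, item_easiness,
                   b_age, age_c, b_freq, freq_c,
                   b_comp, comp_c, b_freq_x_comp):
    """Inverse-logit of the baseline model's linear predictor: an overall
    intercept, by-child and by-item random effects, plus fixed effects of
    age, target frequency, neighborhood competition, and the
    frequency-by-competition interaction."""
    eta = (intercept + child_ability + item_easiness
           + b_age * age_c
           + b_freq * freq_c
           + b_comp * comp_c
           + b_freq_x_comp * freq_c * comp_c)
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative values only (centered predictors, invented coefficients).
p = p_intelligible(intercept=0.2, child_ability=0.3, item_easiness=-0.1,
                   b_age=1.0, age_c=0.5, b_freq=0.4, freq_c=1.0,
                   b_comp=-0.02, comp_c=10.0, b_freq_x_comp=0.01)
```

The by-child and by-item terms are what make this an item response model: the same fixed effects apply to everyone, while ability and easiness shift each child and each word up or down the logit scale.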
Analyses were orchestrated using the R programming language (Version 4.2.0; R Core Team, 2022). We estimated the models using the Stan language (Version 2.29.2; Carpenter et al., 2017) by way of the brms (Version 2.17.0; Bürkner, 2017), cmdstanr (Version 0.5.2; Gabry & Češnovar, 2022), and tidybayes (Version 3.0.2; Kay, 2022) R packages. The model worked on the logit (log-odds) scale, where a unit change in logits represents a substantial change in intelligibility: 0 logits = 50% intelligibility, 1 logit = 73%, and 2 logits = 88%. Therefore, we used a weakly informative prior of Normal(M = 0, SD = 1) for the effects of motor complexity, age, target frequency, and phonotactic probability. We used a Normal(0, 0.1) prior for the effect of neighborhood competition because this covariate involved larger values. These priors are "weakly" informative in that they set limits on plausible effect sizes, but they are centered at zero so that positive and negative effects remain plausible. We used a weakly informative prior of Normal(0, 1) for the child-level and word-level variances. We estimated our models using four MCMC sampling chains with 2,500 post-warmup draws per chain (10,000 posterior draws total). The models all passed available convergence diagnostics. Each posterior draw represents a plausible set of model coefficients, and we summarize these draws using the median and the 95% quantile intervals for the coefficients. By using all 10,000 draws when computing model predictions, our analysis averages over many sources of uncertainty.
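The logit-to-percentage benchmarks quoted above follow directly from the inverse-logit function:

```python
import math

def inv_logit(x):
    """Convert log-odds (logits) to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Benchmarks from the text: 0 logits = 50%, 1 logit = 73%, 2 logits = 88%.
benchmarks = {x: round(100 * inv_logit(x)) for x in (0, 1, 2)}
```

This is why a Normal(0, 1) prior on an effect is only "weakly" informative: it still treats shifts on the order of 50% → 73% intelligibility as entirely plausible.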
Results
Sources of Variation
Our item response analysis estimated how the expected intelligibility for an average child on an average word changed with age and other lexical covariates. These expectations were adjusted using child-level effects (i.e., child ability) and item-level effects (i.e., item easiness). These two sources of variation were estimated by the model. In the baseline model, the estimated standard deviation for the word-level effects was 1.04 logits, 95% posterior interval [0.84, 1.35], and the estimated standard deviation for the child-level effects was 0.67 logits, [0.59, 0.76]. Thus, there was substantial variation among children and among words, but word-level variation was larger than child-level variation, SD(word) − SD(child) = 0.37, [0.15, 0.69].
Figure 1 depicts these two sources of variation on observed child means and word means (box plots) and model predictions (filled intervals). For the model predictions, we simulated intelligibilities for new children for an average word and for new words for an average 3-year-old using the variance parameters estimated by the model. To be clear, to simulate a new child, we estimated the baseline population average (intelligibility of an average item for an average 3-year-old) and then added random draws from Normal(0, child ability variance) to simulate new children around that population mean. For new items, we drew from Normal(0, item easiness variance) instead. In both the observed data and the model predictions, the range of intelligibility values is wider for the words than for the children.
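This simulation scheme can be sketched in a few lines. The baseline of 0.2 logits is invented; the standard deviations (0.67 for children, 1.04 for words) echo the estimates reported above.

```python
import math
import random

def simulate_new_units(baseline_logit, unit_sd, n, seed=1):
    """Draw unit-level effects from Normal(0, unit_sd), add them to the
    population baseline on the logit scale, and convert to proportions."""
    rng = random.Random(seed)
    logits = [baseline_logit + rng.gauss(0.0, unit_sd) for _ in range(n)]
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Invented baseline of 0.2 logits; SDs echo the reported estimates.
new_children = simulate_new_units(0.2, 0.67, 1000)
new_words = simulate_new_units(0.2, 1.04, 1000)
```

Because the word-level standard deviation is larger, the simulated word intelligibilities fan out over a wider range than the simulated child intelligibilities, matching the pattern in Figure 1.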
Conditional Effects of Lexical Predictors
Our focal point of reference will be the expected intelligibility for an average 3-year-old for a word with average frequency, average neighborhood competition, and a typical complexity score of 10. The expected intelligibility score for this set of predictors was 55%, 95% posterior interval [46, 64]. As we found in our earlier work, there was a clear effect of age: A 6-month increase (from 36 to 42 months) predicted a corresponding increase in intelligibility of 12 percentage points, [9, 15].
Both target word frequency and total neighborhood competition had an effect on expected intelligibility. Figure 2 visualizes these effects and their interaction. For the focal comparison, a 10× increase in target frequency predicted an increase in average intelligibility of 10 percentage points, 95% posterior interval [1, 19]. A 30-unit increase in total neighborhood competition (approximately 1 SD) predicted a change of −8 percentage points, [−17, 1]. This posterior interval mostly supports a negative effect of overall neighborhood competition, but because the interval includes zero, it is plausible that this change in neighborhood competition has a negligible effect for a word of average frequency.
The bottom row of Figure 2 illustrates the frequency effect at different levels of neighborhood competition, and three aspects of the target frequency and neighborhood competition interaction are worth highlighting. First, the right ends of the regression lines (high-target-frequency words) all land in a similar location (around 75% intelligibility) regardless of the amount of neighborhood competition. Second, the left ends of the regression lines differ in location, suggesting that the apparent risk or penalty of low-frequency words changes with the amount of neighborhood competition. For our focal comparison, a 100-fold decrease in target word frequency yields an expected intelligibility of 34%, [18, 55]. From here, an additional 30-point decrease in neighborhood competition provides an intelligibility increase of 14 percentage points, [−4, 33]. This change is a larger effect than the one reported earlier, but the posterior interval still indicates that it is plausible that neighborhood competition has a negligible effect on intelligibility. Third, part of this uncertainty could stem from gaps in these two lexical measures. For example, there are no low-competition words with a target frequency greater than 1,000 per million words, nor are there high-competition words with a frequency below 10 per million words.
From the baseline intelligibility value, a 2-point increase in a word's motor complexity (i.e., approximately 1 SD) predicted a change in intelligibility of −4 percentage points, 95% posterior interval [−12, 5]. This effect of additional complexity was probably negative, with a posterior probability of .794, but the large uncertainty here means that the data mainly "suggest" a complexity penalty. Similarly, there was a statistically unclear interaction between age and motor complexity, where the apparent complexity penalty was larger at younger ages. For example, for an average child and an average word, a 2-point increase in complexity at 30 months predicted a corresponding change in intelligibility of −5 percentage points, [−13, 4], and at 42 months, the predicted change in intelligibility was −2 percentage points, [−10, 6].
For phonotactic probability, both positive and negative effects were plausible. The effect of a 10× increase in average biphone probability had a 95% posterior interval of [−0.84, 0.61] logits or, equivalently, an odds ratio of [0.43, 1.84].
Exploratory Analysis of Incorrect Responses
The preceding analyses looked at the probability that a listener correctly transcribed a child's production of a word as a function of age and word-level features. These lexical features applied to the target word—but not to the word that the listener actually transcribed. As an additional exploratory analysis, we examined the words that listeners generated as incorrect responses. First, we asked how many of these words were phonological neighbors versus nonneighbors of the target. Second, we asked whether these incorrect responses had a higher frequency than the target word.
Table 2 provides supporting information for this analysis, along with example responses for the target words.
To calculate the proportion of phonological neighbors among the responses, we classified responses into three types: target (correct transcription), neighbor (a phonological neighbor), and other (a nonneighbor). We then computed the proportion of responses per target word that fell into each type. The average percentage of neighbor responses was 17.6% (SD = 14.8). The words with the lowest percentages of neighbor responses were house (1.2%), bow (1.2%), and yawn (1.5%), and the words with the highest percentages of neighbor responses were chew (44.8%), sheet (46.6%), and dee (65.8%). House and yawn were among the words with the lowest amounts of total neighborhood competition, and dee and chew both had high amounts of neighborhood competition. Indeed, the proportion of neighbor responses correlated with the amount of neighborhood competition at r = .31 and with neighborhood density (number of neighbors) at r = .21, but with the log-10 frequency of the target word at only r = −.04.
The average percentage of other responses was 23.0% (SD = 16.1). The words with the lowest percentages of other responses were hot (4.9%), no (5.8%), and bee (5.8%), and the words with the highest percentages of other responses were rock (69.3%), jar (66.1%), and lock (53.1%). The other responses for rock /rɑk/ and lock /lɑk/ were dominated by walk /wɔk/ responses. The most frequent other responses for jar were door, dog, jaw, star, and draw. These words are not dissimilar from the target: They all share at least two points of similarity with it (a shared place of articulation for the initial consonant, a shared vowel location, or a final /r/ sound). We discuss these examples further in the Limitations section.
To assess the effect of frequency on incorrect responses, we computed frequency as we did earlier in the study: as the sum of the frequencies for an orthographic word and its homophones (e.g., the frequency for add is the sum of the frequencies for add and ad). For a word with multiple pronunciations, we used the higher frequency pronunciation. We looked at the incorrect responses for each word and computed the proportion of response words that had a higher frequency than the target. There were 22 words where more than 50% of incorrect responses used a more frequent word than the target. The words with the highest proportions of higher frequency incorrect transcriptions were yawn (99.2%), sheet (96.6%), and beanie (94.7%). These were infrequent words, so many responses would readily have a higher frequency than the target word. For some very high-frequency words such as come or no, less than 1% of responses used a more frequent word.
Because the above results partly reproduce the fact that the words varied in frequency, we computed the weighted average frequency of the responses for each target word. The frequencies were weighted by the number of times that a response appeared, and we included the target word as one of the responses. If we use token counts (i.e., the number of appearances in the corpus) for frequencies, then the average frequency of the responses was higher than the target's frequency for 31 words. The seven words where the target frequency exceeded the average response frequency were the six highest frequency words in the study plus the word hot. If instead we use the base-10 logarithm of the token counts—computing the average magnitude of the frequencies of the responses to each word—then the average response frequency is larger than the target word's in just 20 cases.
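The weighted-average computation can be sketched as follows. The response tallies and corpus counts for the target "sheet" are invented for illustration, not the study's data; note how a single very frequent response (here, "she") can pull the weighted average above the target's own frequency.

```python
import math

def weighted_avg_log_frequency(response_counts, frequencies):
    """Weighted mean of base-10 log frequencies, where the weights are the
    number of times each response (target included) appeared."""
    total = sum(response_counts.values())
    return sum(n * math.log10(frequencies[w])
               for w, n in response_counts.items()) / total

# Invented response tallies and corpus counts for the target "sheet".
counts = {"sheet": 120, "seat": 30, "she": 10}
freqs = {"sheet": 350.0, "seat": 600.0, "she": 90000.0}
avg_response_log = weighted_avg_log_frequency(counts, freqs)
target_log = math.log10(freqs["sheet"])
```

Averaging on the log scale, as in the second scheme described above, damps the influence of such extreme responses compared with averaging raw token counts.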
Discussion
In this study, we examined whether four lexical features—target word frequency, total phonological neighborhood competition, motor complexity, and phonotactic probability—predicted the intelligibility of single words produced by young children (2;6–3;11). The first two features are based on the neighborhood activation model of word recognition (Luce & Pisoni, 1998), and this model makes strong predictions for word recognition. The feature of motor complexity addressed whether more articulatorily demanding words were less intelligible. Like motor complexity, phonotactic probability provides a way to score a word's phonological structure, but the score here is based on how typical or wordlike a word is. We knew from our prior work that intelligibility shows rapid growth during this age range and that there was considerable variability among children (Hustad et al., 2020), so the age effect provided a benchmark for the effect sizes of these lexical features.
We estimated the amount of variation for children's abilities and words' difficulties, and we found a larger degree of variation among words than among children. This is particularly interesting since our earlier work showed such a large range of variability among children. The large degree of word-level variability reveals a parallel notion of individual differences for words and the need to account for word-level measures such as frequency.
We found a robust positive effect of word frequency such that higher frequency words were more intelligible on average than lower frequency words. The median effect of frequency was about half the size of the median age effect, meaning that a 10× increase in word frequency was equivalent to a 6-month increase in age in terms of intelligibility improvements. Importantly, the words that the listener
did not hear also influenced intelligibility: We found a probable negative effect of neighborhood competition, wherein words with more phonological neighbors and more frequent phonological neighbors were less intelligible on average. That is, a word with phonological similarity to many other words and/or to more frequent words is less intelligible, on average, because it comes from a more competitive neighborhood. The word frequency effect, as in the study of
Lehner and Ziegler (2021), may have been moderated by the amount of competition from neighboring words. For words with relatively little lexical competition, changes in frequency do not lead to large changes in intelligibility, but for words with high competition, changes in frequency can lead to substantial changes in intelligibility.
It is worth remarking on the success of the neighborhood activation model when applied to children's speech (this study), to laboratory stimuli (e.g.,
Luce & Pisoni, 1998), and to German dysarthric adults (
Lehner & Ziegler, 2021). Although the speakers in these studies are very different, the neighborhood activation model is a description of the perceptual processes of typical listeners, and listeners are fairly consistent in their speech perception. Thus, word frequency and neighborhood competition are relevant factors for intelligibility studies, despite the wide range of variability in typical, developing, or disordered speech.
Contrary to our intuitions, there was not a statistically clear effect of motor complexity on intelligibility over and above the effects of frequency and neighborhood competition. The scale we used assigns a complexity score for each part of a syllable based on developmental and motor production constraints (
Kent, 1992;
Kuruvilla-Dugdale et al., 2018). Because the speakers in this study are in the midst of articulatory development, they should be less intelligible on more complex words. Our analysis found suggestive evidence for this hypothesis: There was an 80% probability that the sign of the complexity effect was negative, and there was an Age × Complexity interaction in which the complexity penalty was smaller in older children. This hypothesis, however, is framed around the child's speech: We expect more phonological errors from more complex words and for younger children. However, this speech is run through a listener who can accommodate or adjust for a child's phonological errors. In other words, typically developing children tend to make common types of phonological errors, and a listener can use their knowledge of how children speak to work around those errors. Thus, it is conceivable that a clinical measure such as percentage of consonants correct would be sensitive to motor complexity, but the functional measure of intelligibility can be robust to motor complexity because of some error correction by the listener.
Phonotactic probability, as measured by average biphone probability of a word, did not have a clear effect on intelligibility. This result is in line with prior work where phonotactic probability effects are most relevant for nonwords. A nonword has no frequency and no neighborhood competition, so its typicality or wordlikeness drives how it is produced or processed (
Vitevitch et al., 1999). Children repeat higher probability sequences in nonwords more accurately compared with lower probability sequences (
Edwards et al., 2004) and learn higher probability nonwords more successfully compared with lower probability ones (
Storkel, 2001). For our task (repetition of real words), phonotactic probability is not a relevant factor for intelligibility.
One novel aspect of this study was our exploratory analysis of incorrect responses. In particular, we found that the proportion of responses that were phonological neighbors increased with the amount of neighborhood competition, but this proportion was not correlated with the target word frequency. Neighborhood competition combines two pieces of information: neighborhood density (the number of neighbors) and the frequency of those neighbors. A higher neighborhood density means that there is a larger pool of phonological neighbors that may show up in a listener's response, but on the other hand, some of these neighbors may be rare or unknown words. Neighborhood competition appropriately weights phonological neighbors by their frequency so that rare words play a smaller role in the competition. Thus, although we found that both neighborhood competition and neighborhood density positively correlated with the proportion of neighbor responses, the correlation was larger for neighborhood competition. We advise that future studies use neighborhood competition (or some kind of frequency-adjusted neighborhood density measure) instead of neighborhood density whenever possible.
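The density-versus-competition distinction is easy to see in code. Under an all-or-nothing similarity rule, the neighborhood activation model's equation from the introduction reduces to the target's share of the neighborhood's total frequency. The neighbor set and frequencies below are hypothetical, chosen only to illustrate the computation:

```python
# Hypothetical neighbors of "sheet" with token frequencies per million;
# the neighborhood includes the target word itself.
target = "sheet"
neighborhood = {"sheet": 20, "seat": 95, "sheep": 23, "sheen": 2, "shoot": 60}

# Neighborhood density: the raw count of neighbors (excluding the target).
density = sum(1 for w in neighborhood if w != target)

# Neighborhood competition: frequency-weighted, so a rare neighbor such as
# "sheen" contributes little while a frequent neighbor like "seat" dominates.
competition = sum(f for w, f in neighborhood.items() if w != target)

# With all-or-nothing similarity, the neighborhood activation model's
# recognition probability is the target's share of the total frequency
# mass in its neighborhood.
p_recognize = neighborhood[target] / (neighborhood[target] + competition)
```

Density treats sheen and seat identically; competition weights them by how often a listener actually encounters them, which is why it tracked neighbor responses more closely.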
We did not find any clear indication that listeners erred on the side of choosing higher frequency words than the target word. For infrequent words, listeners tended to use a higher frequency word in an incorrect transcription, and for high-frequency words, listeners tended to use a lower frequency word in an incorrect transcription. For 20 of the 38 words, the average log frequency of responses was larger than the log frequency of the target word, but this response frequency effect was not observed for any of the 15 most frequent words. Thus, listeners seem to select higher frequency words only when the target word has a relatively low frequency (i.e., fewer than 100 tokens per million).
Limitations
This study was the first of its kind to consider lexical factors related to speech intelligibility development in children. However, our ability to estimate the effects of lexical predictors of words was limited by the number of unique words under consideration. Our items included just 38 words, whereas
Luce and Pisoni (1998) studied 811 consonant–vowel–consonant words and
Lehner and Ziegler (2021) included over 2,100 different words. The limited word set resulted in some gaps in the “grid” of target frequency × neighborhood competition values (e.g., very low frequency × high competition or very high frequency × low competition). The youngest children in this study were less than 3 years of age, limiting the pool of familiar or picturable words; nevertheless, future work should include a larger set of words.
In this study, we used the DAS (deletion, addition, or substitution of one phoneme) definition of phonological similarity. This approach is convenient and methodologically tractable, requiring just a pronunciation dictionary, but target similarity is an all-or-nothing feature of a neighboring word in this scheme. As observed with the incorrect responses, the word
walk might not be a neighbor of
rock or
lock based on the vowel conventions or normative dialect of the pronunciation dictionary. Our definition of neighborhood competition and our estimate of its role in word recognition were sensitive to the definition of phonological similarity. It is plausible, however, that the neighborhood effects would be stronger with a more fine-grained definition of similarity, such as phoneme confusion probabilities (as in
Luce & Pisoni, 1998) or phonetic similarity (where matching phonemes can differ by one phonetic feature, meaning
thumb–fawn and
veer–bull are neighbors; see
Luce et al., 2000).
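The DAS rule itself is a one-edit check over phoneme sequences. A minimal sketch follows; the transcriptions are illustrative ARPAbet-style forms, and, as noted above, dialect-dependent vowel choices in the dictionary determine the outcome:

```python
def is_das_neighbor(a, b):
    """True if phoneme sequences a and b differ by exactly one
    deletion, addition, or substitution (the DAS rule)."""
    a, b = list(a), list(b)
    if a == b:
        return False  # a word is not its own DAS neighbor
    if len(a) == len(b):
        # Same length: neighbors iff exactly one substitution.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) == 1:
        # Lengths differ by one: neighbors iff deleting one phoneme
        # from the longer sequence yields the shorter one.
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False

# rock ~ lock: one substitution, so they are neighbors.
print(is_das_neighbor(["R", "AA", "K"], ["L", "AA", "K"]))
# walk vs. rock: two phonemes differ under these transcriptions,
# so walk falls outside rock's neighborhood.
print(is_das_neighbor(["W", "AO", "K"], ["R", "AA", "K"]))
```

A dictionary that transcribes walk's vowel as AA rather than AO would flip the second comparison, which is exactly the sensitivity to vowel conventions discussed above.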
Clinical Implications and Future Directions
One way to think of a word's frequency and its effect on word recognition is as an expectation, a bias, or a preactivated response that bets on some words being more likely than others. Now, suppose that we had the functional goal of maximizing a child's intelligibility on various words. Our work here shows that we can rely on listeners' expectations to support the child's intelligibility. In particular, a listener will be more likely to understand a more frequently used word than a less frequently used word. In fact, the benefit of a high-frequency word is much stronger than the penalty for a word with higher motor complexity.
Further work is needed to see if this frequency–complexity finding holds for children with speech motor impairment and children with other speech sound disorders, where there are irregular or unexpected differences in production features of speech. As discussed above, a listener can accommodate typical phonological errors made by typically developing children, but we do not know how lexical features of words interact with the presence of atypical errors. It is plausible that the penalty from increased motor complexity leads to productions that are too dissimilar from the target for the listener's expectations to recover their message. The models of word recognition reviewed earlier assume that speakers accurately produce speech sounds and these sounds in turn activate lexical representations. However, these lexical dynamics are liable to change if the speaker produces ambiguous or distorted speech sounds. It could be the case that frequency becomes even more important (i.e., a listener relies even more on top-down knowledge or expectations to account for the poor production).
Ultimately, findings from this study further highlight the complexity of speech intelligibility as a multidimensional construct. Notably, quantifying the extent to which lexical features of words impact speech intelligibility in children provides important information for clinical consideration and for future study. One potential clinical direction for this work may be consideration of treatment targets for intelligibility-focused intervention. This may take the form of communication strategies for capitalizing on use of words that maximize beneficial lexical features to advance functional intelligibility. Future studies are necessary to continue to investigate specific strategies for improving intelligibility in children who have speech disorders.
Data Availability Statement
Data are available by request from the authors.
Acknowledgments
This study was funded by National Institute on Deafness and Other Communication Disorders Grants R01DC015653 and R01DC009411, awarded to Katherine C. Hustad. Support was also provided by National Institute of Child Health and Human Development Grant U54HD090256. The authors thank the families and children who participated in this study. They also thank Ashley Sakash and Phoebe Natzke for their assistance with this project.