Speech Perception in Complex Acoustic Environments: Developmental Effects

Purpose: The ability to hear and understand speech in complex acoustic environments follows a prolonged time course of development. The purpose of this article is to provide a general overview of the literature describing age effects in susceptibility to auditory masking in the context of speech recognition, including a summary of findings related to the maturation of processes thought to facilitate segregation of target from competing speech.

Method: Data from published and ongoing studies are discussed, with a focus on synthesizing results from studies that address age-related changes in the ability to perceive speech in the presence of a small number of competing talkers.

Conclusions: This review provides a summary of the current state of knowledge that is valuable for researchers and clinicians. It highlights the importance of considering listener factors, such as age and hearing status, as well as stimulus factors, such as masker type, when interpreting masked speech recognition data.

Children live and learn in complex acoustic environments, which contain multiple sources of competing sounds. Acoustic waveforms generated by these sources may be relatively steady in frequency and intensity over time, or they may be more dynamic. For example, a child in a science classroom might be exposed to speech produced by his or her teacher, speech produced by other children, and noise produced by an aquarium. Given the high prevalence of competing sounds in children's natural listening environments (e.g., Ambrose, VanDam, & Moeller, 2014) and the mounting evidence linking exposure to competing sounds to delays in language development and learning (e.g., Shield & Dockrell, 2008), it is essential that we understand how and when the ability to hear and understand speech in complex acoustic environments develops. This is not a trivial problem; the ability to recognize speech in the presence of competing sounds relies on accurate and efficient processing across multiple stages within the auditory and cognitive systems.
The goal of this review is to (a) provide a simple model describing stages of auditory processing; (b) differentiate between energetic and informational masking; (c) review the literature describing developmental effects in susceptibility to speech-in-speech masking; (d) introduce the hypothesis that the ability to take advantage of acoustic voice characteristics that facilitate segregation of talker from masker speech requires extensive experience with sound; and (e) consider how congenital hearing loss may impact experience with sound, thus altering the maturation of speech-in-speech perception skills.

Stages of Auditory Processing
Figure 1 depicts several stages of auditory processing required to recognize speech in multisource environments. The child in the science classroom must listen to his teacher's lecture while disregarding speech produced by his classmates and noise generated by the aquarium's filter and pump. What reaches the child's ears is a mixture of acoustic waveforms produced by all three sources. In order to "hear out" the teacher's instructions, the basic spectral, temporal, and intensity properties of her speech must first be encoded by the child's peripheral auditory system. The fidelity of this peripheral encoding is compromised by the presence of the competing sounds. The representation of waveforms associated with the competing speech and noise may overlap on the basilar membrane with those of the target speech, thus degrading the neural representation of the teacher's spoken message transmitted to the child's central auditory system. This phenomenon is often referred to in the literature as energetic masking.
The ability to hear speech in the presence of competing sounds also relies on central auditory and cognitive processes that allow listeners to group sounds into separate auditory objects and allocate attention to a particular object while discounting other objects (e.g., Best, Ozmeral, & Shinn-Cunningham, 2007; Bregman, 1990; Bronkhorst, 2000). In addition to degrading the peripheral representation of target speech, competing sounds may also impact speech perception by disrupting this higher-level processing. This disruption often reduces the extent to which listeners disentangle target speech from competing sounds, even when the fidelity with which the peripheral auditory system encodes the target speech is sufficient. These difficulties are most pronounced when the target speech and the competing masker are perceptually similar, such as speech recognition in a masker composed of a small number of speech streams (e.g., Brungart, 2001; Carhart, Tillman, & Greetis, 1969; Freyman, Balakrishnan, & Helfer, 2004). This phenomenon is often referred to in the literature as informational masking.

Competing Noise Versus Competing Speech
The majority of studies investigating masked speech perception have examined speech recognition in the presence of relatively steady-state sounds, such as babble (≥ 4 talkers), Gaussian noise, or speech-shaped noise (e.g., Dubno, Dirks, & Morgan, 1984; Frisina & Frisina, 1997; Gravel, Fausel, Liskow, & Chobot, 1999; Lunner & Sundewall-Thorén, 2007). Not surprisingly, these relatively steady-state sounds have commonly been included as maskers in clinical speech-in-noise tests, for example, the Quick Speech-in-Noise Test (Killion, Niquette, Gudmundsen, Revit, & Banerjee, 2004) and the Hearing in Noise Test (Nilsson, Soli, & Sullivan, 1994; Niquette et al., 2003). At least for young adults, steady noise is expected to produce primarily energetic masking by physically interfering with encoding of all or parts of the target speech representation at the periphery (reviewed by Brungart, 2005). In their seminal study, for example, Miller and Nicely (1955) measured adults' identification of consonant-vowel syllables in broadband noise using a closed-set task. Their findings showed that adults have more difficulty identifying some consonants than others in noise (e.g., manner information) and that error patterns are generally uniform across listeners. Subsequent work demonstrated that consonant error patterns in noise are influenced by spectral characteristics of the masking noise (e.g., Phatak, Lovitt, & Allen, 2008) and signal-to-noise ratio (SNR; e.g., Miller & Nicely, 1955; Phatak & Allen, 2007; Woods, Yund, Herron, & Cruadhlaoich, 2010). These findings provide compelling evidence that the factors responsible for consonant identification in noise are related to features of the stimuli and how those features are encoded by the peripheral auditory system.

Figure 1. This illustration highlights three stages of auditory processing. In the first stage, a combination of acoustic waveforms produced by three sources (a science teacher, students working on a project, the pump and filter of an aquarium) reaches the child's ear. In the second stage, represented as a spectrogram, the peripheral auditory system encodes the temporal, spectral, and intensity characteristics of these waveforms into a pattern of neural activity across auditory nerve fibers that is transmitted to higher levels within the auditory system. In the third stage, top-down auditory-perceptual, cognitive, and linguistic processing facilitate reconstruction of the auditory scene.
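Throughout this literature, SNR expresses the level of the target relative to the masker in decibels. The computation can be sketched from sample amplitudes as follows; this is a simplified illustration only, since published studies specify presentation levels with calibrated equipment rather than raw waveform samples:

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, computed from the ratio of the
    root-mean-square (RMS) amplitudes of two sample sequences."""
    rms = lambda x: math.sqrt(sum(s * s for s in x) / len(x))
    return 20.0 * math.log10(rms(signal) / rms(noise))

# Equal-RMS signal and masker give 0 dB SNR; doubling the signal
# amplitude adds about 6 dB.
sig = [1.0, -1.0] * 100
print(round(snr_db(sig, sig), 1))                   # 0.0
print(round(snr_db([2 * s for s in sig], sig), 1))  # 6.0
```

A 0 dB SNR thus means the target and masker are presented at equal RMS levels, the condition used in several of the fixed-SNR studies discussed in this review.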
There has been a more recent emphasis in the literature on understanding how maskers composed of a small number of speech streams impact speech recognition (e.g., Freyman et al., 2004). It is well documented that speech recognition in a single stream of competing speech is typically easier for young adults than listening in steady noise, in part because fluctuations within the competing speech stream provide listeners with an opportunity to "glimpse" portions of the target speech (e.g., Cooke, 2006; Howard-Jones & Rosen, 1993). However, speech recognition in a masker composed of two to three streams of speech is often more difficult than in steady noise (e.g., Carhart, Johnson, & Goodman, 1975; Freyman et al., 2004). For example, Carhart et al. (1975) estimated young adults' spondee recognition thresholds in the presence of white noise, speech-shaped noise, and combinations of speech produced by one, two, three, 16, 32, 64, or 128 talkers. Overall masker level was held constant across all masker conditions. Considering the speech maskers, masking increased as the number of talkers increased from one to three. Interestingly, no further increases in masking were observed as additional talkers were added beyond three. It has been suggested that this pattern of results reflects an increase in informational masking as the number of streams increases from one to two or three and opportunities for glimpsing decrease, followed by a reduction in informational masking and an increase in energetic masking as additional talkers are added and target/masker similarity decreases (e.g., Freyman et al., 2004).
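Holding overall masker level constant while varying the number of talkers amounts to normalizing the summed mixture to a fixed RMS level. The sketch below illustrates that normalization using Gaussian noise streams as stand-ins for talkers; the function name and target level are illustrative choices, not details taken from Carhart et al. (1975):

```python
import math
import random

def mix_at_constant_level(streams, target_rms=0.1):
    """Sum masker streams sample by sample, then rescale so the overall
    mixture has the same RMS level regardless of how many streams went in."""
    mix = [sum(samples) for samples in zip(*streams)]
    rms = math.sqrt(sum(s * s for s in mix) / len(mix))
    return [s * target_rms / rms for s in mix]

# Independent noise streams standing in for individual talkers: the
# mixture RMS comes out the same whether there are 1, 3, or 128 streams.
rng = random.Random(0)
for n_talkers in (1, 3, 128):
    streams = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(n_talkers)]
    mix = mix_at_constant_level(streams)
    rms = math.sqrt(sum(s * s for s in mix) / len(mix))
```

Because the overall level is fixed, any change in masking with the number of talkers reflects the masker's structure rather than its intensity.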

Masked Speech Recognition in Children
Findings from multiple laboratories provide converging evidence that children require higher SNRs than young adults to achieve similar performance on a wide range of speech-in-noise measures (e.g., Corbin, Bonino, Buss, & Leibold, 2016; Elliott, Connors, Kille, Levin, Ball, & Katz, 1979). In many studies, mature performance has been observed by about 9-10 years of age (e.g., Corbin et al., 2016; Nishi, Lewis, Hoover, Choi, & Stelmachowicz, 2010), providing evidence that the ability to perceptually segregate target speech from a noise masker may be immature early in the school-age years but is adultlike by adolescence. One study, for example, examined children's and young adults' consonant identification performance in the presence of speech-shaped noise at a fixed SNR of 0 dB. Percent correct scores are shown in the left panel of Figure 2, with data for children split into three age groups (5-7, 8-10, and 11-13 years). The youngest group of children performed significantly worse than the older listeners in the noise masker, scoring an average of 11 percentage points below young adults. In contrast, 8- to 10-year-olds and 11- to 13-year-olds performed as well as the adults.
Although children have more difficulty recognizing speech in noise than adults, substantially larger and longer-lasting developmental effects have been observed in the presence of one or two streams of competing speech (e.g., Corbin et al., 2016; Hall, Grose, Buss, & Dev, 2002; Wightman & Kistler, 2005). The right panel of Figure 2 shows percent correct scores for consonant recognition in the presence of a continuous, two-talker speech masker. Striking child-adult differences in performance are evident in these data, including a 36 percentage point decrement in performance for 5- to 7-year-old children relative to 19- to 34-year-old adults. Similar age effects have been reported for word (e.g., Hall, Grose, Buss, & Dev, 2002) and sentence (e.g., Calandruccio, Leibold, & Buss, 2016) recognition in a two-talker masker.
In addition to the substantially larger child-adult differences observed for speech-in-speech compared with speech-in-noise recognition, masked speech recognition appears to mature at different rates in competing speech versus noise. This trend is evident in the masked consonant identification data shown in Figure 2; although 11- to 13-year-olds performed as well as young adults in the speech-shaped noise masker, they performed 10 percentage points poorer than adults in the two-talker speech masker. Recently, Corbin et al. (2016) assessed word recognition in the presence of speech-shaped noise or two-talker speech in over 50 school-age children ranging in age from 5 to 16 years. Young adults (19-40 years) were also tested to provide an estimate of mature performance. Findings indicated a more prolonged time course of development for speech recognition in two-talker speech than in speech-shaped noise. Speech recognition thresholds in speech-shaped noise improved steadily until about 10 years of age, but thresholds in two-talker speech did not reach adultlike levels until 13-14 years of age. Two additional findings reported by Corbin et al. (2016) are worth highlighting. First, an abrupt improvement in speech recognition thresholds was observed in the two-talker masker around 13-14 years of age. Whereas few children younger than 13 years of age had thresholds in the range observed for young adults, almost all children over 14 years of age demonstrated mature performance. The mechanisms responsible for this complex pattern of development are unclear. In Corbin et al. (2016), we posited that maturation of cognitive processing related to executive functioning may underlie the rapid improvement in speech-in-speech recognition observed between 13 and 14 years of age, and we highlighted the need for future experiments targeting development across adolescence. The second notable observation from Corbin et al. (2016) is that speech recognition thresholds obtained from the same children in the presence of speech-shaped noise and two-talker speech were uncorrelated. The lack of an association between thresholds in the two masker conditions provides further evidence that speech-in-noise and speech-in-speech perception abilities mature at divergent rates, reflecting contributions from different underlying factors.

Figure 2. Group average percent correct scores for consonant identification are presented for 5- to 7-year-olds (circles), 8- to 10-year-olds (squares), 11- to 13-year-olds (triangles), and young adults (hexagons), as adapted from . Error bars are ± 1 SEM. Data on the left show performance in a speech-shaped noise masker, and data on the right show performance in a two-talker speech masker. Note the magnitude of child-adult differences in the two-talker masker relative to the speech-shaped noise masker.
Although the data are somewhat mixed, mounting evidence supports the idea that child-adult differences in speech-in-speech recognition partly reflect immature glimpsing abilities. Initial studies investigating speech recognition in temporally modulated noise found no child-adult differences in the amount of benefit derived from masker modulation (Stuart, 2008; Stuart, Givens, Walker, & Elangovan, 2006). However, results from more recent work involving complex noise or speech maskers with both spectral and temporal modulations (e.g., Hall, Buss, Grose, & Roush, 2012; Buss, Leibold, Porter, & Grose, 2017) and/or reverberation (e.g., Wróblewski, Lewis, Valente, & Stelmachowicz, 2012) indicate that children are less able than adults to benefit from the glimpses available in a fluctuating masker.
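The notion of glimpsing can be made concrete: in Cooke's (2006) glimpsing model, a glimpse is a time-frequency region in which the local SNR exceeds a criterion of about 3 dB. A minimal sketch of that bookkeeping, where the level values and criterion below are illustrative rather than drawn from any of the studies cited above:

```python
def glimpse_proportion(target_db, masker_db, criterion_db=3.0):
    """Proportion of time-frequency cells in which the target level exceeds
    the masker level by at least `criterion_db` (cf. Cooke's 2006 glimpsing
    model, which used a local SNR criterion of about 3 dB)."""
    cells = list(zip(target_db, masker_db))
    glimpsed = sum(1 for t, m in cells if t - m >= criterion_db)
    return glimpsed / len(cells)

# A fluctuating masker exposes glimpses that a steady masker at the same
# average level does not.
target = [60.0] * 8
steady = [60.0] * 8
fluctuating = [70.0, 50.0, 70.0, 50.0, 70.0, 50.0, 70.0, 50.0]
print(glimpse_proportion(target, steady))       # 0.0
print(glimpse_proportion(target, fluctuating))  # 0.5
```

On this view, a child-adult difference in fluctuating-masker benefit reflects not the availability of glimpses, which is fixed by the stimuli, but the listener's ability to detect and use them.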

Age Effects in the Ability to Utilize Acoustic Differences Between Talkers to Separate Speech Streams
There is growing interest among researchers in characterizing the specific factors responsible for children's increased susceptibility to speech-in-speech masking relative to young adults (e.g., Calandruccio et al., 2016; Newman, Morini, Ahsan, & Kidd, 2015; Wightman, Kistler, & Brungart, 2006). In our laboratory, for example, we have evaluated the extent to which children benefit from the introduction of acoustic differences in vocal characteristics between talkers to segregate target from masker speech (e.g., Calandruccio, Buss, & Leibold, 2013; Flaherty, Leibold, & Buss, 2017; Leibold, Taylor, Hillock-Dunn, & Buss, 2013). This approach is based on results from experiments involving young adults showing that target/masker segregation is aided by the presence of robust acoustic differences between speech produced by different talkers (e.g., Bronkhorst, 2000; Brungart, Simpson, Ericson, & Scott, 2001; Darwin, Brungart, & Simpson, 2003). These vocal characteristics, primarily fundamental frequency (F0) and formant frequencies, are associated with the length of the vocal folds and the size and length of the vocal tract, respectively (e.g., Fitch & Giedd, 1999).
F0 and formant frequencies vary across talkers, with male voices tending to be lower in frequency than female voices. Consistent with the hypothesis that between-talker differences in these vocal characteristics facilitate target/masker segregation, young adults typically show substantially better speech-in-speech recognition when target and masker speech are mismatched in sex than when they are matched in sex (e.g., Brungart, 2001; Festen & Plomp, 1990; Freyman et al., 2004). For example, Brungart (2001) compared 21- to 55-year-old adults' speech-in-speech recognition at a fixed SNR using the Coordinate Response Measure Test (Bolia, Nelson, Ericson, & Simpson, 2000) across conditions in which target and masker speech was produced by the same talker, by different talkers matched in sex, or by different talkers mismatched in sex. Performance was 15-20 percentage points higher when target and masker phrases were produced by different same-sex talkers than by the same talker, and it improved by an additional 15-20 percentage points when the talkers were mismatched in sex. It has been suggested that these findings reflect a decrease in both energetic and informational masking driven by the relatively large acoustic differences between male and female speech, which make it easier for listeners to segregate target and masker speech streams than when both are produced by talkers of the same sex (e.g., Freyman et al., 2004).
Results from ongoing experiments in our laboratory suggest that the ability to exploit even large acoustic differences in vocal characteristics between talkers takes many years to fully develop. Calandruccio et al. (2013) compared children's (5-10 years old) and young adults' speech recognition thresholds in two-talker speech between sex-matched and sex-mismatched target/masker conditions. A similar sex-mismatch benefit was observed for children and adults. In a related study, however, Leibold, Taylor, et al. (2013) observed no sex-mismatch benefit for 7- to 13-month-old infants in the context of speech-in-speech detection. Although the methods used to test infants and school-age children differ, thresholds for young adults tested by Leibold, Taylor, et al. (2013) using the infant paradigm were considerably lower for sex-mismatched than sex-matched target/masker conditions, a finding consistent with Calandruccio et al. (2013). In sharp contrast, infant thresholds for sex-matched and sex-mismatched conditions were similar. The pattern of results observed across these two studies is consistent with the hypothesis that the ability to take advantage of acoustic differences between male and female speech is not established at birth but develops between infancy and the school-age years.
Although the school-age children tested by Calandruccio et al. (2013) showed a robust sex-mismatch benefit, they remained more susceptible to speech-in-speech masking than young adults even when target and masker speech was mismatched in sex. A possible explanation for this finding is that the ability to utilize less redundant and/or more subtle differences in voice characteristics between talkers of the same sex follows a prolonged time course of development. Flaherty et al. (2017) recently examined this possibility by testing a wide age range of school-age children (5-15 years old) and young adults on speech-in-speech conditions in which only F0 was manipulated. An adaptive procedure was used to estimate the SNR required for 70.7% word recognition in a two-talker speech masker. The target and masker speech was produced by the same talker, a choice intended to accentuate informational masking effects (e.g., Brungart et al., 2001) and to isolate the influence of target/masker differences in F0 on speech-in-speech recognition. In separate conditions, the F0 of the target speech was either matched to the masker's F0 (i.e., unaltered) or shifted higher in frequency by three, six, or nine semitones. The F0 of the masker speech remained constant across experimental conditions. Preliminary data are presented in Figure 3, which shows thresholds estimated using unaltered target words (open circles) and target words shifted up by six semitones (shaded triangles). The vertical lines represent the benefit of introducing the relatively large target/masker F0 difference. Consistent with previous findings (e.g., Darwin et al., 2003; Mackersie, Dewey, & Guthrie, 2011), thresholds for all of the young adult listeners were considerably lower when the target F0 was shifted higher in frequency than the masker F0 relative to when the target and masker F0s were matched.
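The 70.7% correct point is the convergence point of a two-down, one-up adaptive rule (Levitt, 1971): the SNR is made harder after two consecutive correct responses and easier after each error, so the track settles where the probability of a correct response is sqrt(0.5) ≈ 0.707. A minimal simulation of such a track follows; the simulated listener's psychometric function, step size, and trial count are illustrative assumptions, not details of the study above:

```python
import math
import random

def two_down_one_up(p_correct, start_snr=10.0, step=2.0, n_trials=400, seed=0):
    """Simulate a two-down, one-up staircase. `p_correct(snr)` gives the
    probability of a correct response at a given SNR; the track converges
    near the SNR yielding 70.7% correct."""
    rng = random.Random(seed)
    snr, run, track = start_snr, 0, []
    for _ in range(n_trials):
        track.append(snr)
        if rng.random() < p_correct(snr):
            run += 1
            if run == 2:      # two consecutive correct responses: make it harder
                snr -= step
                run = 0
        else:                 # any incorrect response: make it easier
            snr += step
            run = 0
    return track

# Illustrative logistic psychometric function; its 70.7%-correct point
# falls near 1.76 dB SNR.
pf = lambda snr: 1.0 / (1.0 + math.exp(-snr / 2.0))
track = two_down_one_up(pf)
threshold = sum(track[200:]) / len(track[200:])  # average of later trials
```

Averaging the later trial values (or, in practice, the final reversal points) yields the threshold estimate reported as the speech recognition threshold.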
The same general pattern of results was observed for children 10 years of age and older. Surprisingly, however, children younger than about 10 years of age did not take advantage of target/masker F0 differences. This age effect suggests that the skills required to utilize target/masker F0 differences take roughly a decade of auditory experience and/or neural maturation to develop.
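For reference, a shift of n semitones corresponds to a frequency ratio of 2^(n/12), so the six-semitone shift raises F0 by a factor of about 1.41. A small sketch of the arithmetic; the 100 Hz baseline is an illustrative value, not one taken from the study:

```python
def shifted_f0(f0_hz, semitones):
    """Return F0 after a shift of the given number of semitones.
    Each semitone multiplies frequency by 2**(1/12)."""
    return f0_hz * 2.0 ** (semitones / 12.0)

# The shifts used in the conditions described above, applied to a
# 100 Hz baseline: 3 -> ~118.9 Hz, 6 -> ~141.4 Hz, 9 -> ~168.2 Hz.
shifts = {st: round(shifted_f0(100.0, st), 1) for st in (3, 6, 9)}
```

Even the smallest manipulation, three semitones, is thus a proportional change of nearly 19% in F0, which makes the younger children's failure to benefit from it all the more striking.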

Influence of Hearing Loss on Speech-in-Speech Recognition
It has been known for many years that sensory/neural hearing loss often reduces the fidelity with which the peripheral auditory system encodes sound (e.g., Buss, Hall, & Grose, 2004; Glasberg & Moore, 1986; Moore & Carlyon, 2005). Moreover, multiple studies involving young adults (e.g., Fu, Shannon, & Wang, 1998; Peters, Moore, & Baer, 1998) and children (e.g., Hall, Buss, Grose, & Roush, 2012) have shown that peripheral encoding deficits negatively impact speech recognition in the presence of nominally steady noise or babble. Damage to sensory hair cells and other structures within the auditory periphery is also likely to interfere with speech recognition in the presence of competing speech, both by reducing access to acoustic cues that facilitate the segregation of target from masker speech and by degrading representations of temporal and spectral changes over time (e.g., Qin & Oxenham, 2003).
Although less studied, an additional factor that appears to influence speech-in-speech recognition outcomes for children who are hard of hearing is auditory experience. Specifically, results from a growing number of studies indicate that children with sensory/neural hearing loss often have reduced and/or less consistent auditory experience than peers with normal hearing (reviewed by . For example, many young children who are hard of hearing do not wear their hearing aids for more than 6 hr per day (e.g., Muñoz, Preston, & Hicken, 2014;Walker et al., 2014). In addition, it has been estimated that approximately a third of pediatric hearing aids may not provide optimal audibility (e.g., McCreery et al., 2014). The critical problem that arises from these two issues is that both hearing aid use and aided audibility moderate language outcomes for children who are hard of hearing (e.g., Tomblin et al., 2015;Tomblin, Oleson, Ambrose, Walker, & Moeller, 2014).
Based on the emerging data indicating that language outcomes for children who are hard of hearing are influenced by experience with sound, Leibold, Hillock-Dunn, Duncan, Roush, and Buss (2013) tested the hypothesis that reduced auditory experience associated with congenital sensory/neural hearing loss negatively impacts the development of perceptual abilities related to the segregation and selection of target from background speech. Children with bilateral sensory/neural hearing loss (9-17 years old) and age-matched peers with normal hearing completed an adaptive spondee recognition task in two-talker speech and in speech-shaped noise. The children who were hard of hearing wore their hearing aids during testing. In the speech-shaped noise masker condition, which was expected to produce energetic masking, children who were hard of hearing required an additional 3.5 dB SNR relative to their peers with normal hearing to achieve comparable performance. This disadvantage increased to 8.1 dB SNR in the two-talker speech masker, which was expected to produce both energetic and informational masking. In a follow-up study, Hillock-Dunn, Taylor, Buss, and Leibold (2015) observed that performance in the two-talker masker, but not in speech-shaped noise, was correlated with parental reports of their children's everyday communication and speech understanding abilities. Interestingly, Corbin et al. (2016) failed to observe a correlation between speech reception thresholds in a two-talker speech and a speech-shaped noise masker in a related study involving over 50 school-age children with normal hearing. Considered together, these findings have important clinical implications because they suggest that measures of speech-in-speech recognition may be more predictive of children's functional hearing skills than conventional clinical assessments made in quiet or in steady noise.

Figure 3. Estimates of the signal-to-noise ratio (SNR) required to obtain 70.7% word recognition in a two-talker masker are plotted as a function of age for individual children and young adults tested by Flaherty et al. (2017). Thresholds estimated using target and masker speech with the same fundamental frequency (unaltered F0) are shown by the open circles, and thresholds estimated using target speech shifted up by six semitones (shifted F0) are shown by the shaded triangles. Whereas older children and adults benefitted from a target/masker F0 difference, children younger than about 9 years did not.

Provisional Conclusions and Future Directions
Children are more susceptible to auditory masking than young adults, requiring a more advantageous SNR to achieve comparable performance. Although child-adult differences in masked speech recognition are evident in the presence of relatively steady noise, this performance gap is larger, and the time course of development is prolonged, in the presence of a small number of competing speech streams. Using the practical example of the child in the science classroom, speech produced by the student's classmates likely has a more detrimental effect on hearing and understanding the teacher's instruction than the noise produced by the aquarium. Emerging data indicate that, although child-adult differences are substantial for children with normal hearing, children who are hard of hearing are particularly vulnerable to speech-in-speech masking. Considering all of the currently available data, we propose the working hypothesis that maturation of the perceptual skills required to segregate target speech in real-world, multisource environments requires years of exposure to high-quality auditory input.
Several avenues of future research have emerged as important steps toward understanding the development of hearing in complex acoustic environments. First, isolating the factors responsible for children's increased susceptibility to speech-in-speech masking is paramount. Second, it is critical that we design rigorous, theoretically motivated experiments to evaluate the influence of early auditory experience on the development of masked speech perception skills. Finally, we are now in a position to create a new generation of clinical speech recognition tools that more closely approximate the types of complex acoustic environments children encounter in their everyday lives.