Abstract
Purpose
Visual cues from a speaker's face may benefit perceptual adaptation to degraded speech, but current evidence is limited. We aimed to replicate results from previous studies to establish the extent to which visual speech cues can lead to greater adaptation over time, extending existing results to a real-time adaptation paradigm (i.e., without a separate training period). A second aim was to investigate whether eye gaze patterns toward the speaker's mouth were related to better perception, hypothesizing that listeners who looked more at the speaker's mouth would show greater adaptation.
Method
A group of listeners (n = 30) was presented with 90 noise-vocoded sentences in audiovisual format, whereas a control group (n = 29) was presented with the audio signal only. Recognition accuracy was measured throughout, and eye tracking was used to measure fixations toward the speaker's eyes and mouth in the audiovisual group.
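For readers unfamiliar with the stimulus manipulation, the sketch below illustrates the general noise-vocoding procedure: the speech signal is split into a small number of frequency bands, the amplitude envelope of each band is extracted, and each envelope modulates band-limited noise before the bands are recombined. The channel count, band edges, and envelope cutoff shown here are illustrative assumptions, not the parameters used to create the stimuli in this study.

```python
# Minimal noise-vocoding sketch (illustrative parameters, not the study's stimuli).
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def noise_vocode(speech, fs, n_channels=4, f_lo=100.0, f_hi=8000.0, env_cutoff=30.0):
    """Replace spectral fine structure with noise, keeping per-band amplitude envelopes."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)      # log-spaced band edges (assumption)
    noise = np.random.randn(len(speech))                   # broadband noise carrier
    out = np.zeros(len(speech), dtype=float)
    env_sos = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(band_sos, speech)                   # band-limited speech
        env = np.clip(sosfiltfilt(env_sos, np.abs(band)), 0, None)  # amplitude envelope
        carrier = sosfilt(band_sos, noise)                 # band-limited noise
        out += env * carrier                               # envelope-modulated noise band
    # Match overall level of the output to the input.
    return out * np.sqrt(np.mean(speech**2) / (np.mean(out**2) + 1e-12))
```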
Results
Previous studies were partially replicated: The audiovisual group had better recognition throughout and adapted slightly more rapidly, but both groups showed an equal amount of improvement overall. Longer fixations on the speaker's mouth in the audiovisual group were related to better overall accuracy. An exploratory analysis further demonstrated that the duration of fixations to the speaker's mouth decreased over time.
Conclusions
The results suggest that visual cues may not benefit adaptation to degraded speech as much as previously thought. Longer fixations on a speaker's mouth may play a role in successfully decoding visual speech cues; however, future research is needed to confirm this finding and to fully understand how patterns of eye gaze relate to audiovisual speech recognition. All materials, data, and code are available at https://osf.io/2wqkf/.
