Abstract
Purpose
Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time-consuming. Forced-alignment algorithms automate this process by aligning a transcript with a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers.
Method
The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals.
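The two evaluation metrics described above can be sketched in a few lines. This is an illustrative snippet, not code from the study; the function names and interval values are invented for the example, and times are assumed to be in seconds.

```python
# Sketch of the two evaluation metrics: midpoint-coverage accuracy
# and the difference in phone-onset time between automatic and manual
# alignments. Interval boundaries are in seconds; values are made up.

def midpoint_covered(auto_start, auto_end, manual_start, manual_end):
    """True if the automatic interval covers the midpoint of the manual interval."""
    midpoint = (manual_start + manual_end) / 2
    return auto_start <= midpoint <= auto_end

def onset_difference(auto_start, manual_start):
    """Absolute difference in phone-onset time (seconds)."""
    return abs(auto_start - manual_start)

# Example: automatic interval [0.48, 0.61] vs. manual (gold) interval [0.50, 0.60]
print(midpoint_covered(0.48, 0.61, 0.50, 0.60))      # True: midpoint 0.55 is covered
print(round(onset_difference(0.48, 0.50), 3))        # 0.02 s onset error
```

An aligner's accuracy for a class of sounds is then the proportion of its intervals for which `midpoint_covered` is true, and its timing performance is summarized over the per-phone onset differences.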
Results
The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all aligners, and alignment accuracy for fricatives increased with age across the aligners.
Conclusion
The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels and, for older children, fricatives), especially as part of a semi-automated workflow in which alignments are later inspected for gross errors.