Abstract
Children’s speech recognition performs poorly compared to adult speech recognition. Neural network models require a large amount of data to achieve good performance, but only a very limited amount of children’s speech data is publicly available. A baseline system was developed using adult speech for training and children’s speech for testing. Such a system suffers from mismatches between the training and testing speech data. To address one of these mismatches, the difference in formant frequency locations between adults and children, this paper explores the effect of linear prediction order on modifying the formant frequency locations. The method is studied for both narrowband and wideband speech and gives reductions in word error rate (WER) for GMM-HMM, DNN-HMM, and TDNN acoustic models. The TDNN acoustic model gives the best performance among the acoustic models. For the TDNN acoustic model, the best formant modification factor \(\alpha \) is 0.1 with linear prediction order 6 for narrowband speech (WER 13.82%) and 0.1 with linear prediction order 20 for wideband speech (WER 12.19%). The method is also compared with vocal tract length normalization (VTLN) and speaking rate adaptation (SRA), and it gives a larger reduction in WER than either VTLN or SRA.
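To make the role of the linear prediction order concrete, the following is a minimal Python sketch, not the implementation used in the paper. It estimates formant-like pole frequencies from an all-pole model fitted with the autocorrelation method, and then shifts them with a first-order all-pass frequency warp controlled by \(\alpha \). The abstract does not define how \(\alpha \) enters the method, so the warping function and its sign convention are assumptions borrowed from warped linear prediction. The toy signal, the 8 kHz sampling rate, and the function names `lpc`, `formant_estimates`, and `warp_frequency` are also illustrative assumptions.

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method linear prediction; returns [1, a_1, ..., a_p]."""
    n = len(signal)
    # Autocorrelation at lags 0..order.
    r = np.correlate(signal, signal, mode="full")[n - 1 : n + order]
    # Solve the Toeplitz normal equations R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def formant_estimates(a, fs):
    """Frequencies (Hz) of the complex pole pairs of 1/A(z), ascending."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 1e-8]  # keep one pole per conjugate pair
    return np.sort(np.angle(roots)) * fs / (2 * np.pi)

def warp_frequency(f_hz, alpha, fs):
    """First-order all-pass frequency warp (ASSUMED form, not from the paper)."""
    w = 2 * np.pi * np.asarray(f_hz, dtype=float) / fs
    w_warped = w + 2 * np.arctan(alpha * np.sin(w) / (1 - alpha * np.cos(w)))
    return w_warped * fs / (2 * np.pi)

# Toy "vowel": impulse response of an all-pole filter with two resonances.
fs = 8000  # assumed narrowband sampling rate
true_formants = np.array([700.0, 1200.0])
poles = 0.97 * np.exp(1j * 2 * np.pi * true_formants / fs)
a_true = np.real(np.poly(np.concatenate((poles, np.conj(poles)))))

h = np.zeros(2000)
for n in range(len(h)):
    acc = 1.0 if n == 0 else 0.0
    for k in range(1, len(a_true)):
        if n - k >= 0:
            acc -= a_true[k] * h[n - k]
    h[n] = acc

# Order 4 provides two pole pairs, enough to resolve both resonances.
est_order4 = formant_estimates(lpc(h, 4), fs)
# Order 2 provides at most one pole pair, so the two resonances merge.
est_order2 = formant_estimates(lpc(h, 2), fs)
# With this sign convention, a positive alpha moves the estimates upward.
warped = warp_frequency(est_order4, alpha=0.1, fs=fs)

print("order 4:", est_order4)
print("order 2:", est_order2)
print("order 4, warped with alpha = 0.1:", warped)
```

An all-pole model of order p has at most p/2 complex pole pairs, so the order bounds how many spectral peaks the model can represent and, in turn, which peaks a subsequent warp can move. This is one plausible reading of why the best-performing order differs between narrowband and wideband speech.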
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kumar, U.L., Kurimo, M., Kathania, H.K. (2023). Effect of Linear Prediction Order to Modify Formant Locations for Children Speech Recognition. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_39
DOI: https://doi.org/10.1007/978-3-031-48309-7_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48308-0
Online ISBN: 978-3-031-48309-7
eBook Packages: Computer Science (R0)