
Effect of Linear Prediction Order to Modify Formant Locations for Children Speech Recognition

  • Conference paper
Speech and Computer (SPECOM 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)


Abstract

Speech recognition performs poorly on children’s speech compared with adult speech. Neural network models require a large amount of data to achieve good performance, but only a very limited amount of children’s speech data is publicly available. A baseline system was developed using adult speech for training and children’s speech for testing. This kind of system suffers from mismatches between the training and testing speech data. To reduce one of these mismatches, the difference in formant frequency locations between adults and children, in this paper we explore the effect of the linear prediction order when modifying the formant frequency locations. The method is studied for narrowband and wideband speech, and in both cases it reduces the word error rate (WER) for GMM-HMM, DNN-HMM, and TDNN acoustic models. The TDNN acoustic model gives the best performance of the three. For the TDNN acoustic model, the best formant modification factor is \(\alpha = 0.1\) in both conditions: with linear prediction order 6 for narrowband speech (WER 13.82%) and with linear prediction order 20 for wideband speech (WER 12.19%). We also compare the method with vocal tract length normalization (VTLN) and speaking rate adaptation (SRA) and find that the proposed method gives a larger reduction in WER than either.
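To make the role of the linear prediction order concrete, the following is a minimal sketch of standard LP analysis, not the method evaluated in the paper. It fits all-pole models of two different orders to a synthetic two-resonance signal and reads formant candidates off the peaks of the LP envelope. The abstract does not state how \(\alpha\) enters the modification, so the sketch stops at the analysis step. The function names, the 8 kHz sampling rate, and the resonance values are all assumptions chosen for illustration.

```python
import cmath
import math


def lp_coefficients(frame, order):
    """Autocorrelation-method LP via Levinson-Durbin; returns [1, a1, ..., a_order]."""
    n = len(frame)
    r = [sum(frame[t] * frame[t + lag] for t in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a


def lp_envelope_peaks(a, fs, n_bins=512):
    """Frequencies (Hz) of interior local maxima of the LP envelope 1/|A(e^{jw})|."""
    mags = []
    for b in range(n_bins + 1):
        w = math.pi * b / n_bins
        A = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
        mags.append(1.0 / abs(A))
    return [0.5 * fs * b / n_bins for b in range(1, n_bins)
            if mags[b - 1] < mags[b] > mags[b + 1]]


def resonator_poly(formants, bandwidths, fs):
    """Denominator polynomial of an all-pole filter with the given resonances."""
    poly = [1.0]
    for f, bw in zip(formants, bandwidths):
        rad = math.exp(-math.pi * bw / fs)  # pole radius from bandwidth
        sec = [1.0, -2.0 * rad * math.cos(2.0 * math.pi * f / fs), rad * rad]
        poly = [sum(poly[i] * sec[k - i] for i in range(len(poly)) if 0 <= k - i < 3)
                for k in range(len(poly) + 2)]
    return poly


def impulse_response(a, n):
    """First n samples of the impulse response of 1/A(z)."""
    y = []
    for t in range(n):
        x = 1.0 if t == 0 else 0.0
        y.append(x - sum(a[j] * y[t - j] for j in range(1, len(a)) if t - j >= 0))
    return y


# Toy setup (all values are illustrative assumptions, not taken from the paper).
FS = 8000                                   # assumed narrowband sampling rate
TRUE_FORMANTS = [700.0, 1800.0]             # hypothetical resonances in Hz
BANDWIDTHS = [80.0, 100.0]                  # hypothetical bandwidths in Hz

a_true = resonator_poly(TRUE_FORMANTS, BANDWIDTHS, FS)
# No analysis window is applied, which keeps the toy fit essentially exact;
# real speech frames are normally windowed before LP analysis.
frame = impulse_response(a_true, 800)

peaks_p4 = lp_envelope_peaks(lp_coefficients(frame, 4), FS)
peaks_p2 = lp_envelope_peaks(lp_coefficients(frame, 2), FS)
print("order 4 peaks (Hz):", peaks_p4)
print("order 2 peaks (Hz):", peaks_p2)
```

With order 4 the envelope resolves both resonances, while with order 2 at most one peak can appear, because an order-p model has at most p/2 complex pole pairs. This illustrates why the choice of order changes which spectral peaks are available for a formant modification step, and why a wider bandwidth may call for a higher order.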



Author information

Corresponding author

Correspondence to Udara Laxman Kumar.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kumar, U.L., Kurimo, M., Kathania, H.K. (2023). Effect of Linear Prediction Order to Modify Formant Locations for Children Speech Recognition. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_39

  • DOI: https://doi.org/10.1007/978-3-031-48309-7_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7

  • eBook Packages: Computer Science, Computer Science (R0)
