Age and Gender Recognition from Speech Using Deep Neural Networks

  • Conference paper
Advances in Physical Agents II (WAF 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1285)

Abstract

This paper addresses joint gender identification and age group classification from speech, with the aim of improving the functionality of Interactive Voice Response systems. Deep Neural Networks are used because they have recently demonstrated strong discriminative and representational capabilities across a wide range of applications, including speech processing problems that rely on feature extraction and selection. A comparative study of several neural network architectures and sizes is presented to determine how performance depends on the network architecture and the number of free parameters. The classification framework was trained and evaluated on Mozilla’s ‘Common Voice’ dataset, an open, crowdsourced speech corpus. The results are promising: the best systems achieve a gender identification error below 2% and an age group classification error below 20%.
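
To make the notions of "number of free parameters" and "classification error" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain fully connected network with a shared trunk and two output heads (gender and age group). The feature dimension, layer widths, and class counts are hypothetical values chosen only for illustration, since the abstract does not state them.

```python
from typing import Sequence


def dense_params(n_in: int, n_out: int) -> int:
    """Free parameters of one fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


def joint_model_params(n_features: int, hidden: Sequence[int],
                       n_gender: int, n_age_groups: int) -> int:
    """Free parameters of a shared dense trunk feeding two output heads.

    All sizes are hypothetical; the paper's real architectures are not
    described in the abstract.
    """
    total = 0
    width = n_features
    for units in hidden:
        total += dense_params(width, units)
        width = units
    # Both heads read from the last shared hidden layer.
    total += dense_params(width, n_gender)
    total += dense_params(width, n_age_groups)
    return total


def error_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of examples whose predicted class differs from the label."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


if __name__ == "__main__":
    # Hypothetical setup: 40 input features, two hidden layers,
    # 2 gender classes, 6 age groups.
    print(joint_model_params(40, [64, 32], n_gender=2, n_age_groups=6))
    print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```

Sweeping the `hidden` argument over several layer configurations gives a simple way to tabulate model size against measured error, which is the kind of comparison the abstract describes.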

Acknowledgement

This work has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness, under project RTC-2016-4687-7, and by the Spanish Ministry of Science, Innovation and Universities, under project RTI2018-098085-B-C42 (MSIU/FEDER).

Author information

Corresponding author

Correspondence to Manuel Rosa-Zurera.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sánchez-Hevia, H.A., Gil-Pita, R., Utrilla-Manso, M., Rosa-Zurera, M. (2021). Age and Gender Recognition from Speech Using Deep Neural Networks. In: Bergasa, L.M., Ocaña, M., Barea, R., López-Guillén, E., Revenga, P. (eds) Advances in Physical Agents II. WAF 2020. Advances in Intelligent Systems and Computing, vol 1285. Springer, Cham. https://doi.org/10.1007/978-3-030-62579-5_23
