Age and Gender Recognition from Speech Using Deep Neural Networks

Sánchez-Hevia, Héctor A.; Gil-Pita, Roberto; Utrilla-Manso, Manuel; Rosa-Zurera, Manuel

doi:10.1007/978-3-030-62579-5_23

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1285)

Included in the following conference series:

Workshop of Physical Agents

Abstract

This paper addresses joint gender identification and age group classification from speech, with the aim of improving the functionality of Interactive Voice Response systems. Deep Neural Networks are used because they have recently demonstrated strong discriminative and representation capabilities across a wide range of applications, including speech processing problems based on feature extraction and selection. A comparative study of several neural network architectures and sizes is presented to show how performance depends on the network architecture and on the number of free parameters. The classification framework was trained and evaluated using Mozilla's 'Common Voice' dataset, an open, crowdsourced speech corpus. The results are promising: the best systems achieve a gender identification error below 2% and an age group classification error below 20%.
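
The abstract reports two separate error figures for a joint task. As a minimal sketch of how two such figures can be derived from joint predictions, the snippet below assumes each utterance carries a (gender, age group) label pair. The label names, the toy data, and the function `task_error_rates` are illustrative assumptions and are not taken from the paper, whose exact label set and evaluation protocol are not given here.

```python
from typing import List, Tuple

# Hypothetical joint label: (gender, age_group). The label names used below
# are illustrative assumptions, not the label set used in the paper.
Label = Tuple[str, str]


def task_error_rates(truth: List[Label], predicted: List[Label]) -> Tuple[float, float]:
    """Return (gender_error, age_group_error) as fractions in [0, 1]."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equally long")
    # Count mismatches separately for each component of the joint label.
    gender_errors = sum(t[0] != p[0] for t, p in zip(truth, predicted))
    age_errors = sum(t[1] != p[1] for t, p in zip(truth, predicted))
    n = len(truth)
    return gender_errors / n, age_errors / n


# Toy data, made up for illustration only.
truth = [("female", "twenties"), ("male", "fifties"),
         ("male", "teens"), ("female", "thirties")]
predicted = [("female", "thirties"), ("male", "forties"),
             ("female", "teens"), ("female", "thirties")]

gender_err, age_err = task_error_rates(truth, predicted)
print(f"gender error: {gender_err:.2%}, age group error: {age_err:.2%}")
```

Scoring the two components separately means a prediction with the correct gender but the wrong age group counts only against the age group error, which is one way a system could show a low gender error alongside a higher age group error.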

Acknowledgement

This work has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness, with project RTC-2016-4687-7 and the Spanish Ministry of Science, Innovation and Universities, with project RTI2018-098085-B-C42 (MSIU/FEDER).

Author information

Authors and Affiliations

Signal Theory and Communications Department, University of Alcalá, 28805, Alcalá de Henares, Madrid, Spain
Héctor A. Sánchez-Hevia, Roberto Gil-Pita, Manuel Utrilla-Manso & Manuel Rosa-Zurera

Corresponding author

Correspondence to Manuel Rosa-Zurera.

Editor information

Editors and Affiliations

Electronics Department, University of Alcalá, Madrid, Spain
Luis M. Bergasa, Manuel Ocaña, Rafael Barea, Elena López-Guillén & Pedro Revenga

About this paper

Cite this paper

Sánchez-Hevia, H.A., Gil-Pita, R., Utrilla-Manso, M., Rosa-Zurera, M. (2021). Age and Gender Recognition from Speech Using Deep Neural Networks. In: Bergasa, L.M., Ocaña, M., Barea, R., López-Guillén, E., Revenga, P. (eds) Advances in Physical Agents II. WAF 2020. Advances in Intelligent Systems and Computing, vol 1285. Springer, Cham. https://doi.org/10.1007/978-3-030-62579-5_23

DOI: https://doi.org/10.1007/978-3-030-62579-5_23
Published: 03 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62578-8
Online ISBN: 978-3-030-62579-5
eBook Packages: Intelligent Technologies and Robotics (R0)
