Abstract
This paper addresses joint gender identification and age group classification from speech, with the aim of improving the functionality of Interactive Voice Response systems. Deep neural networks are used because they have recently demonstrated strong discriminative and representation capabilities across a wide range of applications, including speech processing problems based on feature extraction and selection. A comparative study of several neural network architectures and sizes is presented to show how performance depends on the network architecture and the number of free parameters. The classification framework was trained and evaluated using Mozilla’s ‘Common Voice’ dataset, an open, crowdsourced speech corpus. The results are promising, with the best systems achieving a gender identification error lower than 2% and an age group classification error lower than 20%.
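The two error figures quoted above are per-task classification error rates. As a minimal sketch of how such figures could be computed for a joint classifier, the Python snippet below scores gender and age group separately from paired labels. The function names, the label strings, and the five example utterances are illustrative assumptions; they are not taken from the paper or from Common Voice.

```python
from typing import Sequence, Tuple


def error_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of utterances whose predicted label differs from the reference."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label sequences must be non-empty and of equal length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


def joint_errors(
    predicted: Sequence[Tuple[str, str]],
    actual: Sequence[Tuple[str, str]],
) -> Tuple[float, float]:
    """Per-task error rates for (gender, age_group) label pairs."""
    pred_gender, pred_age = zip(*predicted)
    true_gender, true_age = zip(*actual)
    return error_rate(pred_gender, true_gender), error_rate(pred_age, true_age)


# Hypothetical reference labels and predictions for five utterances.
actual = [
    ("female", "twenties"),
    ("male", "thirties"),
    ("male", "fifties"),
    ("female", "teens"),
    ("male", "twenties"),
]
predicted = [
    ("female", "twenties"),
    ("male", "twenties"),    # age group wrong
    ("male", "fifties"),
    ("female", "teens"),
    ("female", "twenties"),  # gender wrong
]

gender_err, age_err = joint_errors(predicted, actual)
print(f"gender error: {gender_err:.1%}, age group error: {age_err:.1%}")
```

Scoring the two tasks separately, rather than counting a joint prediction as correct only when both labels match, is what allows the two error rates to be reported independently.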
Acknowledgement
This work has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness, with project RTC-2016-4687-7 and the Spanish Ministry of Science, Innovation and Universities, with project RTI2018-098085-B-C42 (MSIU/FEDER).
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sánchez-Hevia, H.A., Gil-Pita, R., Utrilla-Manso, M., Rosa-Zurera, M. (2021). Age and Gender Recognition from Speech Using Deep Neural Networks. In: Bergasa, L.M., Ocaña, M., Barea, R., López-Guillén, E., Revenga, P. (eds) Advances in Physical Agents II. WAF 2020. Advances in Intelligent Systems and Computing, vol 1285. Springer, Cham. https://doi.org/10.1007/978-3-030-62579-5_23
DOI: https://doi.org/10.1007/978-3-030-62579-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62578-8
Online ISBN: 978-3-030-62579-5
eBook Packages: Intelligent Technologies and Robotics (R0)