Abstract
We consider convolutional neural network architectures used to assess a person's emotional state from their speech. We address the problem of increasing the efficiency of emotion recognition by reducing the computational complexity of the process. To this end, we propose a method that transforms the input data into a form suitable for machine learning algorithms.
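The abstract does not spell out the transformation itself, but the common pattern in this line of work is to compress the raw waveform into a compact time-frequency representation (such as MFCCs) before classification with a small 1D convolutional network. The Python sketch below illustrates only that general idea, not the authors' method; the file name `utterance.wav`, the MFCC parameters, and the network layout are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's exact method): shrinking the
# feature space of raw speech before feeding it to a 1D CNN.
import numpy as np
import librosa
import tensorflow as tf

# Load ~3 s of speech; librosa resamples to 22 050 Hz by default,
# so the raw input is roughly 66 000 samples per utterance.
y, sr = librosa.load("utterance.wav", duration=3.0)  # hypothetical file

# Compress the waveform into 13 MFCCs per frame: the resulting matrix
# (13 x n_frames, a few thousand values) is far smaller than the signal.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
x = mfcc.T[np.newaxis, ...]          # shape: (1, n_frames, 13)

# A small 1D CNN over the frame axis with 7 emotion classes
# (a common label set for EMO-DB-style corpora; an assumption here).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=x.shape[1:]),
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(7, activation="softmax"),
])
print(model(x).shape)                # (1, 7): class probabilities
```

The point of the reduction step is visible in the shapes: the ~66 000-sample waveform becomes a matrix of roughly 130 frames by 13 coefficients, which is what keeps the downstream network cheap to train and evaluate.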
Funding
The study was supported by the Russian Foundation for Basic Research, project no. 18-29-22104.
Additional information
Translated by V. Potapchouck
About this article
Cite this article
Iskhakova, A.O., Vol’f, D.A. & Meshcheryakov, R.V. Method for Reducing the Feature Space Dimension in Speech Emotion Recognition Using Convolutional Neural Networks. Autom Remote Control 83, 857–868 (2022). https://doi.org/10.1134/S0005117922060042