Abstract
We consider convolutional neural network architectures used to assess a person's emotional state from their speech. We address the problem of increasing the efficiency of emotion recognition by reducing the computational complexity of the process. To this end, we propose a method that transforms the input data into a form suitable for machine learning algorithms.
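The abstract does not spell out the transformation itself, but the common pattern in this line of work is to compress the raw waveform into a compact time-frequency representation (such as MFCCs) before classification with a small 1D convolutional network. The Python sketch below illustrates only that general idea, not the authors' method; the file name `utterance.wav`, the MFCC parameters, and the network layout are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's exact method): shrinking the
# feature space of raw speech before feeding it to a 1D CNN.
import numpy as np
import librosa
import tensorflow as tf

# Load ~3 s of speech; librosa resamples to 22 050 Hz by default,
# so the raw input is roughly 66 000 samples per utterance.
y, sr = librosa.load("utterance.wav", duration=3.0)  # hypothetical file

# Compress the waveform into 13 MFCCs per frame: the resulting matrix
# (13 x n_frames, a few thousand values) is far smaller than the signal.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
x = mfcc.T[np.newaxis, ...]          # shape: (1, n_frames, 13)

# A small 1D CNN over the frame axis with 7 emotion classes
# (a common label set for EMO-DB-style corpora; an assumption here).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=x.shape[1:]),
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(7, activation="softmax"),
])
print(model(x).shape)                # (1, 7): class probabilities
```

The point of the reduction step is visible in the shapes: the ~66 000-sample waveform becomes a matrix of roughly 130 frames by 13 coefficients, which is what keeps the downstream network cheap to train and evaluate.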
Funding
The study was supported by the Russian Foundation for Basic Research, project no. 18-29-22104.
Additional information
Translated by V. Potapchouck
About this article
Cite this article
Iskhakova, A.O., Vol’f, D.A. & Meshcheryakov, R.V. Method for Reducing the Feature Space Dimension in Speech Emotion Recognition Using Convolutional Neural Networks. Autom Remote Control 83, 857–868 (2022). https://doi.org/10.1134/S0005117922060042