Abstract
In the area of speech emotion recognition, hand-engineered features have traditionally been used as input. However, this requires an additional feature-extraction step before prediction, as well as prior knowledge to select a suitable feature set. Recent research has therefore focused on approaches that predict emotions directly from the speech signal, reducing the effort required for feature extraction and increasing the performance of the emotion recognition system. While this approach has been applied to the prediction of categorical emotions, studies on the prediction of continuous dimensional emotions remain rare. This paper presents a method for time-continuous prediction of emotions from speech using spectrograms. The proposed model comprises a convolutional neural network (CNN) and a recurrent neural network with long short-term memory (RNN-LSTM). The hyperparameters of the CNN are investigated to improve the performance of our model. After finding the optimal hyperparameters, the performance of the system with waveform and spectrogram input is compared in terms of the concordance correlation coefficient (CCC). The proposed method outperforms an end-to-end emotion recognition system based on raw waveforms and achieves a CCC of 0.722 when predicting arousal on the RECOLA database.
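The abstract evaluates predictions with the concordance correlation coefficient (CCC), which combines correlation with agreement in mean and scale. As a minimal sketch (not the authors' implementation), Lin's CCC can be computed as follows; the function name is illustrative:

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Lin's CCC: 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2).

    Equals 1 only for perfect agreement; unlike Pearson correlation,
    it is penalized by shifts in mean or scale between the two series.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()  # population variance
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2.0 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)
```

For time-continuous affect recognition, such a function would typically be applied to the gold-standard and predicted arousal/valence traces over a whole recording.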
Acknowledgements
The results of this work were used in the master's thesis of Bobae Kim. The research was partially supported financially by DAAD, the Government of the Russian Federation (Grant 08-08), and the RFBR foundation (project No. 18-07-01407).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fedotov, D., Kim, B., Karpov, A., Minker, W. (2019). Time-Continuous Emotion Recognition Using Spectrogram Based CNN-RNN Modelling. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science(), vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3