Time-Continuous Emotion Recognition Using Spectrogram Based CNN-RNN Modelling | SpringerLink

Time-Continuous Emotion Recognition Using Spectrogram Based CNN-RNN Modelling

  • Conference paper
Speech and Computer (SPECOM 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)


Abstract

In the area of speech emotion recognition, hand-engineered features are traditionally used as input. However, this requires an additional feature-extraction step before prediction, as well as prior knowledge to select the feature set. Recent research has therefore focused on approaches that predict emotions directly from the speech signal, reducing the effort required for feature extraction and increasing the performance of the emotion recognition system. Whereas this approach has been applied to the prediction of categorical emotions, studies on the prediction of continuous dimensional emotions are still rare. This paper presents a method for time-continuous prediction of emotions from speech using spectrograms. The proposed model comprises a convolutional neural network (CNN) and a recurrent neural network with long short-term memory (RNN-LSTM). The hyperparameters of the CNN are investigated to improve the performance of the model. After finding the optimal hyperparameters, the performance of the system with waveform versus spectrogram input is compared in terms of the concordance correlation coefficient (CCC). The proposed method outperforms the end-to-end emotion recognition system based on raw waveform and achieves a CCC of 0.722 when predicting arousal on the RECOLA database.
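The CCC used as the evaluation metric above rewards both high correlation and agreement in mean and variance between the predicted and gold-standard affect traces. A minimal NumPy sketch (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989):
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    Equals 1 only for perfect agreement; penalises location/scale
    shifts in addition to low correlation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

gold = np.array([0.1, 0.4, 0.35, 0.8])       # a toy arousal trace
print(ccc(gold, gold))                        # identical traces -> 1.0
print(ccc(gold, gold + 0.5))                  # constant offset lowers CCC
```

Unlike Pearson correlation, a prediction that is perfectly correlated but shifted by a constant (second call) is penalised, which is why CCC is the standard metric for time-continuous affect prediction on RECOLA.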



Acknowledgements

The results of this work were used in the master's thesis of Bobae Kim. The research was partially supported by DAAD, the Government of the Russian Federation (Grant 08-08), and the RFBR foundation (project No. 18-07-01407).

Author information

Correspondence to Dmitrii Fedotov.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Fedotov, D., Kim, B., Karpov, A., Minker, W. (2019). Time-Continuous Emotion Recognition Using Spectrogram Based CNN-RNN Modelling. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science (LNAI), vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science, Computer Science (R0)
