An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13758)

Abstract

In recent years, virtual assistants have become an essential part of many applications on smart devices. Users talk to these assistants to give commands, which makes speech emotion recognition an important problem for improving the service and quality of virtual assistants. However, speech emotion recognition is not a straightforward task, because emotion can be expressed through various features. A deep understanding of these features is crucial to achieving good results. To this end, this paper conducts empirical experiments on three kinds of speech features (Mel-spectrogram, Mel-frequency cepstral coefficients, and Tempogram) and their variants for the task of speech emotion recognition. Convolutional Neural Networks, Long Short-Term Memory, a Multi-layer Perceptron Classifier, and a Light Gradient Boosting Machine are used to build emotion classification models based on the three speech features. Two popular datasets, the Ryerson Audio-Visual Database of Emotional Speech and Song and the Crowd-Sourced Emotional Multimodal Actors Dataset, are used to train these models.
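
Two of the three features named in the abstract, the Mel-spectrogram and Mel-frequency cepstral coefficients, rest on the mel scale, which spaces frequency bands the way listeners perceive pitch rather than linearly in Hz. The following is a minimal sketch of that spacing, not the authors' code: it assumes the widely used conversion mel = 2595 * log10(1 + f / 700), and the function names, the band count of 8, and the 0 to 8000 Hz range are illustrative choices only. Audio libraries may use a different variant of the conversion.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed formula: 2595*log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, fmin: float, fmax: float) -> list:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Hypothetical settings chosen for illustration only.
centers = mel_band_centers(n_mels=8, fmin=0.0, fmax=8000.0)

# Spacing between neighbouring band centers, in Hz.
gaps = [b - a for a, b in zip(centers, centers[1:])]

for c in centers:
    print(f"{c:8.1f} Hz")
```

The bands are equally spaced in mels, so the gaps in Hz widen as frequency rises: low frequencies get finer resolution than high ones. A Mel-spectrogram sums spectral energy within such bands, and Mel-frequency cepstral coefficients are derived from the log of those band energies.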

Notes

  1. https://librosa.org/doc/latest/index.html

  2. https://scikit-learn.org/stable/

  3. https://www.tensorflow.org/

  4. https://lightgbm.readthedocs.io/en/latest/


Corresponding author

Correspondence to Binh Van Duong.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Duong, B.V., Ha, C.N., Nguyen, T.T., Nguyen, P., Do, TH. (2022). An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_15

  • DOI: https://doi.org/10.1007/978-3-031-21967-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21966-5

  • Online ISBN: 978-3-031-21967-2

  • eBook Packages: Computer Science, Computer Science (R0)
