Abstract
A speech emotion recognition algorithm based on multi-feature and multi-lingual fusion is proposed to address the low recognition accuracy caused by the lack of large speech datasets and the low robustness of acoustic features in speech emotion recognition. First, handcrafted and deep automatic features are extracted from existing Chinese and English emotional speech data. Then, the two kinds of features are fused for each language. Finally, the fused features of the different languages are fused again and used to train a classification model. Comparing the fused features with the unfused ones, the results show that feature fusion significantly improves the accuracy of the speech emotion recognition algorithm. The proposed solution is evaluated on two Chinese corpora and two English corpora, and is shown to provide more accurate predictions than the original solution. This study shows that the multi-feature and multi-lingual fusion algorithm can significantly improve speech emotion recognition accuracy when the dataset is small.
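The abstract does not fix a concrete implementation, so the following is a minimal sketch of the pipeline it describes, under stated assumptions: MFCC statistics stand in for the handcrafted features, a fixed random projection of a log-mel spectrogram stands in for the deep automatic features (in the paper's setting these would come from a pretrained network), concatenation is used as the fusion operator, and an SVM is used as the classifier. All function names here are hypothetical illustrations, not the authors' code.

```python
# Sketch of the multi-feature / multi-lingual fusion pipeline (assumptions
# noted above; not the authors' implementation).
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def handcrafted_features(wav_path, sr=16000, n_mfcc=39):
    """Utterance-level handcrafted features: mean and std of MFCCs."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def deep_features(wav_path, dim=128):
    """Stand-in for deep automatic features (assumption: a pretrained
    encoder would be used in practice). A fixed random projection of a
    pooled log-mel spectrogram keeps the sketch runnable."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    logmel = librosa.power_to_db(mel).mean(axis=1)  # pool over time
    rng = np.random.default_rng(0)  # fixed seed -> deterministic projection
    W = rng.standard_normal((dim, logmel.shape[0]))
    return W @ logmel

def fuse_utterance(wav_path):
    """Feature-level fusion: concatenate handcrafted and deep features."""
    return np.concatenate([handcrafted_features(wav_path),
                           deep_features(wav_path)])

def train(chinese_items, english_items):
    """Multi-lingual fusion: pool fused features from both corpora into
    one training set, then fit a single classifier.
    Each item is a (wav_path, emotion_label) pair."""
    items = chinese_items + english_items
    X = np.stack([fuse_utterance(path) for path, _ in items])
    y = np.array([label for _, label in items])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, y)
    return clf
```

Pooling the fused Chinese and English features into a single training set is the step that lets the small per-language corpora compensate for each other, which is the effect the abstract attributes to multi-lingual fusion.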
Acknowledgements
For advice and discussions, we thank Heyan Huang, a professor in the School of Computer Science, Beijing Institute of Technology. We also thank the anonymous reviewers for their valuable work.
Cite this article
Wang, C., Ren, Y., Zhang, N. et al. Speech emotion recognition based on multi‐feature and multi‐lingual fusion. Multimed Tools Appl 81, 4897–4907 (2022). https://doi.org/10.1007/s11042-021-10553-4