Abstract
This paper investigates how an attention mechanism can fuse the temporal information of facial expressions and speech, and proposes a multi-modal feature fusion emotion recognition model based on attention. First, facial expression features and speech features are extracted. Facial expression features are obtained with a C3D-LSTM hybrid model, which effectively captures the spatial and temporal expression features in videos. For speech, Mel Frequency Cepstral Coefficients (MFCCs) provide the initial features, and a convolutional neural network extracts higher-level features from them. A facial expression and speech recognition method based on the attention mechanism is then proposed. By applying attention analysis to the fused features, the method models the relationships among features, so that clean, highly discriminative features receive larger weights while noisy features are down-weighted. Finally, the method is applied to fused facial expression and speech emotion recognition. Experimental results on the RML dataset show that the proposed multi-modal emotion classification model outperforms those reported in other literature, with an average recognition rate of 81.18%.
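The following is a minimal sketch of the attention-based fusion idea summarized above, assuming the video branch (e.g., a C3D-LSTM) and the audio branch (e.g., MFCC features passed through a CNN) each produce one fixed-length feature vector per sample. The feature dimension, the scoring network, and the class count are illustrative assumptions, not the authors' implementation.

```python
# Sketch (PyTorch) of attention-weighted fusion of two modality feature vectors.
# Assumptions (not from the paper): each branch outputs a 256-d vector per sample,
# and a small scoring network assigns one attention weight per modality.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, feat_dim=256, num_classes=6):
        super().__init__()
        # Scores each modality feature; a higher score yields a larger fusion weight.
        self.score = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video_feat, audio_feat):
        # Stack the two modalities: (batch, 2, feat_dim)
        feats = torch.stack([video_feat, audio_feat], dim=1)
        # Attention weights over the modalities: (batch, 2, 1)
        weights = torch.softmax(self.score(feats), dim=1)
        # Weighted sum emphasizes discriminative features and down-weights noisy ones.
        fused = (weights * feats).sum(dim=1)          # (batch, feat_dim)
        return self.classifier(fused), weights.squeeze(-1)

# Usage with dummy branch outputs (placeholders for C3D-LSTM and MFCC-CNN features).
if __name__ == "__main__":
    video_feat = torch.randn(4, 256)   # hypothetical C3D-LSTM output
    audio_feat = torch.randn(4, 256)   # hypothetical MFCC + CNN output
    logits, attn = AttentionFusion()(video_feat, audio_feat)
    print(logits.shape, attn.shape)    # torch.Size([4, 6]) torch.Size([4, 2])
```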
Acknowledgement
This work was supported in part by the Natural Science Foundation of Shandong Province of China under Grant ZR202103020833, in part by the Social Science Planning Research Project of Shandong Province under Grant 18CLYJ50, in part by the Shandong Soft Science Research Program under Grant 2018RKB01144, and in part by the Project of Shandong Province Higher Educational Science and Technology Program under Grant J15LN15.
Cite this article
Liu, D., Chen, L., Wang, L. et al. A multi-modal emotion fusion classification method combined expression and speech based on attention mechanism. Multimed Tools Appl 81, 41677–41695 (2022). https://doi.org/10.1007/s11042-021-11260-w