A multi-modal emotion fusion classification method combined expression and speech based on attention mechanism

  • Topical collection: 1181: Multimedia-based Healthcare Systems using Computational Intelligence
Multimedia Tools and Applications

Abstract

This paper studies how an attention mechanism can be used to fuse the time-series information of facial expressions and speech, and proposes a multi-modal feature-fusion emotion recognition model based on an attention mechanism. First, facial expression features and speech features are extracted. Facial expression features are extracted with a C3D-LSTM hybrid model, which effectively captures the spatial and temporal expression features in video. For speech, Mel-Frequency Cepstral Coefficients (MFCCs) are extracted as the initial features, and a convolutional neural network then learns higher-level features from them. Next, a face and speech recognition method based on the attention mechanism is proposed. By performing attention analysis over the fused features, the method captures the relationships between features, so that clean, strongly discriminative features receive larger weights while noisy features are down-weighted. Finally, the method is applied to fused facial expression and speech emotion recognition. Experimental results show that the proposed multi-modal emotion classification model outperforms methods reported in other literature on the RML dataset, achieving an average recognition rate of 81.18%.
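
The full text is paywalled, so the sketch below only illustrates the pipeline the abstract describes: a C3D-LSTM branch for spatio-temporal facial features, a CNN over MFCC frames for speech features, and an attention step that re-weights the concatenated feature sequence before classification. It is a minimal PyTorch sketch, not the authors' implementation; all layer sizes, the use of nn.MultiheadAttention for the attention analysis, and the six-way output head are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): attention-based fusion
# of a C3D-LSTM visual branch and an MFCC-CNN audio branch.
import torch
import torch.nn as nn


class VisualBranch(nn.Module):
    """C3D-style 3D convolutions followed by an LSTM over the time axis."""
    def __init__(self, hidden=128):
        super().__init__()
        self.c3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),        # keep the time dimension
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)

    def forward(self, clips):                          # (B, 3, T, H, W)
        x = self.c3d(clips).squeeze(-1).squeeze(-1)    # (B, 64, T)
        out, _ = self.lstm(x.transpose(1, 2))          # (B, T, hidden)
        return out


class AudioBranch(nn.Module):
    """1-D CNN over a precomputed MFCC sequence (e.g. 40 coefficients/frame)."""
    def __init__(self, n_mfcc=40, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mfcc):                           # (B, n_mfcc, T)
        return self.cnn(mfcc).transpose(1, 2)          # (B, T, hidden)


class AttentionFusionClassifier(nn.Module):
    """Concatenate both feature streams and let self-attention re-weight them,
    so clean, discriminative features contribute more than noisy ones."""
    def __init__(self, hidden=128, n_classes=6):
        super().__init__()
        self.visual = VisualBranch(hidden)
        self.audio = AudioBranch(hidden=hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clips, mfcc):
        fused = torch.cat([self.visual(clips), self.audio(mfcc)], dim=1)
        attended, weights = self.attn(fused, fused, fused)  # (B, Tv+Ta, hidden)
        return self.head(attended.mean(dim=1)), weights


if __name__ == "__main__":
    model = AttentionFusionClassifier()
    clips = torch.randn(2, 3, 16, 64, 64)   # 16 RGB face frames per sample
    mfcc = torch.randn(2, 40, 100)          # 40 MFCCs x 100 speech frames
    logits, attn = model(clips, mfcc)
    print(logits.shape)                     # torch.Size([2, 6])
```

The attention weights returned alongside the logits play the role the abstract assigns to attention analysis: time steps with noisy features receive less weight in the pooled representation, while strongly discriminative features are weighted more heavily.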

Acknowledgement

This work was supported in part by the Natural Science Foundation of Shandong Province of China under Grant ZR202103020833, in part by the Social Science Planning Research Project of Shandong Province under Grant 18CLYJ50, in part by the Shandong Soft Science Research Program under Grant 2018RKB01144, and in part by the Project of Shandong Province Higher Educational Science and Technology Program under Grant J15LN15.

Author information

Corresponding author

Correspondence to Dong Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, D., Chen, L., Wang, L. et al. A multi-modal emotion fusion classification method combined expression and speech based on attention mechanism. Multimed Tools Appl 81, 41677–41695 (2022). https://doi.org/10.1007/s11042-021-11260-w
