Abstract
In this paper, we present a multi-feature classification framework for the Multimodal Emotion Recognition Challenge held with the Chinese Conference on Pattern Recognition (CCPR 2016). The task of the challenge is to recognize one of eight facial emotions in short video segments extracted from Chinese films, TV plays, and talk shows. In our framework, both traditional methods and Deep Convolutional Neural Networks (DCNNs) are used to extract various features, and a separate classifier is trained on each feature to predict the video emotion label. A decision-level fusion method then aggregates these individual predictions into a final result. Results on the competition database show that our method is effective for recognizing facial emotion in Chinese audio-visual data.
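To make the decision-level fusion step concrete, the following Python snippet is a minimal sketch, not the authors' exact method: it assumes each per-feature classifier (for example, an SVM on hand-crafted features and a DCNN on face crops) outputs a probability vector over the eight emotion classes, and it fuses them by a weighted sum followed by an argmax. The function name, fusion weights, and example values are all hypothetical.

    import numpy as np

    NUM_CLASSES = 8  # eight emotion labels in the challenge task

    def fuse_predictions(prob_vectors, weights=None):
        # prob_vectors: list of length-NUM_CLASSES probability arrays,
        #               one per feature/classifier pair.
        # weights: optional per-classifier weights; defaults to uniform.
        probs = np.stack(prob_vectors)          # shape (n_classifiers, NUM_CLASSES)
        if weights is None:
            weights = np.full(len(prob_vectors), 1.0 / len(prob_vectors))
        fused = weights @ probs                 # weighted sum of class probabilities
        return int(np.argmax(fused))            # index of the fused emotion label

    # Example: three hypothetical classifiers vote on one video clip.
    audio_svm  = np.array([0.10, 0.05, 0.05, 0.40, 0.20, 0.10, 0.05, 0.05])
    lbptop_svm = np.array([0.05, 0.05, 0.10, 0.50, 0.15, 0.05, 0.05, 0.05])
    dcnn       = np.array([0.05, 0.10, 0.05, 0.45, 0.15, 0.10, 0.05, 0.05])
    print(fuse_predictions([audio_svm, lbptop_svm, dcnn],
                           weights=np.array([0.3, 0.3, 0.4])))

Weighted averaging is only one of several decision-level fusion rules; majority voting or a learned fusion layer are common alternatives when per-classifier reliabilities differ across classes.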
Acknowledgements
This work was supported by the National Education Science Twelfth Five-Year Plan Key Issues program of the Ministry of Education (DCA140229).
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
Cite this paper
Sun, B., Xu, Q., He, J., Yu, L., Li, L., Wei, Q. (2016). Audio-Video Based Multimodal Emotion Recognition Using SVMs and Deep Learning. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_51
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3004-8
Online ISBN: 978-981-10-3005-5