Abstract
Facial Emotion Recognition (FER) is an important and challenging task in computer vision because of issues such as image quality, the correlation between images of the same expression, computational complexity, and the need for a large amount of data. This paper presents a novel approach to the FER task. We are motivated by the success of the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) in image classification in general and in facial emotion recognition in particular.
The Swin Transformer (ST) is a hierarchical transformer that uses shifted windows to compute its representation. Its advantages include limiting self-attention computation to local windows, which gives computational complexity that is linear in image size. This paper studies and compares ST and deep CNN architectures when they are combined through different merging layers. The proposed approach is tested on the FER2013 and CK+ data sets. Experimental results demonstrate the high performance of the Average Merging Layer (AML), and our method outperforms state-of-the-art methods on FER2013 and CK+.
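The abstract does not specify what the Average Merging Layer operates on, so the following is only a minimal sketch of one plausible reading: the two branches each produce a per-class probability vector, and the merge is their element-wise mean. The function names `softmax` and `average_merge`, the random logits, the batch size, and the choice of 7 classes (the number of labels in FER2013) are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores to probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def average_merge(branch_a, branch_b):
    """Element-wise mean of two equally shaped branch outputs (hypothetical AML)."""
    a = np.asarray(branch_a, dtype=float)
    b = np.asarray(branch_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return (a + b) / 2.0


# Stand-in logits for a batch of 2 images and 7 emotion classes.
# In a real pipeline these would come from the ST and CNN branches.
rng = np.random.default_rng(0)
swin_logits = rng.normal(size=(2, 7))
cnn_logits = rng.normal(size=(2, 7))

merged = average_merge(softmax(swin_logits), softmax(cnn_logits))
predicted = merged.argmax(axis=-1)  # one class index per image
```

Because each input row sums to 1, each merged row also sums to 1, so the result can still be read as a class distribution. Averaging intermediate feature vectors instead of probabilities would use the same `average_merge` function, provided both branches are projected to the same dimension first.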
Acknowledgments
This work was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (DDA), and the CNRST of Morocco (project 22).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Bousaid, R., El Hajji, M., Es-Saady, Y. (2022). Facial Expression Recognition Using a Hybrid ViT-CNN Aggregator. In: Fakir, M., Baslam, M., El Ayachi, R. (eds) Business Intelligence. CBI 2022. Lecture Notes in Business Information Processing, vol 449. Springer, Cham. https://doi.org/10.1007/978-3-031-06458-6_5
DOI: https://doi.org/10.1007/978-3-031-06458-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06457-9
Online ISBN: 978-3-031-06458-6
eBook Packages: Computer Science, Computer Science (R0)