Abstract
Facial Expression Recognition (FER) is a challenging task in computer vision, especially in the wild, where factors such as diverse head poses and occlusions can significantly degrade recognition performance. Recent RGB-D Face Recognition (FR) methods have shown that depth information is more robust to occlusion and pose variations, capturing finer 3D facial details and thereby improving performance. However, prevalent FER datasets and application scenarios typically lack depth information and offer only RGB images. This paper therefore introduces an RGB FER approach built on depth-aware feature perception and a dual-stream interactive transformer network. Real depth is not required during inference, so our method can effectively exploit perceived depth information even when only RGB data is available. Guided by real depth features extracted from the depth images of an RGB-D FR dataset, we design and pre-train an auxiliary encoder, the Depth-Aware Encoder (DAEncoder), to perceive and extract depth-aware expression features from RGB faces. We then propose a Dual-stream Interactive Transformer (DIT) that uses cross-attention to let the RGB and depth-aware features interact; the RGB stream additionally combines self-attention and cross-attention to fuse information for the final prediction. Experiments show that our method achieves promising performance on several FER datasets, including RAF-DB, AffectNet-7, and AffectNet-8.
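The dual-stream interaction described above (an RGB stream combining self-attention with cross-attention to a depth-aware stream produced by the pre-trained DAEncoder) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the DITBlock name, all dimensions and token shapes, the residual fusion by addition, and the commented pre-training objective are hypothetical.

import torch
import torch.nn as nn

class DITBlock(nn.Module):
    """Illustrative Dual-stream Interactive Transformer block.

    The RGB stream combines self-attention with cross-attention to the
    depth-aware stream; the depth-aware stream cross-attends to the RGB
    stream. Dimensions and residual fusion by addition are assumptions.
    """

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.rgb_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb, depth):
        # RGB stream: self-attention plus cross-attention to depth-aware tokens.
        sa, _ = self.rgb_self(rgb, rgb, rgb)
        ca, _ = self.rgb_cross(rgb, depth, depth)
        rgb = self.norm_rgb(rgb + sa + ca)
        # Depth-aware stream: queries the updated RGB tokens via cross-attention.
        dca, _ = self.depth_cross(depth, rgb, rgb)
        depth = self.norm_depth(depth + dca)
        return rgb, depth

# Usage with tokens from an RGB encoder and the pre-trained DAEncoder;
# the (batch, tokens, dim) shapes are assumptions.
rgb_tokens = torch.randn(2, 196, 512)
depth_tokens = torch.randn(2, 196, 512)
rgb_out, depth_out = DITBlock()(rgb_tokens, depth_tokens)

# The DAEncoder pre-training described above could use a feature-matching
# loss against real depth features, e.g. (hypothetical; the paper's exact
# objective is not reproduced here):
#   loss = F.mse_loss(daencoder(rgb_img), depth_encoder(depth_img).detach())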
Acknowledgement
This work was partly supported by the Sichuan University–Luzhou Municipal People's Government Strategic Cooperation Project under Grant No. 2021CDLZ-13, the National Natural Science Foundation of China under Grants No. 62006162 and 62176169, and the Sichuan Science and Technology Projects (2023ZHCG0007).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jiang, Y., Yang, X., Fu, K., Yang, H. (2025). Depth-Aware Dual-Stream Interactive Transformer Network for Facial Expression Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1