Abstract
Facial Expression Recognition (FER) is a challenging task in computer vision, especially in the wild, where factors such as diverse head poses and occlusions can significantly degrade recognition performance. Recent RGB-D Face Recognition (FR) methods have shown that depth information is more robust to occlusion and pose variations, capturing finer 3D facial details and thereby improving performance. However, prevalent FER datasets and application scenarios typically lack depth information and offer only RGB images. This paper therefore introduces an RGB FER approach built on depth-aware feature perception and a dual-stream interactive transformer network. Real depth is not required during inference, so our method can effectively exploit perceived depth information even when only RGB data is available. Guided by real depth features extracted from the depth images of an RGB-D FR dataset, we design and pre-train an auxiliary encoder, the Depth-Aware Encoder (DAEncoder), to perceive and extract depth-aware expression features from RGB faces. We then propose a Dual-stream Interactive Transformer (DIT) that uses cross-attention to let the RGB and depth-aware features interact; the RGB stream additionally combines self-attention and cross-attention to fuse information for the final prediction. Experiments show that our method achieves promising performance on several FER datasets, including RAF-DB, AffectNet-7, and AffectNet-8.
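The dual-stream interaction described above (an RGB stream combining self-attention with cross-attention to a depth-aware stream produced by the pre-trained DAEncoder) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the DITBlock name, all dimensions and token shapes, the residual fusion by addition, and the commented pre-training objective are hypothetical.

import torch
import torch.nn as nn

class DITBlock(nn.Module):
    """Illustrative Dual-stream Interactive Transformer block.

    The RGB stream combines self-attention with cross-attention to the
    depth-aware stream; the depth-aware stream cross-attends to the RGB
    stream. Dimensions and residual fusion by addition are assumptions.
    """

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.rgb_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb, depth):
        # RGB stream: self-attention plus cross-attention to depth-aware tokens.
        sa, _ = self.rgb_self(rgb, rgb, rgb)
        ca, _ = self.rgb_cross(rgb, depth, depth)
        rgb = self.norm_rgb(rgb + sa + ca)
        # Depth-aware stream: queries the updated RGB tokens via cross-attention.
        dca, _ = self.depth_cross(depth, rgb, rgb)
        depth = self.norm_depth(depth + dca)
        return rgb, depth

# Usage with tokens from an RGB encoder and the pre-trained DAEncoder;
# the (batch, tokens, dim) shapes are assumptions.
rgb_tokens = torch.randn(2, 196, 512)
depth_tokens = torch.randn(2, 196, 512)
rgb_out, depth_out = DITBlock()(rgb_tokens, depth_tokens)

# The DAEncoder pre-training described above could use a feature-matching
# loss against real depth features, e.g. (hypothetical; the paper's exact
# objective is not reproduced here):
#   loss = F.mse_loss(daencoder(rgb_img), depth_encoder(depth_img).detach())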
Acknowledgement
This work was partly supported by the Sichuan University–Luzhou Municipal People's Government Strategic Cooperation Project under Grant No. 2021CDLZ-13, the National Natural Science Foundation of China under Grants No. 62006162 and 62176169, and the Sichuan Science and Technology Projects (2023ZHCG0007).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jiang, Y., Yang, X., Fu, K., Yang, H. (2025). Depth-Aware Dual-Stream Interactive Transformer Network for Facial Expression Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1