Depth-Aware Dual-Stream Interactive Transformer Network for Facial Expression Recognition

  • Conference paper
  • First Online:
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15041)


Abstract

Facial Expression Recognition (FER) is a challenging task in computer vision, especially in the wild, where factors such as diverse head poses and occlusions can significantly degrade recognition performance. Recent developments in RGB-D Face Recognition (FR) have shown that depth information handles occlusion and pose variations better than RGB alone, capturing finer 3D facial details and thereby improving performance. Nevertheless, prevalent FER datasets and application scenarios typically lack depth information and offer only RGB images. This paper therefore introduces an RGB FER approach grounded in depth-aware feature perception and a dual-stream interactive transformer network. Real depth is not required during inference, so even when only RGB data is available, our method can effectively leverage perceived depth information for recognition. Guided by real depth features extracted from the depth images of an RGB-D FR dataset, we design and pre-train an auxiliary encoder, the Depth-Aware Encoder (DAEncoder), to perceive and extract depth-aware expression features from RGB faces. We then propose a Dual-stream Interactive Transformer (DIT) that uses cross-attention to exchange information between the RGB and depth-aware feature streams. Additionally, the RGB stream integrates self-attention and cross-attention to fuse information for the final expression prediction. Experimental results demonstrate the promising performance of our method across several FER datasets, including RAF-DB, AffectNet 7, and AffectNet 8.
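
The pipeline described above lends itself to a short sketch. The following PyTorch code is only an illustrative reconstruction of the abstract, not the authors' implementation: the embedding size, head count, pooling strategy, and the MSE feature-matching loss standing in for depth guidance are all assumptions, and the DAEncoder itself is abstracted to the token features it would produce.

    # Illustrative sketch of the described pipeline, NOT the authors' code.
    # Assumed hyperparameters: 256-d tokens, 4 attention heads, mean-pooled fusion.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttentionBlock(nn.Module):
        """Queries from one stream attend to keys/values from the other."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, q_tokens, kv_tokens):
            kv = self.norm_kv(kv_tokens)
            out, _ = self.attn(self.norm_q(q_tokens), kv, kv)
            return q_tokens + out  # residual connection

    class DualStreamInteractiveTransformer(nn.Module):
        """Dual-stream interaction: the RGB stream combines self-attention with
        cross-attention to the depth-aware stream, and the depth-aware stream
        cross-attends back to the RGB stream before the two are fused."""
        def __init__(self, dim=256, heads=4, num_classes=7):
            super().__init__()
            self.rgb_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.rgb_cross = CrossAttentionBlock(dim, heads)
            self.depth_cross = CrossAttentionBlock(dim, heads)
            self.classifier = nn.Linear(2 * dim, num_classes)

        def forward(self, rgb_tokens, depth_aware_tokens):
            rgb = self.rgb_self(rgb_tokens)                  # self-attention (RGB)
            rgb = self.rgb_cross(rgb, depth_aware_tokens)    # RGB queries depth-aware
            dep = self.depth_cross(depth_aware_tokens, rgb)  # depth-aware queries RGB
            fused = torch.cat([rgb.mean(dim=1), dep.mean(dim=1)], dim=-1)
            return self.classifier(fused)

    def depth_guidance_loss(da_features, real_depth_features):
        """Pre-training signal for the DAEncoder (assumed to be MSE): align the
        features it extracts from RGB with features from a frozen encoder that
        sees real depth maps."""
        return F.mse_loss(da_features, real_depth_features.detach())

    # Shape check with random token sequences (batch of 2, 49 tokens, 256-d):
    dit = DualStreamInteractiveTransformer()
    logits = dit(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
    print(logits.shape)  # torch.Size([2, 7])

Under these assumptions, a frozen encoder trained on real depth maps from an RGB-D FR dataset would supply real_depth_features during DAEncoder pre-training only; at inference the DAEncoder alone perceives depth-aware features from RGB, consistent with the RGB-only deployment setting the abstract describes.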



Acknowledgement

This work was partly supported by the Sichuan University–Luzhou Municipal People's Government Strategic Cooperation Project under Grant No. 2021CDLZ-13, the National Natural Science Foundation of China under Grants No. 62006162 and 62176169, and the Sichuan Science and Technology Projects (2023ZHCG0007).

Author information


Corresponding author

Correspondence to Xiao Yang.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Jiang, Y., Yang, X., Fu, K., Yang, H. (2025). Depth-Aware Dual-Stream Interactive Transformer Network for Facial Expression Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_38


  • DOI: https://doi.org/10.1007/978-981-97-8795-1_38

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8794-4

  • Online ISBN: 978-981-97-8795-1

  • eBook Packages: Computer Science, Computer Science (R0)
