
LMS-VDR: Integrating Landmarks into Multi-scale Hybrid Net for Video-Based Depression Recognition

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15040)


Abstract

Recent advances in deep learning have significantly improved facial video-based depression recognition, yet existing models face critical limitations. They struggle with spatially localized facial feature extraction, relying on either a single convolution or a convolution with a simplistic attention mechanism, which leads to inadequate recognition of depression-relevant patterns. Moreover, increasing model depth produces ambiguous, abstract features while losing crucial dynamic facial details. To address these issues, we propose LMS-VDR, which combines a Multi-Scale Mixed Attention Module (MSMAM) with landmark-based prior knowledge. MSMAM merges channel and spatial attention, forming a mixed attention block through a vector product, and introduces a dense connection mechanism that links the features of each scale directly to the final output, enabling diverse multi-scale feature extraction. Facial landmarks, first linearly transformed and then combined with the temporal feature sequence, enhance dynamic temporal feature extraction via our proposed Cross Multi-Head Self-Attention (CMHSA) block. Experiments on the AVEC 2013 and AVEC 2014 datasets validate our method's efficacy, achieving MAE/RMSE of 6.04/7.68 and 5.98/7.59, respectively. The proposed method offers a promising direction for automatic clinical depression assessment.
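The abstract names two mechanisms: a mixed attention block built from the vector product of channel and spatial attention, and a cross-attention block that fuses linearly embedded landmarks with the temporal feature sequence. The paper's code is not reproduced on this page; the following is a minimal PyTorch sketch of the mixed-attention idea only, assuming an SE-style channel branch and a CBAM-style spatial branch and omitting the dense connections. All names, dimensions, and design details are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MixedAttention(nn.Module):
        """Hypothetical sketch: channel and spatial attention combined by a
        broadcasted (outer) product into one C x H x W mask, as the abstract
        describes for MSMAM. Branch designs are assumptions (SE / CBAM style)."""

        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            # Channel branch: global average pooling + bottleneck MLP (SE-style).
            self.channel_mlp = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # Spatial branch: 7x7 conv over channel mean/max maps (CBAM-style).
            self.spatial_conv = nn.Sequential(
                nn.Conv2d(2, 1, kernel_size=7, padding=3),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            ca = self.channel_mlp(x)                                  # (B, C, 1, 1)
            pooled = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
            sa = self.spatial_conv(pooled)                            # (B, 1, H, W)
            mixed = ca * sa                   # vector product -> (B, C, H, W) mask
            return x * mixed

For the landmark branch, the abstract says landmarks are linearly transformed and then fused with the temporal feature sequence by the CMHSA block. One plausible reading, sketched below with standard multi-head attention, is that landmark embeddings attend over the frame-level features; the query/key pairing, the 68-point landmark assumption, and all dimensions are guesses for illustration.

    class CrossMHSA(nn.Module):
        """Hypothetical fusion of per-frame landmarks with backbone features."""

        def __init__(self, feat_dim: int = 512, n_landmarks: int = 68, heads: int = 8):
            super().__init__()
            self.landmark_proj = nn.Linear(n_landmarks * 2, feat_dim)  # (x, y) pairs
            self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(feat_dim)

        def forward(self, frame_feats: torch.Tensor, landmarks: torch.Tensor) -> torch.Tensor:
            # frame_feats: (B, T, feat_dim) temporal feature sequence
            # landmarks:   (B, T, n_landmarks, 2) per-frame coordinates
            lm = self.landmark_proj(landmarks.flatten(2))              # (B, T, feat_dim)
            fused, _ = self.attn(query=lm, key=frame_feats, value=frame_feats)
            return self.norm(fused + frame_feats)                      # residual fusion

Both modules run as written on dummy tensors (e.g. MixedAttention(64)(torch.randn(2, 64, 28, 28))); they are sketches of the named mechanisms, not the published network.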


References

  1. Cai, C., Niu, M., Liu, B., Tao, J., Liu, X.: TDCA-Net: time-domain channel attention network for depression detection. In: Interspeech, pp. 2511–2515 (2021)

  2. Casado, C.Á., Cañellas, M.L., López, M.B.: Depression recognition using remote photoplethysmography from facial videos. IEEE Trans. Affect. Comput. (2023)

  3. Hammen, C.: Stress and depression. Annu. Rev. Clin. Psychol. 1, 293–319 (2005)

  4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  5. He, L., Guo, C., Tiwari, P., Pandey, H.M., Dang, W.: Intelligent system for depression scale estimation with facial expressions and case study in industrial intelligence. Int. J. Intell. Syst. 37(12), 10140–10156 (2022)

  6. He, L., Jiang, D., Sahli, H.: Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Trans. Multimedia 21(6), 1476–1486 (2018)

  7. He, L., Tiwari, P., Lv, C., Wu, W., Guo, L.: Reducing noisy annotations for depression estimation from facial images. Neural Netw. 153, 120–129 (2022)

  8. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  9. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  10. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)

  11. Liu, Z., Yuan, X., Li, Y., Shangguan, Z., Zhou, L., Hu, B.: PRA-Net: part-and-relation attention network for depression recognition from facial expression. Comput. Biol. Med. 157, 106589 (2023)

  12. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015)

  13. de Melo, W.C., Granger, E., Lopez, M.B.: MDN: a deep maximization-differentiation network for spatio-temporal depression detection. IEEE Trans. Affect. Comput. 14(1), 578–590 (2021)

  14. Niu, M., He, L., Li, Y., Liu, B.: Depressioner: facial dynamic representation for automatic depression level prediction. Expert Syst. Appl. 204, 117512 (2022)

  15. Niu, M., Tao, J., Liu, B., Huang, J., Lian, Z.: Multimodal spatiotemporal representation for automatic depression level detection. IEEE Trans. Affect. Comput. 14(1), 294–307 (2020)

  16. Niu, M., Zhao, Z., Tao, J., Li, Y., Schuller, B.W.: Dual attention and element recalibration networks for automatic depression level prediction. IEEE Trans. Affect. Comput. (2022)

  17. Pan, Y., Shang, Y., Liu, T., Shao, Z., Guo, G., Ding, H., Hu, Q.: Spatial-temporal attention network for depression recognition from facial videos. Expert Syst. Appl. 237, 121410 (2024)

  18. Pan, Y., Shang, Y., Shao, Z., Liu, T., Guo, G., Ding, H.: Integrating deep facial priors into landmarks for privacy preserving multimodal depression recognition. IEEE Trans. Affect. Comput. (2023)

  19. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)

  20. Shang, Y., Pan, Y., Jiang, X., Shao, Z., Guo, G., Liu, T., Ding, H.: LQGDNet: a local quaternion and global deep network for facial depression recognition. IEEE Trans. Affect. Comput. 14(3), 2557–2563 (2021)

  21. Song, S., Jaiswal, S., Shen, L., Valstar, M.: Spectral representation of behaviour primitives for depression analysis. IEEE Trans. Affect. Comput. 13(2), 829–844 (2020)

  22. Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 35, 10078–10093 (2022)

  23. Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., Cowie, R., Pantic, M.: AVEC 2014: 3D dimensional affect and depression recognition challenge. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10 (2014)

  24. Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., Schnieder, S., Cowie, R., Pantic, M.: AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, pp. 3–10 (2013)

  25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  26. Wang, R., Guo, J., Wang, J., He, L., Yang, Y.: A multi-frame rate network with attention mechanism for depression severity estimation. In: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2679–2686. IEEE (2023)

  27. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)

  28. Zhang, S., Zhang, X., Zhao, X., Fang, J., Niu, M., Zhao, Z., Yu, J., Tian, Q.: MTDAN: a lightweight multi-scale temporal difference attention networks for automated video depression detection. IEEE Trans. Affect. Comput. (2023)

  29. Zhao, Z., Liu, Q.: Former-DFER: dynamic facial expression recognition transformer. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1553–1561 (2021)

  30. Zhou, X., Jin, K., Shang, Y., Guo, G.: Visually interpretable representation learning for depression recognition from facial images. IEEE Trans. Affect. Comput. 11(3), 542–552 (2018)

  31. Zhu, Y., Shang, Y., Shao, Z., Guo, G.: Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Trans. Affect. Comput. 9(4), 578–584 (2017)

Acknowledgement

This study received support from the National Natural Science Foundation of China (Grant No. 61876112) and the Beijing Natural Science Foundation (Grant No. 4242034).

Author information


Corresponding author

Correspondence to Yuanyuan Shang.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Yang, M. et al. (2025). LMS-VDR: Integrating Landmarks into Multi-scale Hybrid Net for Video-Based Depression Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15040. Springer, Singapore. https://doi.org/10.1007/978-981-97-8792-0_21


  • DOI: https://doi.org/10.1007/978-981-97-8792-0_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8791-3

  • Online ISBN: 978-981-97-8792-0

  • eBook Packages: Computer Science, Computer Science (R0)
