
LMS-VDR: Integrating Landmarks into Multi-scale Hybrid Net for Video-Based Depression Recognition

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15040)


Abstract

Recent advances in deep learning have significantly improved facial video-based depression recognition, yet existing models face critical limitations. They struggle with spatially localized facial feature extraction, relying on either a single convolution or a convolution with a simplistic attention mechanism, which leads to inadequate recognition of depression-relevant patterns. Moreover, increasing model depth produces ambiguous, abstract features while losing crucial dynamic facial details. To address these issues, we propose LMS-VDR, which combines a Multi-Scale Mixed Attention Module (MSMAM) with landmark-based prior knowledge. MSMAM merges channel and spatial attention, forming a mixed attention block through a vector product, and introduces a dense connection mechanism that links the features of each scale directly to the final output, enabling diverse multi-scale feature extraction. Facial landmarks, first linearly transformed and then combined with the temporal feature sequence, enhance dynamic temporal feature extraction via our proposed Cross Multi-Head Self-Attention (CMHSA) block. Experiments on the AVEC 2013 and AVEC 2014 datasets validate our method's efficacy, achieving MAE/RMSE of 6.04/7.68 and 5.98/7.59, respectively. The proposed method offers a promising direction for automatic clinical depression assessment.
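The abstract names two mechanisms: a mixed attention block built from the vector product of channel and spatial attention, and a cross-attention block that fuses linearly embedded landmarks with the temporal feature sequence. The paper's code is not reproduced on this page; the following is a minimal PyTorch sketch of the mixed-attention idea only, assuming an SE-style channel branch and a CBAM-style spatial branch and omitting the dense connections. All names, dimensions, and design details are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MixedAttention(nn.Module):
        """Hypothetical sketch: channel and spatial attention combined by a
        broadcasted (outer) product into one C x H x W mask, as the abstract
        describes for MSMAM. Branch designs are assumptions (SE / CBAM style)."""

        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            # Channel branch: global average pooling + bottleneck MLP (SE-style).
            self.channel_mlp = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # Spatial branch: 7x7 conv over channel mean/max maps (CBAM-style).
            self.spatial_conv = nn.Sequential(
                nn.Conv2d(2, 1, kernel_size=7, padding=3),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            ca = self.channel_mlp(x)                                  # (B, C, 1, 1)
            pooled = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
            sa = self.spatial_conv(pooled)                            # (B, 1, H, W)
            mixed = ca * sa                   # vector product -> (B, C, H, W) mask
            return x * mixed

For the landmark branch, the abstract says landmarks are linearly transformed and then fused with the temporal feature sequence by the CMHSA block. One plausible reading, sketched below with standard multi-head attention, is that landmark embeddings attend over the frame-level features; the query/key pairing, the 68-point landmark assumption, and all dimensions are guesses for illustration.

    class CrossMHSA(nn.Module):
        """Hypothetical fusion of per-frame landmarks with backbone features."""

        def __init__(self, feat_dim: int = 512, n_landmarks: int = 68, heads: int = 8):
            super().__init__()
            self.landmark_proj = nn.Linear(n_landmarks * 2, feat_dim)  # (x, y) pairs
            self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(feat_dim)

        def forward(self, frame_feats: torch.Tensor, landmarks: torch.Tensor) -> torch.Tensor:
            # frame_feats: (B, T, feat_dim) temporal feature sequence
            # landmarks:   (B, T, n_landmarks, 2) per-frame coordinates
            lm = self.landmark_proj(landmarks.flatten(2))              # (B, T, feat_dim)
            fused, _ = self.attn(query=lm, key=frame_feats, value=frame_feats)
            return self.norm(fused + frame_feats)                      # residual fusion

Both modules run as written on dummy tensors (e.g. MixedAttention(64)(torch.randn(2, 64, 28, 28))); they are sketches of the named mechanisms, not the published network.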


References

  1. Cai, C., Niu, M., Liu, B., Tao, J., Liu, X.: TDCA-Net: time-domain channel attention network for depression detection. In: Interspeech, pp. 2511–2515 (2021)

  2. Casado, C.Á., Cañellas, M.L., López, M.B.: Depression recognition using remote photoplethysmography from facial videos. IEEE Trans. Affect. Comput. (2023)

  3. Hammen, C.: Stress and depression. Annu. Rev. Clin. Psychol. 1, 293–319 (2005)

  4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  5. He, L., Guo, C., Tiwari, P., Pandey, H.M., Dang, W.: Intelligent system for depression scale estimation with facial expressions and case study in industrial intelligence. Int. J. Intell. Syst. 37(12), 10140–10156 (2022)

  6. He, L., Jiang, D., Sahli, H.: Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Trans. Multimedia 21(6), 1476–1486 (2018)

  7. He, L., Tiwari, P., Lv, C., Wu, W., Guo, L.: Reducing noisy annotations for depression estimation from facial images. Neural Netw. 153, 120–129 (2022)

  8. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  9. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  10. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)

  11. Liu, Z., Yuan, X., Li, Y., Shangguan, Z., Zhou, L., Hu, B.: PRA-Net: part-and-relation attention network for depression recognition from facial expression. Comput. Biol. Med. 157, 106589 (2023)

  12. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015)

  13. de Melo, W.C., Granger, E., Lopez, M.B.: MDN: a deep maximization-differentiation network for spatio-temporal depression detection. IEEE Trans. Affect. Comput. 14(1), 578–590 (2021)

  14. Niu, M., He, L., Li, Y., Liu, B.: Depressioner: facial dynamic representation for automatic depression level prediction. Expert Syst. Appl. 204, 117512 (2022)

  15. Niu, M., Tao, J., Liu, B., Huang, J., Lian, Z.: Multimodal spatiotemporal representation for automatic depression level detection. IEEE Trans. Affect. Comput. 14(1), 294–307 (2020)

  16. Niu, M., Zhao, Z., Tao, J., Li, Y., Schuller, B.W.: Dual attention and element recalibration networks for automatic depression level prediction. IEEE Trans. Affect. Comput. (2022)

  17. Pan, Y., Shang, Y., Liu, T., Shao, Z., Guo, G., Ding, H., Hu, Q.: Spatial-temporal attention network for depression recognition from facial videos. Expert Syst. Appl. 237, 121410 (2024)

  18. Pan, Y., Shang, Y., Shao, Z., Liu, T., Guo, G., Ding, H.: Integrating deep facial priors into landmarks for privacy preserving multimodal depression recognition. IEEE Trans. Affect. Comput. (2023)

  19. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)

  20. Shang, Y., Pan, Y., Jiang, X., Shao, Z., Guo, G., Liu, T., Ding, H.: LQGDNet: a local quaternion and global deep network for facial depression recognition. IEEE Trans. Affect. Comput. 14(3), 2557–2563 (2021)

  21. Song, S., Jaiswal, S., Shen, L., Valstar, M.: Spectral representation of behaviour primitives for depression analysis. IEEE Trans. Affect. Comput. 13(2), 829–844 (2020)

  22. Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 35, 10078–10093 (2022)

  23. Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., Cowie, R., Pantic, M.: AVEC 2014: 3D dimensional affect and depression recognition challenge. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10 (2014)

  24. Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., Schnieder, S., Cowie, R., Pantic, M.: AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, pp. 3–10 (2013)

  25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  26. Wang, R., Guo, J., Wang, J., He, L., Yang, Y.: A multi-frame rate network with attention mechanism for depression severity estimation. In: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2679–2686. IEEE (2023)

  27. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)

  28. Zhang, S., Zhang, X., Zhao, X., Fang, J., Niu, M., Zhao, Z., Yu, J., Tian, Q.: MTDAN: a lightweight multi-scale temporal difference attention networks for automated video depression detection. IEEE Trans. Affect. Comput. (2023)

  29. Zhao, Z., Liu, Q.: Former-DFER: dynamic facial expression recognition transformer. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1553–1561 (2021)

  30. Zhou, X., Jin, K., Shang, Y., Guo, G.: Visually interpretable representation learning for depression recognition from facial images. IEEE Trans. Affect. Comput. 11(3), 542–552 (2018)

  31. Zhu, Y., Shang, Y., Shao, Z., Guo, G.: Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Trans. Affect. Comput. 9(4), 578–584 (2017)

Acknowledgement

This study received support from the National Natural Science Foundation of China (Grant No. 61876112) and the Beijing Natural Science Foundation (Grant No. 4242034).

Author information


Corresponding author

Correspondence to Yuanyuan Shang.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Yang, M. et al. (2025). LMS-VDR: Integrating Landmarks into Multi-scale Hybrid Net for Video-Based Depression Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15040. Springer, Singapore. https://doi.org/10.1007/978-981-97-8792-0_21


  • DOI: https://doi.org/10.1007/978-981-97-8792-0_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8791-3

  • Online ISBN: 978-981-97-8792-0

  • eBook Packages: Computer Science, Computer Science (R0)
