
Behavior Capture Based Explainable Engagement Recognition

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15040)

Abstract

Engagement recognition aims to identify an individual’s level of participation in a given activity and has broad applications in fields such as education, healthcare, and driving. However, the performance of current methods is often compromised by excessive data and by distractions in the scene. We propose the Behavior Capture based TRansformer (BCTR), a Transformer-based video analysis approach that exploits frame-level and video-level spatiotemporal details to improve engagement recognition. BCTR features dual branches that detect static and dynamic signs of disengagement, such as eye closure and a lowered head, through refined class tokens. This design allows the model to independently identify critical disengagement indicators, mirroring human observational techniques. As a result, BCTR not only improves accuracy but also makes engagement assessments more interpretable by explicitly recognizing these signs of disengagement. Extensive experiments demonstrate that BCTR achieves superior performance, particularly in challenging, distraction-rich environments.
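
Although the full method lies behind the access wall, the abstract already outlines the architecture: two Transformer branches whose class tokens are refined into detectors for individual disengagement signs. The snippet below is a minimal PyTorch sketch of how such a dual-branch, class-token design could look. It is an illustrative assumption, not the authors' BCTR implementation: every name here (DualBranchEngagementModel, sign_head, level_head, the feature shapes) is hypothetical.

```python
# Minimal, hypothetical sketch of a dual-branch, class-token Transformer
# for engagement recognition, loosely following the abstract's description.
# NOT the authors' BCTR code: all names, dimensions, and heads are assumptions.
import torch
import torch.nn as nn

class DualBranchEngagementModel(nn.Module):
    def __init__(self, dim=256, heads=8, depth=4, num_signs=4, num_levels=4):
        super().__init__()
        # One branch attends over per-frame (static) features, the other over
        # temporal (dynamic) features; each branch prepends learnable class
        # tokens, one per disengagement sign (e.g. eye closure, head down).
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.static_branch = nn.TransformerEncoder(make_layer(), depth)
        self.dynamic_branch = nn.TransformerEncoder(make_layer(), depth)
        self.static_tokens = nn.Parameter(torch.randn(1, num_signs, dim))
        self.dynamic_tokens = nn.Parameter(torch.randn(1, num_signs, dim))
        self.sign_head = nn.Linear(dim, 1)                            # per-token sign logit
        self.level_head = nn.Linear(2 * num_signs * dim, num_levels)  # engagement level

    def forward(self, frame_feats, motion_feats):
        # frame_feats, motion_feats: (B, T, dim) embeddings from some video backbone
        b = frame_feats.size(0)
        n = self.static_tokens.size(1)
        s = self.static_branch(
            torch.cat([self.static_tokens.expand(b, -1, -1), frame_feats], dim=1))
        d = self.dynamic_branch(
            torch.cat([self.dynamic_tokens.expand(b, -1, -1), motion_feats], dim=1))
        s_tok, d_tok = s[:, :n], d[:, :n]          # refined class tokens, (B, n, dim)
        tokens = torch.cat([s_tok, d_tok], dim=1)  # (B, 2n, dim)
        signs = self.sign_head(tokens).squeeze(-1)  # interpretable sign logits, (B, 2n)
        level = self.level_head(tokens.flatten(1))  # engagement prediction, (B, num_levels)
        return level, signs

# Example: two 16-frame clips with 256-d features per frame.
model = DualBranchEngagementModel()
level, signs = model(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
```

Reading the per-sign logits alongside the final prediction is what makes this style of model interpretable: each refined class token can be inspected as evidence for one concrete disengagement behavior, much as a human observer would point to closed eyes or a lowered head.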

Acknowledgement

This work was supported by the Ningbo Key Research and Development Program (Grant No. 2023Z057) and the Fundamental Research Funds for the Central Universities (226-2024-00058).

Author information

Corresponding author

Correspondence to Lechao Cheng.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Bei, Y. et al. (2025). Behavior Capture Based Explainable Engagement Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15040. Springer, Singapore. https://doi.org/10.1007/978-981-97-8792-0_17

  • DOI: https://doi.org/10.1007/978-981-97-8792-0_17

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8791-3

  • Online ISBN: 978-981-97-8792-0

  • eBook Packages: Computer Science, Computer Science (R0)
