Abstract
Engagement recognition aims to identify an individual's level of participation in a particular activity and has broad applications in fields such as education, healthcare, and driving. However, the performance of current methods is often degraded by excessive data and by distractions. Our Behavior Capture based TRansformer (BCTR) introduces a Transformer-based video analysis approach that exploits frame-level and video-level spatiotemporal details to improve engagement recognition. BCTR features dual branches that detect static and dynamic signs of disengagement, such as eye closure and a lowered head, through refined class tokens. This design allows the model to independently identify critical disengagement indicators, mirroring human observational techniques. As a result, BCTR not only boosts precision but also enriches the interpretability of engagement assessments by making these signs of disengagement explicit. Extensive experimental results demonstrate that BCTR achieves superior performance, particularly in challenging, distraction-rich environments.
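To make the dual-branch, class-token idea concrete, below is a minimal sketch in PyTorch: one branch refines a learnable class token over frame-level tokens (static cues such as eye closure), the other over video-level tokens (dynamic cues such as head lowering), and the two refined tokens are fused for classification. All module names, dimensions, the four-level output, and the concatenation fusion are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a dual-branch, class-token transformer in the spirit
# of BCTR, assuming PyTorch. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class ClassTokenBranch(nn.Module):
    """Transformer encoder whose learnable class token summarizes a token
    sequence (e.g., one family of disengagement cues)."""

    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):  # tokens: (B, N, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return out[:, 0]  # refined class token, shape (B, dim)


class DualBranchEngagementNet(nn.Module):
    """Static branch reads frame-level tokens (per-frame appearance);
    dynamic branch reads video-level tokens (motion across frames)."""

    def __init__(self, dim=256, num_classes=4):  # 4 levels is an assumption
        super().__init__()
        self.static_branch = ClassTokenBranch(dim)
        self.dynamic_branch = ClassTokenBranch(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, frame_tokens, video_tokens):
        s = self.static_branch(frame_tokens)   # e.g., eye closure cues
        d = self.dynamic_branch(video_tokens)  # e.g., head-lowering cues
        return self.head(torch.cat([s, d], dim=-1))


if __name__ == "__main__":
    model = DualBranchEngagementNet()
    frames = torch.randn(2, 16, 256)  # 16 frame-level tokens per clip
    motion = torch.randn(2, 16, 256)  # 16 video-level tokens per clip
    print(model(frames, motion).shape)  # torch.Size([2, 4])
```

Because each branch exposes its own refined class token, the attention weights attached to those tokens can be inspected per cue, which is one plausible route to the interpretability the abstract describes.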
Acknowledgement
This work is supported by the Ningbo Key Research and Development Program (Grant No. 2023Z057), and the Fundamental Research Funds for the Central Universities (226-2024-00058).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bei, Y. et al. (2025). Behavior Capture Based Explainable Engagement Recognition. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15040. Springer, Singapore. https://doi.org/10.1007/978-981-97-8792-0_17
DOI: https://doi.org/10.1007/978-981-97-8792-0_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8791-3
Online ISBN: 978-981-97-8792-0
eBook Packages: Computer Science (R0)