Abstract
With continuing advances in the Internet of Things and deep learning, video action recognition is steadily finding its way into daily and industrial applications. Spatiotemporal and motion patterns are two crucial and complementary types of information for action recognition, yet effectively modelling both in videos remains challenging. In this paper, we propose a spatiotemporal and motion information extraction (STME) network that extracts comprehensive spatiotemporal and motion information from videos for action recognition. The STME network comprises three efficient modules: a spatiotemporal extraction (STE) module, a short-term motion extraction (SME) module and a long-term motion extraction (LME) module. The SME and LME modules model short-term and long-term motion representations, respectively, while the STE module captures comprehensive spatiotemporal information that complements the video representation. Experimental results show that the STME network significantly outperforms existing methods on several benchmark datasets. Our code is available at https://github.com/STME-Net/STME.
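The repository linked above contains the authors' implementation. Purely as an illustration of the general idea behind a short-term motion module of this kind, the following PyTorch sketch differences channel-reduced features of adjacent frames and uses the result to gate the input features. Every detail here (the class name ShortTermMotionSketch, the squeeze/transform/expand layout, the reduction ratio) is an assumption for exposition, not the actual STME design.

```python
# A minimal, illustrative sketch of short-term motion extraction via
# adjacent-frame feature differences. Hypothetical names and layout;
# see the authors' repository for the real STME modules.
import torch
import torch.nn as nn

class ShortTermMotionSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        reduced = max(channels // reduction, 1)
        # Reduce channels, transform per-channel, then restore channels.
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3,
                                   padding=1, groups=reduced, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (N * T, C, H, W) frame features stacked along the batch axis.
        nt, c, h, w = x.shape
        n = nt // num_frames
        feat = self.squeeze(x).view(n, num_frames, -1, h, w)
        # Difference between transformed features of frame t+1 and frame t
        # approximates short-term motion; pad the last step with zeros.
        nxt = self.transform(feat[:, 1:].reshape(-1, feat.size(2), h, w))
        nxt = nxt.view(n, num_frames - 1, -1, h, w)
        diff = nxt - feat[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        attn = torch.sigmoid(self.expand(diff.reshape(nt, -1, h, w)))
        return x * attn  # motion-gated features

# Usage: features from a 2D backbone over T = 8 frames.
x = torch.randn(2 * 8, 64, 14, 14)            # 2 clips, 8 frames each
out = ShortTermMotionSketch(64)(x, num_frames=8)
print(out.shape)                               # torch.Size([16, 64, 14, 14])
```

Differencing in a reduced channel space keeps the extra cost small relative to the backbone, which is one reason this style of motion modelling is attractive for 2D-CNN action recognition.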




Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 62072127, 62002076, 61906049), the Natural Science Foundation of Guangdong Province (Nos. 2023A1515011774, 2020A1515010423), Project 6142111180404 supported by CNKLSTISS, the Science and Technology Program of Guangzhou, China (No. 202002030131), the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund (No. 2019A1515110213), the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC202101), and the Scientific Research Project of Guangzhou University (No. RP2022003).
Funding
The study was supported in part by the National Natural Science Foundation of China under grants 62176027, 62002076 and 61906049; in part by the Natural Science Foundation of Guangdong Province under grants 2023A1515011774 and 2020A1515010423; in part by the Science and Technology Program of Guangzhou, China under grant 202002030131; in part by the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund under grant 2019A1515110213; in part by the General Program of the Natural Science Foundation of Chongqing under grant cstc2020jcyjmsxmX0790; in part by the Human Resources and Social Security Bureau Project of Chongqing under grant cx2020073; in part by the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) under grant MJUKF-IPIC202101; in part by the Scientific Research Project of Guangzhou University under grant RP2022003; and in part by Project 6142111180404 supported by CNKLSTISS.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, W., Wang, X., Zhou, M. et al. A spatiotemporal and motion information extraction network for action recognition. Wireless Netw 30, 5389–5405 (2024). https://doi.org/10.1007/s11276-023-03267-y