Abstract
With continuing advances in the Internet of Things and deep learning, video action recognition is steadily finding its way into daily and industrial applications. Spatiotemporal and motion patterns are two crucial and complementary types of information for action recognition, yet effectively modelling both in videos remains challenging. In this paper, we propose a spatiotemporal and motion information extraction (STME) network that extracts comprehensive spatiotemporal and motion information from videos for action recognition. The STME network comprises three efficient modules: a spatiotemporal extraction (STE) module, a short-term motion extraction (SME) module and a long-term motion extraction (LME) module. The SME and LME modules model short-term and long-term motion representations, respectively, while the STE module captures comprehensive spatiotemporal information that complements the video representation. Experimental results show that the STME network significantly outperforms existing methods on several benchmark datasets. Our code is available at https://github.com/STME-Net/STME.
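The repository linked above contains the authors' implementation. Purely as an illustration of the general idea behind a short-term motion module of this kind, the following PyTorch sketch differences channel-reduced features of adjacent frames and uses the result to gate the input features. Every detail here (the class name ShortTermMotionSketch, the squeeze/transform/expand layout, the reduction ratio) is an assumption for exposition, not the actual STME design.

```python
# A minimal, illustrative sketch of short-term motion extraction via
# adjacent-frame feature differences. Hypothetical names and layout;
# see the authors' repository for the real STME modules.
import torch
import torch.nn as nn

class ShortTermMotionSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        reduced = max(channels // reduction, 1)
        # Reduce channels, transform per-channel, then restore channels.
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3,
                                   padding=1, groups=reduced, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (N * T, C, H, W) frame features stacked along the batch axis.
        nt, c, h, w = x.shape
        n = nt // num_frames
        feat = self.squeeze(x).view(n, num_frames, -1, h, w)
        # Difference between transformed features of frame t+1 and frame t
        # approximates short-term motion; pad the last step with zeros.
        nxt = self.transform(feat[:, 1:].reshape(-1, feat.size(2), h, w))
        nxt = nxt.view(n, num_frames - 1, -1, h, w)
        diff = nxt - feat[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        attn = torch.sigmoid(self.expand(diff.reshape(nt, -1, h, w)))
        return x * attn  # motion-gated features

# Usage: features from a 2D backbone over T = 8 frames.
x = torch.randn(2 * 8, 64, 14, 14)            # 2 clips, 8 frames each
out = ShortTermMotionSketch(64)(x, num_frames=8)
print(out.shape)                               # torch.Size([16, 64, 14, 14])
```

Differencing in a reduced channel space keeps the extra cost small relative to the backbone, which is one reason this style of motion modelling is attractive for 2D-CNN action recognition.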




Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 62072127, 62002076, 61906049), the Natural Science Foundation of Guangdong Province (Nos. 2023A1515011774, 2020A1515010423), Project 6142111180404 supported by CNKLSTISS, the Science and Technology Program of Guangzhou, China (No. 202002030131), the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund (No. 2019A1515110213), the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC202101), and the Scientific Research Project of Guangzhou University (No. RP2022003).
Funding
The study was supported in part by the National Natural Science Foundation of China under grants 62176027, 62002076 and 61906049; in part by the Natural Science Foundation of Guangdong Province under grants 2023A1515011774 and 2020A1515010423; in part by the Science and Technology Program of Guangzhou, China under grant 202002030131; in part by the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund under grant 2019A1515110213; in part by the General Program of the Natural Science Foundation of Chongqing under grant cstc2020jcyjmsxmX0790; in part by the Human Resources and Social Security Bureau Project of Chongqing under grant cx2020073; in part by the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) under grant MJUKF-IPIC202101; in part by the Scientific Research Project of Guangzhou University under grant RP2022003; and in part by Project 6142111180404 supported by CNKLSTISS.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, W., Wang, X., Zhou, M. et al. A spatiotemporal and motion information extraction network for action recognition. Wireless Netw 30, 5389–5405 (2024). https://doi.org/10.1007/s11276-023-03267-y