
A spatiotemporal and motion information extraction network for action recognition


Abstract

With the continuous advancement of the Internet of Things and deep learning, video action recognition is gradually finding its way into daily and industrial applications. Spatiotemporal and motion patterns are two crucial and complementary types of information for action recognition. However, effectively modelling both types of information in videos remains challenging. In this paper, we propose a spatiotemporal and motion information extraction (STME) network that extracts comprehensive spatiotemporal and motion information from videos for action recognition. First, we design the STME network, which comprises three efficient modules: a spatiotemporal extraction (STE) module, a short-term motion extraction (SME) module and a long-term motion extraction (LME) module. The SME and LME modules model short-term and long-term motion representations, respectively. Then, we apply the STE module to capture comprehensive spatiotemporal information, which supplements the video representation for action recognition. According to our experimental results, the STME network achieves significantly better performance than existing methods on several benchmark datasets. Our code is available at https://github.com/STME-Net/STME.
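To make the architecture described in the abstract more concrete, the following is a minimal PyTorch-style sketch of how a block combining the three modules might be wired into a 2D backbone. The internals of STE, SME and LME shown here (frame differencing for short-term motion, a depthwise temporal convolution for long-term motion, and a spatial convolution for spatiotemporal features) and the names `STMEBlock`, `num_segments`, etc. are simplifying assumptions for illustration only and do not reproduce the authors' design; the reference implementation is at https://github.com/STME-Net/STME.

```python
# Illustrative sketch only: module internals are assumptions, not the published STME design.
import torch
import torch.nn as nn


class STE(nn.Module):
    """Spatiotemporal extraction (assumed here as a plain spatial conv per frame)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, num_segments):
        return self.conv(x)


class SME(nn.Module):
    """Short-term motion (assumed here as adjacent-frame feature differences)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

    def forward(self, x, num_segments):
        # x: (N*T, C, H, W); regroup frames belonging to the same clip
        nt, c, h, w = x.shape
        n = nt // num_segments
        feat = x.view(n, num_segments, c, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                                # motion between neighbours
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)   # pad the last frame
        return self.conv(diff.reshape(nt, c, h, w))


class LME(nn.Module):
    """Long-term motion (assumed here as a depthwise temporal conv over all frames)."""
    def __init__(self, channels):
        super().__init__()
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1, groups=channels)

    def forward(self, x, num_segments):
        nt, c, h, w = x.shape
        n = nt // num_segments
        feat = x.view(n, num_segments, c, h * w).permute(0, 3, 2, 1)      # (N, HW, C, T)
        feat = self.temporal_conv(feat.reshape(n * h * w, c, num_segments))
        feat = feat.view(n, h * w, c, num_segments).permute(0, 3, 2, 1)   # (N, T, C, HW)
        return feat.reshape(nt, c, h, w)


class STMEBlock(nn.Module):
    """Fuses the three complementary cues with a residual connection."""
    def __init__(self, channels, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        self.ste, self.sme, self.lme = STE(channels), SME(channels), LME(channels)

    def forward(self, x):
        t = self.num_segments
        return x + self.ste(x, t) + self.sme(x, t) + self.lme(x, t)


if __name__ == "__main__":
    block = STMEBlock(channels=64, num_segments=8)
    clip_features = torch.randn(2 * 8, 64, 56, 56)    # 2 clips x 8 sampled frames
    print(block(clip_features).shape)                  # torch.Size([16, 64, 56, 56])
```

In this sketch the three cues are fused additively through a residual connection, so the block keeps the input and output shapes identical and could be dropped into an existing ResNet stage; whether the actual STME network fuses its modules this way is not specified in the abstract.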




Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 62072127, 62002076 and 61906049), the Natural Science Foundation of Guangdong Province (Nos. 2023A1515011774 and 2020A1515010423), Project 6142111180404 supported by CNKLSTISS, the Science and Technology Program of Guangzhou, China (No. 202002030131), the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund (No. 2019A1515110213), the Open Fund Project of the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC202101), and the Scientific Research Project for Guangzhou University (No. RP2022003).

Funding

The study was supported in part by the National Natural Science Foundation of China under grants 62176027, 62002076 and 61906049; in part by the Natural Science Foundation of Guangdong Province under grants 2023A1515011774 and 2020A1515010423; in part by the Science and Technology Program of Guangzhou, China under grant 202002030131; in part by the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund under grant 2019A1515110213; in part by the General Program of the National Natural Science Foundation of Chongqing under grant cstc2020jcyjmsxmX0790; in part by the Human Resources and Social Security Bureau Project of Chongqing under grant cx2020073; in part by the Open Fund Project of the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) under grant MJUKF-IPIC202101; in part by the Scientific Research Project for Guangzhou University under grant RP2022003; and in part by Project 6142111180404 supported by CNKLSTISS.

Author information

Corresponding author

Correspondence to Xianmin Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, W., Wang, X., Zhou, M. et al. A spatiotemporal and motion information extraction network for action recognition. Wireless Netw 30, 5389–5405 (2024). https://doi.org/10.1007/s11276-023-03267-y

