Learning and Distillating the Internal Relationship of Motion Features in Action Recognition

Lu, Lu; Li, Siyuan; Chen, Niannian; Gao, Lin; Fan, Yong; Jiang, Yong; Wu, Ling

doi:10.1007/978-3-030-63820-7_28

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1332)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

In video-based action recognition, most advanced approaches train a two-stream architecture consisting of an appearance stream for images and a motion stream for optical flow frames. Because computing optical flow is expensive and two-stream inference has high latency, knowledge distillation has been introduced to capture the two-stream representation efficiently while taking only RGB images as input. Following this technique, this paper proposes a novel distillation learning strategy that sufficiently learns and mimics the representation of the motion stream. In addition, we propose a lightweight attention-based fusion module to uniformly exploit both appearance and motion information. Experiments show that the proposed distillation strategy and fusion module outperform the baseline technique, and that our proposal outperforms known state-of-the-art approaches among both single-stream and traditional two-stream methods.
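
As a rough illustration of the distillation idea described in the abstract, the sketch below lets an RGB "student" feature map mimic an optical-flow "teacher" feature map. The abstract does not state the loss, so everything here is an assumption: the feature-matching term is a plain mean squared error, the "internal relationship" term is modelled as a channel-wise Gram matrix, and the names gram, mse, distillation_loss and relation_weight are hypothetical. It is a toy in pure Python, not the authors' implementation.

```python
def gram(features):
    """Channel-by-channel inner products of a C x N feature map.

    `features` is a list of C channels, each a list of N activations.
    Assumption: channel correlations stand in for the "internal
    relationship" of motion features; the paper may define it differently.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def distillation_loss(student, teacher, relation_weight=1.0):
    """Feature-mimicking term plus a weighted relationship term (hypothetical)."""
    feature_term = mse(student, teacher)                # match activations
    relation_term = mse(gram(student), gram(teacher))   # match channel relations
    return feature_term + relation_weight * relation_term


if __name__ == "__main__":
    # Toy 2-channel, 2-position feature maps (made-up numbers).
    teacher = [[1.0, 2.0], [0.0, 1.0]]   # motion (optical flow) stream features
    student = [[1.0, 2.0], [0.0, 0.0]]   # RGB stream features
    print(distillation_loss(student, teacher))
```

In a real training setup the teacher features would come from a frozen motion stream, and this loss would be added to the usual classification loss of the RGB stream; the weighting between the terms is a design choice not given in the abstract.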

Acknowledgment

This research is supported by the Foundation of Sichuan Provincial Education Department (No. 18ZA0501).

Author information

Authors and Affiliations

Southwest University of Science and Technology, Mianyang, China
Lu Lu, Siyuan Li, Niannian Chen, Lin Gao, Yong Fan, Yong Jiang & Ling Wu

Corresponding author

Correspondence to Niannian Chen.

Editor information

Editors and Affiliations

Department of AI, Ping An Life, Shenzhen, China
Haiqin Yang
Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Kitsuchart Pasupa
City University of Hong Kong, Kowloon, Hong Kong
Andrew Chi-Sing Leung
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong
James T. Kwok
School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
Jonathan H. Chan
The Chinese University of Hong Kong, New Territories, Hong Kong
Irwin King

About this paper

Cite this paper

Lu, L. et al. (2020). Learning and Distillating the Internal Relationship of Motion Features in Action Recognition. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1332. Springer, Cham. https://doi.org/10.1007/978-3-030-63820-7_28

DOI: https://doi.org/10.1007/978-3-030-63820-7_28
Published: 17 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63819-1
Online ISBN: 978-3-030-63820-7
eBook Packages: Computer Science, Computer Science (R0)
