Abstract
Early action prediction aims to predict the class label of an action before the action is fully performed. This is challenging because the beginning stages of different actions can be very similar, leaving only subtle differences for discrimination. In this paper, we propose a novel Expert Retrieval and Assembly (ERA) module that retrieves and assembles a set of experts most specialized at exploiting these discriminative subtle differences, in order to distinguish an input sample from other highly similar samples. To encourage our model to use subtle differences effectively for early action prediction, we push each expert to discriminate exclusively between samples that are highly similar, forcing the experts to learn the subtle differences that exist between those samples. Additionally, we design an effective Expert Learning Rate Optimization method that balances the optimization across experts and leads to better performance. We evaluate our ERA module on four public action datasets and achieve state-of-the-art performance.
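The retrieval-and-assembly idea described above can be illustrated with a minimal sketch: a small gating network scores a bank of expert weight matrices conditioned on the input, and the top-scoring experts are combined into a single sample-specific transform. This is an assumption-laden toy illustration (the class name `ERALayer`, the gating design, and all hyperparameters are hypothetical), not the paper's actual architecture or training procedure.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

class ERALayer:
    """Toy sketch of an expert retrieval-and-assembly layer.

    A gating vector per expert scores the input (retrieval), the top-k
    experts are kept, and their weight matrices are blended into one
    sample-specific linear transform (assembly). Purely illustrative.
    """
    def __init__(self, in_dim, out_dim, num_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # bank of expert weight matrices: (num_experts, out_dim, in_dim)
        self.experts = rng.normal(0.0, 0.1, (num_experts, out_dim, in_dim))
        # one gating vector per expert: (num_experts, in_dim)
        self.gate = rng.normal(0.0, 0.1, (num_experts, in_dim))
        self.top_k = top_k

    def __call__(self, x):
        scores = softmax(self.gate @ x)           # retrieval: score every expert
        idx = np.argsort(scores)[-self.top_k:]    # keep only the top-k experts
        w = scores[idx] / scores[idx].sum()       # renormalize their weights
        # assembly: weighted sum of the selected expert matrices
        assembled = np.tensordot(w, self.experts[idx], axes=1)
        return assembled @ x

layer = ERALayer(in_dim=4, out_dim=3, num_experts=6, top_k=2, seed=1)
y = layer(np.ones(4))   # output of the assembled, sample-specific transform
```

Because the assembled weights depend on the input, two near-identical samples can still be routed to differently weighted expert mixtures, which is the mechanism the paper leverages to separate highly similar early-stage actions.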
L. G. Foo and T. Li—Equal contribution.
Acknowledgement
This work is supported by National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-100E-2020-065), Ministry of Education Tier 1 Grant and SUTD Startup Research Grant.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Foo, L.G., Li, T., Rahmani, H., Ke, Q., Liu, J. (2022). ERA: Expert Retrieval and Assembly for Early Action Prediction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13694. Springer, Cham. https://doi.org/10.1007/978-3-031-19830-4_38
Print ISBN: 978-3-031-19829-8
Online ISBN: 978-3-031-19830-4