Abstract
Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
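As an illustration of the dense, multi-label setup described above, the sketch below runs a plain unidirectional LSTM over per-frame CNN features and emits independent per-class sigmoid scores at every frame. It is a simplified baseline for exposition only, not the paper's LSTM variant with multiple input and output connections; the feature dimension, hidden size, class count, and use of PyTorch are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class DenseFrameLabeler(nn.Module):
    """Baseline sketch of dense, multi-label frame labeling.

    Not the paper's model: a standard unidirectional LSTM is used here only
    to illustrate emitting one score per action class at every frame.
    """

    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=65):
        super().__init__()
        # Assumed sizes: fc7-style CNN features and a MultiTHUMOS-scale label set.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, features):
        # features: (batch, num_frames, feat_dim) per-frame CNN descriptors
        hidden, _ = self.lstm(features)      # (batch, num_frames, hidden_dim)
        return self.classifier(hidden)       # per-frame, per-class logits


# Independent sigmoid/BCE losses per class, so several actions can co-occur
# in the same frame (the "dense, multiple labels" setting).
model = DenseFrameLabeler()
frames = torch.randn(2, 30, 4096)                    # 2 clips, 30 frames each
targets = torch.randint(0, 2, (2, 30, 65)).float()   # dense multi-label targets
loss = nn.BCEWithLogitsLoss()(model(frames), targets)
loss.backward()
```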
Notes
The dataset is available for download at http://ai.stanford.edu/~syyeung/everymoment.html.
A similar behavior can be obtained with a bi-directional model by zeroing out the hidden state information from future time frames, but this artificially distorts the test-time behavior of the model's outputs, whereas our model always operates in the regime it was trained in.
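To make this footnote concrete, here is a small illustrative snippet (in PyTorch, an assumption; not code from the paper) of the mismatch it describes: a bidirectional LSTM's output concatenates forward and backward states, so zeroing the backward half at test time feeds any downstream classifier features unlike those it was trained on, whereas a forward-only model is causal by construction.

```python
import torch
import torch.nn as nn

hidden_dim, feat_dim = 512, 4096
bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

frames = torch.randn(1, 30, feat_dim)    # one clip of 30 frames
output, _ = bilstm(frames)               # (1, 30, 2 * hidden_dim): [forward | backward]

# Simulated "causal" use of a bidirectional model: zero the backward half,
# which summarizes future frames. A classifier trained on the full
# 2 * hidden_dim representation never saw such inputs during training,
# which is the train/test mismatch the footnote describes.
causal_like = output.clone()
causal_like[..., hidden_dim:] = 0.0
```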
Acknowledgements
We would like to thank Andrej Karpathy and Amir Zamir for helpful comments and discussion.
Additional information
Communicated by Ivan Laptev, Cordelia Schmid.
Cite this article
Yeung, S., Russakovsky, O., Jin, N. et al. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. Int J Comput Vis 126, 375–389 (2018). https://doi.org/10.1007/s11263-017-1013-y