Abstract
Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
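As an illustration of the dense, multi-label setup described above, the sketch below runs a plain unidirectional LSTM over per-frame CNN features and emits independent per-class sigmoid scores at every frame. It is a simplified baseline for exposition only, not the paper's LSTM variant with multiple input and output connections; the feature dimension, hidden size, class count, and use of PyTorch are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class DenseFrameLabeler(nn.Module):
    """Baseline sketch of dense, multi-label frame labeling.

    Not the paper's model: a standard unidirectional LSTM is used here only
    to illustrate emitting one score per action class at every frame.
    """

    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=65):
        super().__init__()
        # Assumed sizes: fc7-style CNN features and a MultiTHUMOS-scale label set.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, features):
        # features: (batch, num_frames, feat_dim) per-frame CNN descriptors
        hidden, _ = self.lstm(features)      # (batch, num_frames, hidden_dim)
        return self.classifier(hidden)       # per-frame, per-class logits


# Independent sigmoid/BCE losses per class, so several actions can co-occur
# in the same frame (the "dense, multiple labels" setting).
model = DenseFrameLabeler()
frames = torch.randn(2, 30, 4096)                    # 2 clips, 30 frames each
targets = torch.randint(0, 2, (2, 30, 65)).float()   # dense multi-label targets
loss = nn.BCEWithLogitsLoss()(model(frames), targets)
loss.backward()
```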
Notes
The dataset is available for download at http://ai.stanford.edu/~syyeung/everymoment.html.
A similar behavior can be obtained with a bi-directional model by zeroing out the hidden state information from future time frames, but this artificially distorts the test-time behavior of the model's outputs, whereas our model always operates in the regime it was trained in.
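To make this footnote concrete, here is a small illustrative snippet (in PyTorch, an assumption; not code from the paper) of the mismatch it describes: a bidirectional LSTM's output concatenates forward and backward states, so zeroing the backward half at test time feeds any downstream classifier features unlike those it was trained on, whereas a forward-only model is causal by construction.

```python
import torch
import torch.nn as nn

hidden_dim, feat_dim = 512, 4096
bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

frames = torch.randn(1, 30, feat_dim)    # one clip of 30 frames
output, _ = bilstm(frames)               # (1, 30, 2 * hidden_dim): [forward | backward]

# Simulated "causal" use of a bidirectional model: zero the backward half,
# which summarizes future frames. A classifier trained on the full
# 2 * hidden_dim representation never saw such inputs during training,
# which is the train/test mismatch the footnote describes.
causal_like = output.clone()
causal_like[..., hidden_dim:] = 0.0
```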
Acknowledgements
We would like to thank Andrej Karpathy and Amir Zamir for helpful comments and discussion.
Additional information
Communicated by Ivan Laptev, Cordelia Schmid.
Cite this article
Yeung, S., Russakovsky, O., Jin, N. et al. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. Int J Comput Vis 126, 375–389 (2018). https://doi.org/10.1007/s11263-017-1013-y