
Exploring Spatiotemporal Features for Activity Classifications in Films

  • Conference paper
  • First Online:
Neural Information Processing (ICONIP 2020)

Abstract

Humans can appreciate implicit and explicit contexts in a visual scene within a few seconds. How computers might obtain such interpretations of a visual scene is not well understood, so the question remains whether this ability can be emulated. We investigated activity classification of movie clips using a 3D convolutional neural network (CNN) as well as combinations of a 2D CNN and long short-term memory (LSTM). This work was motivated by the observations that a CNN can effectively learn representations of visual features and that an LSTM can effectively learn temporal information; hence, an architecture that combines information from many time slices should provide an effective means of capturing spatiotemporal features from a sequence of images. Eight experiments were carried out on three main architectures: 3DCNN, ConvLSTM2D, and a pipeline of a pre-trained CNN feeding an LSTM. We analyzed the empirical results, followed by a critical discussion of the analyses and suggestions for future research directions in this domain.
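
As a rough illustration of the three model families compared in this work, the sketch below shows how each could be assembled with tf.keras. This is a minimal sketch, not the models from the paper: the frame count, frame size, filter widths, class count, and the MobileNetV2 backbone are all illustrative assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # All hyperparameters below are illustrative assumptions, not values
    # reported in the paper.
    N_FRAMES, H, W, C, N_CLASSES = 8, 128, 128, 3, 12

    # 1) 3DCNN: convolve jointly over space and time.
    def build_3dcnn():
        return models.Sequential([
            layers.Input(shape=(N_FRAMES, H, W, C)),
            layers.Conv3D(32, (3, 3, 3), activation="relu"),
            layers.MaxPooling3D(pool_size=(1, 2, 2)),
            layers.Conv3D(64, (3, 3, 3), activation="relu"),
            layers.GlobalAveragePooling3D(),
            layers.Dense(N_CLASSES, activation="softmax"),
        ])

    # 2) ConvLSTM2D: a recurrent layer whose gates are 2D convolutions,
    # so spatial structure is preserved while the frame sequence is modelled.
    def build_convlstm2d():
        return models.Sequential([
            layers.Input(shape=(N_FRAMES, H, W, C)),
            layers.ConvLSTM2D(32, (3, 3), return_sequences=False),
            layers.GlobalAveragePooling2D(),
            layers.Dense(N_CLASSES, activation="softmax"),
        ])

    # 3) Pre-trained CNN-LSTM pipeline: a frozen 2D backbone extracts
    # per-frame features, then an LSTM models how they evolve over time.
    def build_cnn_lstm():
        backbone = tf.keras.applications.MobileNetV2(
            input_shape=(H, W, C), include_top=False,
            pooling="avg", weights="imagenet")
        backbone.trainable = False  # use as a fixed feature extractor
        return models.Sequential([
            layers.Input(shape=(N_FRAMES, H, W, C)),
            layers.TimeDistributed(backbone),  # -> (N_FRAMES, features)
            layers.LSTM(128),
            layers.Dense(N_CLASSES, activation="softmax"),
        ])

All three sketches map a stack of frames of shape (8, 128, 128, 3) to class probabilities; they differ only in where the temporal modelling happens.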


Notes

  1. We chose eight frames from each clip, picked at evenly spaced intervals; a sketch of this sampling step follows below. The number eight was an arbitrary decision.
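
A minimal sketch of this sampling step, assuming clips are read from video files with OpenCV; this is illustrative, not the authors' preprocessing code.

    import cv2
    import numpy as np

    def sample_frames(path, n_frames=8):
        """Return n_frames RGB frames sampled at even intervals from a clip."""
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        # Evenly spaced frame indices spanning the whole clip.
        indices = np.linspace(0, total - 1, n_frames).astype(int)
        frames = []
        for i in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return np.stack(frames)  # shape: (n_frames, height, width, 3)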


Acknowledgments

We wish to thank the Centre for Innovative Engineering, Universiti Teknologi Brunei, for the financial support given to this research. We would also like to thank the anonymous reviewers for their constructive comments and suggestions.

Author information


Correspondence to Somnuk Phon-Amnuaisuk.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Phon-Amnuaisuk, S., Hadi, S., Omar, S. (2020). Exploring Spatiotemporal Features for Activity Classifications in Films. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1332. Springer, Cham. https://doi.org/10.1007/978-3-030-63820-7_47


  • DOI: https://doi.org/10.1007/978-3-030-63820-7_47

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63819-1

  • Online ISBN: 978-3-030-63820-7

  • eBook Packages: Computer Science, Computer Science (R0)
