Abstract
Humans can appreciate the implicit and explicit contexts of a visual scene within a few seconds. How computers might obtain such interpretations of a visual scene is not well understood, so the question remains whether this ability can be emulated. We investigated activity classification of movie clips using a 3D convolutional neural network (CNN) as well as combinations of a 2D CNN and long short-term memory (LSTM). This work was motivated by the observation that CNNs can effectively learn representations of visual features and LSTMs can effectively learn temporal information; hence, an architecture that combines information from many time slices should provide an effective means of capturing the spatiotemporal features of an image sequence. Eight experiments were carried out on three main architectures: a 3D CNN, ConvLSTM2D, and a pipeline combining a pre-trained CNN with an LSTM. We analyzed the empirical results, followed by a critical discussion of the analyses and suggestions for future research directions in this domain.
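The chapter itself does not include code, but a minimal sketch may help clarify the third architecture named above, the pre-trained CNN-LSTM pipeline: a frozen image CNN extracts per-frame features, and an LSTM aggregates them over time for classification. The sketch below assumes a Keras/TensorFlow setup; the backbone (MobileNetV2), input size, LSTM width, and number of activity classes are illustrative assumptions, not the authors' configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 8, 224, 224, 3   # eight frames per clip (see Note 1)
NUM_CLASSES = 12                        # hypothetical number of activity classes

# Frozen pre-trained CNN used as a per-frame feature extractor.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(H, W, C), pooling="avg")
backbone.trainable = False

inputs = layers.Input(shape=(NUM_FRAMES, H, W, C))
# Apply the CNN to each frame independently -> (batch, NUM_FRAMES, feat_dim).
x = layers.TimeDistributed(backbone)(inputs)
# The LSTM aggregates the per-frame features over time.
x = layers.LSTM(256)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The 3D CNN and ConvLSTM2D variants would replace the TimeDistributed-plus-LSTM stack with Conv3D or ConvLSTM2D layers operating directly on the frame stack.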
Notes
- 1.
We chose eight frames from each clip, picked at evenly spaced positions. The choice of eight was arbitrary; a frame-sampling sketch is given below.
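As an illustration of the frame-sampling step described in this note, the following is a minimal sketch using OpenCV and NumPy; the function name and interface are hypothetical, not taken from the paper.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> np.ndarray:
    """Pick `num_frames` frames at evenly spaced positions across a clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)   # shape: (num_frames, height, width, 3)
```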
Acknowledgments
We wish to thank the Centre for Innovative Engineering, Universiti Teknologi Brunei for the financial support given to this research. We would also like to thank the anonymous reviewers for their constructive comments and suggestions.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Phon-Amnuaisuk, S., Hadi, S., Omar, S. (2020). Exploring Spatiotemporal Features for Activity Classifications in Films. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1332. Springer, Cham. https://doi.org/10.1007/978-3-030-63820-7_47
DOI: https://doi.org/10.1007/978-3-030-63820-7_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63819-1
Online ISBN: 978-3-030-63820-7
eBook Packages: Computer Science (R0)