Authors:
Felipe F. Costa; Priscila T. M. Saito and Pedro H. Bugatti
Affiliation:
Department of Computing, Federal University of Technology - Parana, 1640 Alberto Carazzai Ave., Cornelio Procopio, Brazil
Keyword(s):
Deep Learning, Graph Convolutional Network, Computer Vision, Action Classification.
Abstract:
Video classification methods have been evolving through proposals based on end-to-end deep learning architectures. Several works have demonstrated that end-to-end models are effective at learning intrinsic video
features, especially when compared to handcrafted ones. In general, convolutional neural networks are
used for deep learning in videos. Usually, when applied to such contexts, these vanilla deep learning networks
cannot identify variations based on temporal information. To do so, memory-based cells (e.g., long short-term
memory) or optical flow techniques are used in conjunction with the convolutional process. However,
despite their effectiveness, those methods neglect global analysis, processing only a small number of frames
in each batch during learning and inference. Moreover, they also completely ignore the semantic
relationship between different videos that belong to the same context. Thus, the present work aims to fill these
gaps by using information grouping concepts and contextual detection through graph-based convolutional neural networks. The experiments show that our method achieves up to 87% accuracy on a well-known public
video dataset.
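The graph-based aggregation the abstract alludes to can be illustrated with a single normalized graph-convolution step in the style of standard GCN layers, where each node (e.g., a video clip) mixes its features with those of semantically related neighbours. This is a minimal sketch with hypothetical toy data, not the authors' implementation; the graph, feature sizes, and weights are illustrative assumptions:

```python
import numpy as np

def gcn_layer(adj, feats, weights):
    """One normalized graph-convolution step:
    H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    Nodes aggregate features from their neighbours, so related
    videos in the same context share information."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    deg = a_hat.sum(axis=1)                       # node degrees
    d_inv_sqrt = np.diag(deg ** -0.5)             # D^-1/2
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
    return np.maximum(a_norm @ feats @ weights, 0.0)  # ReLU activation

# Toy context graph: 4 videos, edges linking videos of the same action class.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.random.default_rng(0).normal(size=(4, 8))    # 8-d clip features
weights = np.random.default_rng(1).normal(size=(8, 3))  # project to 3 outputs
out = gcn_layer(adj, feats, weights)
print(out.shape)  # one output vector per video
```

The symmetric normalization keeps the aggregated features on a comparable scale regardless of how many neighbours each video has, which is what lets the model reason over the whole context graph rather than isolated batches of frames.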