Video Captioning Based on Graph Neural Network Made from Action Knowledge and Object Features

Int J Performability Eng, 2024, Vol. 20, Issue (4): 214-223. doi: 10.23940/ijpe.24.04.p3.214223

Prashant Kaushik*, Vikas Saxena, and Amarjeet Prajapati   

  1. Department of CSE&IT, Jaypee Institute of Information Technology, Uttar Pradesh, India
  • Contact: * E-mail address: jiitprashant@gmail.com

Abstract: Encoder-decoder-based video captioning produces a holistic description that reflects the training data. Such captions lack features specific to object motion. Knowledge of object motion in a video enables object-oriented captions. Similarly, action-knowledge-based models provide action-based features. The traditional encoder-decoder method, by contrast, uses frame-level scene features, and more advanced existing methods extract spatial and temporal features to build context vectors. The lack of methods that extract action knowledge together with object motion features limits the ability of these models to produce action-object-oriented captions. Furthermore, when multiple objects are in motion, state-of-the-art methods produce disoriented captions. We propose a partial grid-based method for extracting action-object-oriented features. These features capture an object's motion, its interactions with other objects, and its movement within the scene. The proposed method uses these features to construct a graph neural network, which is then used with graph-based filters. For training, validation, and evaluation, we used 75 videos from the MSVD dataset, re-annotated with object activity and interaction descriptions. The proposed model demonstrates object-action-based video captioning that covers object-action and object-background interaction. Evaluation with BLEU and METEOR demonstrates the feasibility of graph neural network-based methods and the superiority of the proposed approach.
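To illustrate what applying a graph-based filter to object features can look like, the sketch below performs one propagation step over a small scene graph. It is not the authors' implementation: the abstract does not specify the filter, so this assumes a common normalized-adjacency form, ReLU(Â X W) with Â = D^(-1/2) (A + I) D^(-1/2). The toy graph (two objects and a background node), the feature values, and the function names are all hypothetical.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Build A_hat = D^(-1/2) (A + I) D^(-1/2) as a dense list of lists."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each node keeps its own features
    for i, j in edges:
        a[i][j] = 1.0  # undirected interaction edge
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def graph_filter(a_hat, features, weights):
    """One propagation step: ReLU(A_hat @ X @ W)."""
    h = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in h]


# Hypothetical toy scene: nodes 0 and 1 are objects, node 2 is the background.
edges = [(0, 1), (1, 2)]  # object-object and object-background interactions
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # assumed per-node features
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only

a_hat = normalized_adjacency(3, edges)
out = graph_filter(a_hat, features, weights)
```

After this step, each node's output mixes its own features with those of the nodes it interacts with, which is the basic mechanism by which a graph filter can propagate object-action and object-background information. In a trained model the weights would be learned rather than fixed.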

Key words: video understanding, graph neural network, video captioning, object-level analysis, object-action video captions