Abstract
Visual Question Answering (VQA) aims to automatically answer natural language questions about given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntactic information of the question, which plays a vital role in understanding its essential semantics and in guiding visual feature refinement. To fill this gap, we propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax trees. The model extracts a syntax tree from each question to obtain more precise syntactic information. Specifically, we parse questions with the Stanford syntactic parsing tool to obtain their syntax trees. A hierarchical tree convolutional network then extracts syntactic phrase features and question features at the word and phrase levels. We further design a phrase-aware message-passing mechanism over visual entities that captures entity features conditioned on the given visual context. Extensive experiments on the VQA2.0 dataset demonstrate the superiority of our proposed model.
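As a minimal illustration of the parsing step described above, the sketch below obtains a constituency (syntax) tree for a question using the Stanford NLP group's Stanza toolkit; the paper's exact tool and any tree post-processing used by STCGN may differ.

```python
# Illustrative sketch only: obtaining a question's constituency tree with Stanza.
# STCGN's actual parser configuration (e.g. Stanford CoreNLP) may differ.
import stanza

# Download English models once (tokenizer, POS tagger, constituency parser).
stanza.download("en", processors="tokenize,pos,constituency")

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,constituency")

question = "What color is the umbrella on the left?"
doc = nlp(question)

tree = doc.sentences[0].constituency  # parse tree of the (single) sentence
print(tree)  # e.g. (ROOT (SBARQ (WHNP ...) (SQ ...) (. ?)))
```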
X. Su and Q. Zhang: The first two authors contributed equally to this work.
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086 (2018)
Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
Ben-Younes, H., Cadène, R., Cord, M., Thome, N.: MUTAN: multimodal tucker fusion for visual question answering. In: ICCV, pp. 2631–2639 (2017)
Cadène, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering. In: CVPR, pp. 1989–1998 (2019)
Guo, D., Xu, C., Tao, D.: Bilinear graph networks for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 34, 1023–1034 (2021)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Hu, R., Rohrbach, A., Darrell, T., Saenko, K.: Language-conditioned graph networks for relational reasoning. In: ICCV, pp. 10293–10302 (2019)
Ilievski, I., Feng, J.: Generative attention model with adversarial self-learning for visual question answering. In: Thematic Workshops of ACM Multimedia 2017, pp. 415–423 (2017)
Kim, J., Jun, J., Zhang, B.: Bilinear attention networks. In: NeurIPS, pp. 1571–1581 (2018)
Kumar, A., et al.: Ask me anything: dynamic memory networks for natural language processing. In: ICML, vol. 48, pp. 1378–1387 (2016)
Lei, Z., Zhang, G., Wu, L., Zhang, K., Liang, R.: A multi-level mesh mutual attention model for visual question answering. Data Sci. Eng. 7(4), 339–353 (2022)
Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: ICCV, pp. 10312–10321 (2019)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Nguyen, D.K., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR, pp. 6087–6096 (2018)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: NIPS, vol. 28 (2015)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29 (2016)
Ye, T., Hu, L., Zhang, Q., Lai, Z.Y., Naseem, U., Liu, D.D.: Show me the best outfit for a certain scene: a scene-aware fashion recommender system. In: WWW, pp. 1172–1180 (2023)
Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D.: Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 29(12), 5947–5959 (2018)
Zhang, Q., Cao, L., Shi, C., Niu, Z.: Neural time-aware sequential recommendation by jointly modeling preference dynamics and explicit feature couplings. IEEE Trans. Neural Netw. Learn. Syst. 33(10), 5125–5137 (2022)
Zhang, Q., Hu, L., Cao, L., Shi, C., Wang, S., Liu, D.D.: A probabilistic code balance constraint with compactness and informativeness enhancement for deep supervised hashing. In: IJCAI, pp. 1651–1657 (2022)
Zhang, S., Chen, M., Chen, J., Zou, F., Li, Y., Lu, P.: Multimodal feature-wise co-attention method for visual question answering. Inf. Fusion 73, 1–10 (2021)
Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)
A Experimental Settings
A.1 Datasets
The VQA2.0 dataset consists of real images from MSCOCO [13] and follows the same split into training, validation, and test sets, which contain more than 80k images and 444k questions, 40k images and 214k questions, and 80k images and 448k questions, respectively. Each image is associated with at least three questions on average. The questions fall into three categories according to answer type: yes/no, number, and other. Each image-question pair was annotated with 10 answers by human annotators, and the most frequent answer is taken as the ground truth. The dataset contains both open-ended and multiple-choice questions. In this paper, we focus on open-ended questions and take answers that appear more than 9 times in the training set as candidate answers, yielding 3,129 candidate answers.
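As a hedged sketch of the candidate-answer construction described above (the actual preprocessing, e.g. answer normalization, may differ), one could filter training answers by frequency as follows; the annotation format assumed here mirrors the official VQA2.0 annotation files.

```python
# Sketch: building the candidate answer set from training annotations.
# Assumes `train_annotations` is a list of {"answers": [{"answer": str}, ...]} entries,
# as in the official VQA2.0 annotation files; names here are illustrative.
from collections import Counter

def build_candidate_answers(train_annotations, min_count=9):
    counts = Counter()
    for ann in train_annotations:
        for a in ann["answers"]:          # 10 human answers per image-question pair
            counts[a["answer"]] += 1
    # Keep answers appearing more than `min_count` times (> 9 yields roughly 3,129 candidates).
    return sorted(ans for ans, c in counts.items() if c > min_count)

# answer_vocab = build_candidate_answers(train_annotations)
# len(answer_vocab)  # approximately 3129 on VQA2.0
```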
A.2 Baselines
We use the following state-of-the-art methods as baselines to evaluate the performance of our STCGN: (1) LSTM-VGG [2] uses an LSTM and VGG to extract text and image features, respectively, and fuses them through an element-wise product. (2) SAN [18] designs a stacked attention mechanism that locates the image regions most relevant to the question step by step to infer the final answer. (3) DMN [10] constructs four modules (input, question, episodic memory, and answer) whose cyclic operation yields an iterative attention process. (4) MUTAN [3] proposes a tensor-based Tucker decomposition that uses low-rank matrix factorization to reduce the large number of parameters in traditional bilinear models. (5) BUTD [1] combines bottom-up and top-down attention; the top-down attention computes the relevance between the question and each region to obtain question-related image features more accurately. (6) BAN [9] considers the bilinear interaction between two input channels and fuses visual and textual information by computing a bilinear attention distribution. (7) LCGN [7] designs a question-aware message-passing mechanism that uses question features to guide the refinement of entity features over multiple iterations, integrating entity features with context information. (8) MuRel [4] enriches the interaction between question and image regions through rich vector representations and a sequence of MuRel cells. (9) ReGAT [12] extracts explicit and implicit relations in each image, builds the corresponding relation graphs, and uses a question-aware graph attention network to integrate information from different relation spaces. (10) UFSCAN [23] proposes a multimodal feature-wise attention mechanism and simultaneously models feature-wise co-attention and spatial co-attention between image and question modalities. (11) MMMAN [11] proposes a multi-level mesh mutual attention model that exploits low-dimensional and high-dimensional question information at different levels, providing richer feature information for modal interactions.
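To make the attention-based baselines more concrete, the following is a minimal, generic sketch of question-guided top-down attention over region features in the style of BUTD [1]; it is not the authors' or the original paper's implementation, and the layer sizes and gating choices are illustrative assumptions.

```python
# Generic sketch of question-guided top-down attention over K region features,
# in the spirit of BUTD [1]; dimensions and the tanh projection are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hid_dim=512):
        super().__init__()
        self.proj = nn.Linear(v_dim + q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):
        # v: [B, K, v_dim] region features, q: [B, q_dim] question feature
        q_exp = q.unsqueeze(1).expand(-1, v.size(1), -1)          # [B, K, q_dim]
        joint = torch.tanh(self.proj(torch.cat([v, q_exp], dim=-1)))
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=1)   # [B, K] attention weights
        return (alpha.unsqueeze(-1) * v).sum(dim=1)               # attended image feature [B, v_dim]
```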
A.3 Settings
In the experiments, Adamax is selected as the optimizer with an initial learning rate of 0.001. As training progresses, the learning rate is gradually warmed up from 0.001 to 0.004; once the model accuracy peaks, the learning rate begins to decay with a decay factor of 0.5, and the model is trained for 30 epochs. For text, the word embedding dimension of the question is 300 and the part-of-speech embedding dimension is 128. In the tree hierarchical convolution module, the word-level convolution uses a kernel of size 3×428 with the maximum syntax subtree length set to 4, and the phrase-level convolution uses a multi-head attention mechanism with 8 heads. To extract question features, we use a bidirectional GRU with a hidden dimension of 1024. For image features, the dimension is set to 2048, each image contains 10–100 visual entity regions, the total number of message-passing steps T is set to 4, and the dimension of the scene-context-aware entity features is set to 1024. In our experiments, the VQA score provided by the official VQA challenge is used as the evaluation metric:

$$\mathrm{Acc}(a) = \min\left(\frac{\#\{\text{annotators who answered } a\}}{3},\ 1\right)$$
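A minimal sketch of the described training schedule is given below (Adamax, linear warm-up from 0.001 to 0.004, decay by 0.5 once validation accuracy stops improving, 30 epochs). The warm-up length, the plateau test, and the placeholder model are assumptions; only the quoted hyperparameters come from the paper.

```python
# Sketch of the training schedule described above; warm-up length and plateau
# handling are assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

model = nn.Linear(10, 3129)  # dummy stand-in; replace with the actual VQA model
optimizer = torch.optim.Adamax(model.parameters(), lr=0.001)

def set_lr(opt, lr):
    for group in opt.param_groups:
        group["lr"] = lr

best_acc = 0.0
for epoch in range(30):                               # 30 training epochs
    if epoch < 4:                                     # assumed 4-epoch linear warm-up: 0.001 -> 0.004
        set_lr(optimizer, 0.001 + (0.004 - 0.001) * epoch / 3)

    # ... one training epoch over the VQA2.0 training set would go here ...

    acc = 0.0  # placeholder for this epoch's validation VQA score
    if acc > best_acc:
        best_acc = acc
    else:                                             # accuracy stops improving: decay lr by 0.5
        set_lr(optimizer, optimizer.param_groups[0]["lr"] * 0.5)
```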