
Syntax Tree Constrained Graph Network for Visual Question Answering

  • Conference paper
Neural Information Processing (ICONIP 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14451)


Abstract

Visual Question Answering (VQA) aims to automatically answer natural language questions about given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntax information of the question, which plays a vital role in understanding its essential semantics and in guiding visual feature refinement. To fill this gap, we propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax trees. The model extracts a syntax tree from the question to obtain more precise syntactic information. Specifically, we parse each question with the Stanford syntax parsing tool to obtain its syntax tree, and extract syntactic phrase features and question features at the word and phrase levels using a hierarchical tree convolutional network. We then design a phrase-aware message-passing mechanism over visual entities that captures entity features according to the given visual context. Extensive experiments on the VQA 2.0 dataset demonstrate the superiority of our proposed model.
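
As an illustration of the parsing step described above, the sketch below obtains a question's constituency (syntax) tree with Stanza, a Python package for the Stanford NLP models; the paper only states that the Stanford syntax parsing tool is used, so this particular library and pipeline configuration are assumptions made for illustration.

```python
# Minimal sketch (assumption): extract a question's constituency parse with Stanza,
# a Python interface to Stanford NLP models. The paper's exact tooling may differ.
import stanza

stanza.download("en")  # fetch English models once
nlp = stanza.Pipeline("en", processors="tokenize,pos,constituency")

question = "What color is the umbrella held by the woman?"
doc = nlp(question)

# Each parsed sentence exposes its constituency tree; in STCGN such a tree would be
# consumed by the hierarchical tree convolutional network for phrase-level features.
for sentence in doc.sentences:
    print(sentence.constituency)
```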

X. Su and Q. Zhang—The first two authors contributed equally to this work.


References

1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086 (2018)
2. Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
3. Ben-Younes, H., Cadène, R., Cord, M., Thome, N.: MUTAN: multimodal Tucker fusion for visual question answering. In: ICCV, pp. 2631–2639 (2017)
4. Cadène, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering. In: CVPR, pp. 1989–1998 (2019)
5. Guo, D., Xu, C., Tao, D.: Bilinear graph networks for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 34, 1023–1034 (2021)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
7. Hu, R., Rohrbach, A., Darrell, T., Saenko, K.: Language-conditioned graph networks for relational reasoning. In: ICCV, pp. 10293–10302 (2019)
8. Ilievski, I., Feng, J.: Generative attention model with adversarial self-learning for visual question answering. In: Thematic Workshops of ACM Multimedia 2017, pp. 415–423 (2017)
9. Kim, J., Jun, J., Zhang, B.: Bilinear attention networks. In: NeurIPS, pp. 1571–1581 (2018)
10. Kumar, A., et al.: Ask me anything: dynamic memory networks for natural language processing. In: ICML, vol. 48, pp. 1378–1387 (2016)
11. Lei, Z., Zhang, G., Wu, L., Zhang, K., Liang, R.: A multi-level mesh mutual attention model for visual question answering. Data Sci. Eng. 7(4), 339–353 (2022)
12. Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: ICCV, pp. 10312–10321 (2019)
13. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
14. Nguyen, D.K., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR, pp. 6087–6096 (2018)
15. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
16. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: NIPS, vol. 28 (2015)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
18. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29 (2016)
19. Ye, T., Hu, L., Zhang, Q., Lai, Z.Y., Naseem, U., Liu, D.D.: Show me the best outfit for a certain scene: a scene-aware fashion recommender system. In: WWW, pp. 1172–1180 (2023)
20. Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D.: Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 29(12), 5947–5959 (2018)
21. Zhang, Q., Cao, L., Shi, C., Niu, Z.: Neural time-aware sequential recommendation by jointly modeling preference dynamics and explicit feature couplings. IEEE Trans. Neural Netw. Learn. Syst. 33(10), 5125–5137 (2022)
22. Zhang, Q., Hu, L., Cao, L., Shi, C., Wang, S., Liu, D.D.: A probabilistic code balance constraint with compactness and informativeness enhancement for deep supervised hashing. In: IJCAI, pp. 1651–1657 (2022)
23. Zhang, S., Chen, M., Chen, J., Zou, F., Li, Y., Lu, P.: Multimodal feature-wise co-attention method for visual question answering. Inf. Fusion 73, 1–10 (2021)
24. Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)

Author information


Corresponding author

Correspondence to Chongyang Shi.


A Experimental Settings

1.1 A.1 Datasets

The VQA 2.0 dataset consists of real images from MSCOCO [13] and follows the same training/validation/test split: the training set contains more than 80k images and 444k questions, the validation set 40k images and 214k questions, and the test set 80k images and 448k questions. Each image is associated with at least three questions on average. Based on the type of answer, the questions fall into three categories: yes/no, number, and other. For each image-question pair, 10 answers were collected from human annotators, and the most frequent answer is taken as the ground truth. The dataset contains both open-ended and multiple-choice questions. In this paper, we focus on the open-ended setting and take answers that appear more than 9 times in the training set as candidate answers, yielding 3129 candidate answers.
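
As a concrete illustration of this candidate-answer construction, the sketch below filters training-set answers by frequency. The threshold (answers occurring more than 9 times) comes from the text; the data format and the helper name build_answer_vocab are hypothetical.

```python
from collections import Counter

def build_answer_vocab(train_answers, min_count=9):
    """Keep answers occurring more than `min_count` times in the training set.

    `train_answers` is assumed to be an iterable of (already normalised) answer
    strings, one per human annotation; the real preprocessing also normalises
    punctuation and numbers, which is omitted here for brevity.
    """
    counts = Counter(train_answers)
    # Answers appearing more than 9 times become candidate answers
    # (about 3129 candidates on VQA 2.0 according to the paper).
    vocab = sorted(ans for ans, c in counts.items() if c > min_count)
    return {ans: idx for idx, ans in enumerate(vocab)}

# Toy usage: answer_to_idx = build_answer_vocab(["yes", "no", "yes", "2"])
```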

1.2 A.2 Baselines

We use the following state-of-the-art methods as baselines to evaluate the performance of our STCGN: (1) LSTM-VGG [2] uses an LSTM and VGG to extract text and image features, respectively, and fuses the two through an element-wise product. (2) SAN [18] designs a stacked attention mechanism that locates the image regions most relevant to the question step by step to infer the final answer. (3) DMN [10] constructs four network modules (input, question, episodic memory, and answer) whose cyclic operation yields an iterative attention process. (4) MUTAN [3] proposes a tensor-based Tucker decomposition that uses low-rank matrix factorization to reduce the large number of parameters in traditional bilinear models. (5) BUTD [1] combines bottom-up and top-down attention; the top-down attention computes the correlation between the question and each image region to obtain more accurate question-relevant image features. (6) BAN [9] considers the bilinear interaction between the two input channels and fuses visual and textual information by computing a bilinear attention distribution. (7) LCGN [7] designs a question-aware message-passing mechanism that uses question features to guide the refinement of entity features over multiple iterations, integrating entity features with contextual information. (8) MuRel [4] enriches the interaction between question and image regions through vector representations and a sequence of MuRel units. (9) ReGAT [12] extracts explicit and implicit relations in each image, constructs the corresponding explicit and implicit relation graphs, and uses a question-aware graph attention network to integrate information from the different relation spaces. (10) UFSCAN [23] proposes a multimodal feature-wise attention mechanism that simultaneously models feature-wise co-attention and spatial co-attention between the image and question modalities. (11) MMMAN [11] proposes a multi-level mesh mutual attention model that exploits low-dimensional and high-dimensional question information at different levels, providing richer feature information for modal interaction.
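
Several of these baselines, like our STCGN, refine detected visual entities with question-conditioned message passing. The sketch below shows one generic message-passing step of this kind in PyTorch; the layer choices, dimensions, and class name are assumptions for illustration and do not reproduce LCGN [7] or STCGN exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedMessagePassing(nn.Module):
    """One generic question-conditioned message-passing step over N visual entities.

    Shapes and layers are illustrative only (not LCGN's or STCGN's exact design).
    """
    def __init__(self, ent_dim=2048, cond_dim=1024, hid_dim=1024):
        super().__init__()
        self.cond = nn.Linear(cond_dim, hid_dim)      # project question/phrase condition
        self.send = nn.Linear(ent_dim, hid_dim)       # sender (key) projection
        self.recv = nn.Linear(ent_dim, hid_dim)       # receiver (query) projection
        self.value = nn.Linear(ent_dim, hid_dim)      # message content
        self.update = nn.Linear(ent_dim + hid_dim, ent_dim)

    def forward(self, entities, condition):
        # entities: (N, ent_dim); condition: (cond_dim,)
        keys = self.send(entities) * self.cond(condition)     # condition-modulated senders
        queries = self.recv(entities)                         # receivers
        edge_attn = F.softmax(queries @ keys.t() / keys.shape[-1] ** 0.5, dim=-1)  # (N, N) edge weights
        messages = edge_attn @ self.value(entities)           # aggregated message per receiver
        return torch.relu(self.update(torch.cat([entities, messages], dim=-1)))

# Toy usage: 36 regions of dim 2048, a 1024-d question vector, T = 4 refinement steps.
# mp = ConditionedMessagePassing()
# ents, q = torch.randn(36, 2048), torch.randn(1024)
# for _ in range(4):
#     ents = mp(ents, q)
```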

1.3 A.3 Settings

In the experiments, Adamax was selected as the optimizer with an initial learning rate of 0.001. As training progressed, the learning rate was gradually increased from 0.001 to 0.004; once the model accuracy peaked, the learning rate started to decay with a decay rate of 0.5. The model was trained for 30 epochs. For text, the word embedding dimension of the question was 300 and the part-of-speech embedding dimension was 128. In the hierarchical tree convolution module, the word-level convolution uses a 3 × 428 convolution kernel with the maximum syntax subtree length set to 4, and the phrase-level convolution uses a multi-head attention mechanism with 8 heads. To extract question features, we use a bidirectional GRU with a hidden dimension of 1024. For image features, the dimension is set to 2048, each image contains 10–100 visual entity regions, the total number of message-passing time steps T is set to 4, and the dimension of the scene context-aware entity features is set to 1024. In our experiments, the VQA score provided by the official VQA challenge is used as the evaluation metric, shown as follows:

$$\begin{aligned} \mathrm {acc}({\textbf {ans}}) = \min \left( \frac{\#\,\mathrm {humans\ that\ provided}\ {\textbf {ans}}}{3},\ 1\right) \end{aligned}$$
(15)
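
A minimal sketch of this metric is given below; it implements Eq. (15) directly, counting how many of the 10 human answers match the prediction (function and variable names are illustrative, and the extra averaging over annotator subsets performed by the official evaluation script is omitted).

```python
def vqa_accuracy(predicted_answer, human_answers):
    """VQA score of a single prediction as in Eq. (15).

    `human_answers` holds the 10 annotator answers for the question. The
    prediction scores 1.0 if at least 3 annotators gave the same answer,
    and (#matches / 3) otherwise.
    """
    matches = sum(1 for ans in human_answers if ans == predicted_answer)
    return min(matches / 3.0, 1.0)

# Toy usage: three of ten annotators answered "red" -> score 1.0
# vqa_accuracy("red", ["red", "red", "red", "blue", "blue", "2", "no", "red car", "yes", "pink"])
```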


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Su, X., Zhang, Q., Shi, C., Liu, J., Hu, L. (2024). Syntax Tree Constrained Graph Network for Visual Question Answering. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14451. Springer, Singapore. https://doi.org/10.1007/978-981-99-8073-4_10


  • DOI: https://doi.org/10.1007/978-981-99-8073-4_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8072-7

  • Online ISBN: 978-981-99-8073-4

  • eBook Packages: Computer Science, Computer Science (R0)
