
Syntax Tree Constrained Graph Network for Visual Question Answering

  • Conference paper
Neural Information Processing (ICONIP 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14451)


Abstract

Visual Question Answering (VQA) aims to automatically answer natural language questions about given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntax information of the question, which plays a vital role in understanding its essential semantics and in guiding visual feature refinement. To fill this gap, we propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax trees. The model extracts a syntax tree from the question to obtain more precise syntactic information. Specifically, we parse each question with the Stanford syntax parsing tool to obtain its syntax tree, and extract syntactic phrase features and question features at the word and phrase levels using a hierarchical tree convolutional network. We then design a phrase-aware message-passing mechanism over visual entities that captures entity features according to the given visual context. Extensive experiments on the VQA 2.0 dataset demonstrate the superiority of our proposed model.
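
As an illustration of the parsing step described above, the sketch below obtains a question's constituency (syntax) tree with Stanza, a Python package for the Stanford NLP models; the paper only states that the Stanford syntax parsing tool is used, so this particular library and pipeline configuration are assumptions made for illustration.

```python
# Minimal sketch (assumption): extract a question's constituency parse with Stanza,
# a Python interface to Stanford NLP models. The paper's exact tooling may differ.
import stanza

stanza.download("en")  # fetch English models once
nlp = stanza.Pipeline("en", processors="tokenize,pos,constituency")

question = "What color is the umbrella held by the woman?"
doc = nlp(question)

# Each parsed sentence exposes its constituency tree; in STCGN such a tree would be
# consumed by the hierarchical tree convolutional network for phrase-level features.
for sentence in doc.sentences:
    print(sentence.constituency)
```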

X. Su and Q. Zhang—The first two authors contributed equally to this work.


References

1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086 (2018)
2. Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
3. Ben-Younes, H., Cadène, R., Cord, M., Thome, N.: MUTAN: multimodal Tucker fusion for visual question answering. In: ICCV, pp. 2631–2639 (2017)
4. Cadène, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering. In: CVPR, pp. 1989–1998 (2019)
5. Guo, D., Xu, C., Tao, D.: Bilinear graph networks for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 34, 1023–1034 (2021)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
7. Hu, R., Rohrbach, A., Darrell, T., Saenko, K.: Language-conditioned graph networks for relational reasoning. In: ICCV, pp. 10293–10302 (2019)
8. Ilievski, I., Feng, J.: Generative attention model with adversarial self-learning for visual question answering. In: Thematic Workshops of ACM Multimedia 2017, pp. 415–423 (2017)
9. Kim, J., Jun, J., Zhang, B.: Bilinear attention networks. In: NeurIPS, pp. 1571–1581 (2018)
10. Kumar, A., et al.: Ask me anything: dynamic memory networks for natural language processing. In: ICML, vol. 48, pp. 1378–1387 (2016)
11. Lei, Z., Zhang, G., Wu, L., Zhang, K., Liang, R.: A multi-level mesh mutual attention model for visual question answering. Data Sci. Eng. 7(4), 339–353 (2022)
12. Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: ICCV, pp. 10312–10321 (2019)
13. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
14. Nguyen, D.K., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR, pp. 6087–6096 (2018)
15. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
16. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: NIPS, vol. 28 (2015)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
18. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29 (2016)
19. Ye, T., Hu, L., Zhang, Q., Lai, Z.Y., Naseem, U., Liu, D.D.: Show me the best outfit for a certain scene: a scene-aware fashion recommender system. In: WWW, pp. 1172–1180 (2023)
20. Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D.: Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 29(12), 5947–5959 (2018)
21. Zhang, Q., Cao, L., Shi, C., Niu, Z.: Neural time-aware sequential recommendation by jointly modeling preference dynamics and explicit feature couplings. IEEE Trans. Neural Netw. Learn. Syst. 33(10), 5125–5137 (2022)
22. Zhang, Q., Hu, L., Cao, L., Shi, C., Wang, S., Liu, D.D.: A probabilistic code balance constraint with compactness and informativeness enhancement for deep supervised hashing. In: IJCAI, pp. 1651–1657 (2022)
23. Zhang, S., Chen, M., Chen, J., Zou, F., Li, Y., Lu, P.: Multimodal feature-wise co-attention method for visual question answering. Inf. Fusion 73, 1–10 (2021)
24. Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)

Author information


Corresponding author

Correspondence to Chongyang Shi.


A Experimental Settings

1.1 A.1 Datasets

The VQA 2.0 dataset consists of real images from MSCOCO [13] and follows the same training/validation/test split: the training set contains more than 80k images and 444k questions, the validation set 40k images and 214k questions, and the test set 80k images and 448k questions. Each image is associated with at least three questions on average. Based on the type of answer, the questions fall into three categories: yes/no, number, and other. For each image-question pair, 10 answers were collected from human annotators, and the most frequent answer is taken as the ground truth. The dataset contains both open-ended and multiple-choice questions. In this paper, we focus on the open-ended setting and take answers that appear more than 9 times in the training set as candidate answers, yielding 3129 candidate answers.
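
As a concrete illustration of this candidate-answer construction, the sketch below filters training-set answers by frequency. The threshold (answers occurring more than 9 times) comes from the text; the data format and the helper name build_answer_vocab are hypothetical.

```python
from collections import Counter

def build_answer_vocab(train_answers, min_count=9):
    """Keep answers occurring more than `min_count` times in the training set.

    `train_answers` is assumed to be an iterable of (already normalised) answer
    strings, one per human annotation; the real preprocessing also normalises
    punctuation and numbers, which is omitted here for brevity.
    """
    counts = Counter(train_answers)
    # Answers appearing more than 9 times become candidate answers
    # (about 3129 candidates on VQA 2.0 according to the paper).
    vocab = sorted(ans for ans, c in counts.items() if c > min_count)
    return {ans: idx for idx, ans in enumerate(vocab)}

# Toy usage: answer_to_idx = build_answer_vocab(["yes", "no", "yes", "2"])
```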

1.2 A.2 Baselines

We use the following state-of-the-art methods as baselines to evaluate the performance of our STCGN: (1) LSTM-VGG [2] uses an LSTM and VGG to extract text and image features, respectively, and fuses the two through an element-wise product. (2) SAN [18] designs a stacked attention mechanism that locates the image regions most relevant to the question step by step to infer the final answer. (3) DMN [10] constructs four network modules (input, question, episodic memory, and answer) whose cyclic operation yields an iterative attention process. (4) MUTAN [3] proposes a tensor-based Tucker decomposition that uses low-rank matrix factorization to reduce the large number of parameters in traditional bilinear models. (5) BUTD [1] combines bottom-up and top-down attention; the top-down attention computes the correlation between the question and each image region to obtain more accurate question-relevant image features. (6) BAN [9] considers the bilinear interaction between the two input channels and fuses visual and textual information by computing a bilinear attention distribution. (7) LCGN [7] designs a question-aware message-passing mechanism that uses question features to guide the refinement of entity features over multiple iterations, integrating entity features with contextual information. (8) MuRel [4] enriches the interaction between question and image regions through vector representations and a sequence of MuRel units. (9) ReGAT [12] extracts explicit and implicit relations in each image, constructs the corresponding explicit and implicit relation graphs, and uses a question-aware graph attention network to integrate information from the different relation spaces. (10) UFSCAN [23] proposes a multimodal feature-wise attention mechanism that simultaneously models feature-wise co-attention and spatial co-attention between the image and question modalities. (11) MMMAN [11] proposes a multi-level mesh mutual attention model that exploits low-dimensional and high-dimensional question information at different levels, providing richer feature information for modal interaction.
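
Several of these baselines, like our STCGN, refine detected visual entities with question-conditioned message passing. The sketch below shows one generic message-passing step of this kind in PyTorch; the layer choices, dimensions, and class name are assumptions for illustration and do not reproduce LCGN [7] or STCGN exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedMessagePassing(nn.Module):
    """One generic question-conditioned message-passing step over N visual entities.

    Shapes and layers are illustrative only (not LCGN's or STCGN's exact design).
    """
    def __init__(self, ent_dim=2048, cond_dim=1024, hid_dim=1024):
        super().__init__()
        self.cond = nn.Linear(cond_dim, hid_dim)      # project question/phrase condition
        self.send = nn.Linear(ent_dim, hid_dim)       # sender (key) projection
        self.recv = nn.Linear(ent_dim, hid_dim)       # receiver (query) projection
        self.value = nn.Linear(ent_dim, hid_dim)      # message content
        self.update = nn.Linear(ent_dim + hid_dim, ent_dim)

    def forward(self, entities, condition):
        # entities: (N, ent_dim); condition: (cond_dim,)
        keys = self.send(entities) * self.cond(condition)     # condition-modulated senders
        queries = self.recv(entities)                         # receivers
        edge_attn = F.softmax(queries @ keys.t() / keys.shape[-1] ** 0.5, dim=-1)  # (N, N) edge weights
        messages = edge_attn @ self.value(entities)           # aggregated message per receiver
        return torch.relu(self.update(torch.cat([entities, messages], dim=-1)))

# Toy usage: 36 regions of dim 2048, a 1024-d question vector, T = 4 refinement steps.
# mp = ConditionedMessagePassing()
# ents, q = torch.randn(36, 2048), torch.randn(1024)
# for _ in range(4):
#     ents = mp(ents, q)
```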

1.3 A.3 Settings

In the experiments, Adamax was selected as the optimizer with an initial learning rate of 0.001. As training progressed, the learning rate was gradually increased from 0.001 to 0.004; once the model accuracy peaked, the learning rate started to decay with a decay rate of 0.5. The model was trained for 30 epochs. For text, the word embedding dimension of the question was 300 and the part-of-speech embedding dimension was 128. In the hierarchical tree convolution module, the word-level convolution uses a 3 × 428 convolution kernel with the maximum syntax subtree length set to 4, and the phrase-level convolution uses a multi-head attention mechanism with 8 heads. To extract question features, we use a bidirectional GRU with a hidden dimension of 1024. For image features, the dimension is set to 2048, each image contains 10–100 visual entity regions, the total number of message-passing time steps T is set to 4, and the dimension of the scene context-aware entity features is set to 1024. In our experiments, the VQA score provided by the official VQA challenge is used as the evaluation metric, shown as follows:

$$\begin{aligned} \mathrm {acc}({\textbf {ans}}) = \min \left( \frac{\#\,\mathrm {humans\ that\ provided}\ {\textbf {ans}}}{3},\ 1\right) \end{aligned}$$
(15)
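
A minimal sketch of this metric is given below; it implements Eq. (15) directly, counting how many of the 10 human answers match the prediction (function and variable names are illustrative, and the extra averaging over annotator subsets performed by the official evaluation script is omitted).

```python
def vqa_accuracy(predicted_answer, human_answers):
    """VQA score of a single prediction as in Eq. (15).

    `human_answers` holds the 10 annotator answers for the question. The
    prediction scores 1.0 if at least 3 annotators gave the same answer,
    and (#matches / 3) otherwise.
    """
    matches = sum(1 for ans in human_answers if ans == predicted_answer)
    return min(matches / 3.0, 1.0)

# Toy usage: three of ten annotators answered "red" -> score 1.0
# vqa_accuracy("red", ["red", "red", "red", "blue", "blue", "2", "no", "red car", "yes", "pink"])
```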


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Su, X., Zhang, Q., Shi, C., Liu, J., Hu, L. (2024). Syntax Tree Constrained Graph Network for Visual Question Answering. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14451. Springer, Singapore. https://doi.org/10.1007/978-981-99-8073-4_10


  • DOI: https://doi.org/10.1007/978-981-99-8073-4_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8072-7

  • Online ISBN: 978-981-99-8073-4

  • eBook Packages: Computer Science, Computer Science (R0)
