
ST-VQA: shrinkage transformer with accurate alignment for visual question answering

Applied Intelligence

Abstract

Although transformer-based models have been remarkably successful in visual question answering (VQA), the way they align vision and language features remains simple and coarse. In recent years, this shortcoming has been further amplified by the popularity of vision-language pretraining, which has slowed the development of effective architectures for multimodal alignment. Motivated by this, we propose the Shrinkage Transformer for Visual Question Answering (ST-VQA), a framework that aims to achieve more accurate multimodal alignment than the standard transformer. First, ST-VQA uses the region features of an image as the visual representation. Second, between transformer layers, it reduces the number of visual regions through feature fusion and enforces distinctiveness among the newly fused regions with a contrastive loss. Finally, the visual and textual features are fused and used to predict the answer. Extensive experiments demonstrate that, without pretraining, the proposed method outperforms the standard transformer and surpasses several state-of-the-art methods on the VQA-v2 dataset.
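The core of this pipeline is the shrinkage step between transformer layers: region tokens produced by an object detector are fused into a smaller set of new regions, and a contrastive term keeps the fused regions from collapsing into near-duplicates. Below is a minimal PyTorch sketch of this idea; it is not the authors' implementation, and the module names, the soft-assignment fusion, and the similarity penalty used as the contrastive term are illustrative assumptions.

# Minimal sketch (not the authors' released code) of region shrinkage between
# transformer layers with a contrastive penalty; the fusion scheme and loss
# form are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionShrinkage(nn.Module):
    """Fuse the current set of region tokens into a smaller set of new regions."""
    def __init__(self, dim, n_out):
        super().__init__()
        self.assign = nn.Linear(dim, n_out)  # soft assignment of inputs to outputs
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions):                                 # (B, n_in, dim)
        weights = self.assign(regions).softmax(dim=1)            # (B, n_in, n_out)
        fused = torch.einsum('bnd,bnm->bmd', regions, weights)   # (B, n_out, dim)
        return self.proj(fused)

def region_contrastive_loss(fused, temperature=0.1):
    """Penalize similarity between distinct fused regions of the same image."""
    z = F.normalize(fused, dim=-1)                               # (B, n_out, dim)
    sim = torch.matmul(z, z.transpose(1, 2)) / temperature       # (B, n_out, n_out)
    mask = torch.eye(z.size(1), dtype=torch.bool, device=z.device)
    off_diag = sim.masked_fill(mask, float('-inf'))              # ignore self-similarity
    return torch.logsumexp(off_diag, dim=-1).mean()

# Example: shrink 36 detector region features (e.g. from Faster R-CNN) to 18.
regions = torch.randn(2, 36, 512)
shrink = RegionShrinkage(dim=512, n_out=18)
fused = shrink(regions)
loss = region_contrastive_loss(fused)

In the full model, such a shrinkage step would sit between self-attention layers, so that later layers attend over progressively fewer, more distinctive regions before they are fused with the question features for answer prediction.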


Data Availability

• The VQA-v2 dataset analyzed during the current study is available at https://visualqa.org/download.html.

• The COCO-QA dataset analyzed during the current study is available at http://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/.

• The GQA dataset analyzed during the current study is available at https://cs.stanford.edu/people/dorarad/gqa/download.html.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62106054) and the Major Special Projects of Guangxi Science and Technology (Grant No. AA20302003). We thank all the reviewers for their constructive comments and helpful suggestions.

Author information

Corresponding author

Correspondence to Shuxiang Song.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xia, H., Lan, R., Li, H. et al. ST-VQA: shrinkage transformer with accurate alignment for visual question answering. Appl Intell 53, 20967–20978 (2023). https://doi.org/10.1007/s10489-023-04564-x

