Abstract
While transformer-based models have been remarkably successful in visual question answering (VQA), their approaches to aligning vision and language features remain simple and coarse. In recent years, this shortcoming has been further amplified by the popularity of vision-language pretraining, leaving the design of effective architectures for multimodal alignment underexplored. Motivated by this, we propose the shrinkage transformer-visual question answering (ST-VQA) framework, which aims to achieve more accurate multimodal alignment than the standard transformer. First, ST-VQA uses the region features of an image as the visual representation. Second, between transformer layers, ST-VQA reduces the number of visual regions through feature fusion and enforces distinctiveness among the newly fused regions with a contrastive loss. Finally, the visual and textual features are fused and used to predict the answer. Extensive experiments demonstrate that, without pretraining, our proposed method achieves better performance than the standard transformer and outperforms several state-of-the-art methods on the VQA-v2 dataset.
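To make the shrinkage idea concrete, the sketch below shows one way such a region-fusion step between transformer layers could look in PyTorch. It is a minimal illustration only: the module name `ShrinkageBlock`, the learned linear assignment, and the temperature value are assumptions for exposition, not the paper's exact design.

```python
# Minimal sketch of the "shrinkage" step described in the abstract:
# N region features are fused into M < N new regions, and a
# contrastive-style loss pushes the fused regions apart.
# Names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShrinkageBlock(nn.Module):
    def __init__(self, dim: int, num_out: int, temperature: float = 0.1):
        super().__init__()
        # Learnable soft assignment of input regions to num_out fused regions.
        self.assign = nn.Linear(dim, num_out)
        self.temperature = temperature

    def forward(self, regions: torch.Tensor):
        # regions: (batch, num_in, dim) visual region features from the previous layer
        weights = self.assign(regions).softmax(dim=1)           # (B, num_in, num_out)
        fused = torch.einsum('bnd,bnm->bmd', regions, weights)  # (B, num_out, dim)

        # Contrastive-style loss: each fused region should be most similar to
        # itself, i.e. remain distinct from the other fused regions of the image.
        z = F.normalize(fused, dim=-1)
        sim = torch.einsum('bmd,bkd->bmk', z, z) / self.temperature  # (B, M, M)
        targets = torch.arange(z.size(1), device=z.device).expand(z.size(0), -1)
        contrast_loss = F.cross_entropy(sim.reshape(-1, z.size(1)), targets.reshape(-1))
        return fused, contrast_loss

# Usage: shrink 36 region features (dim 512) down to 12 fused regions.
block = ShrinkageBlock(dim=512, num_out=12)
fused, loss = block(torch.randn(2, 36, 512))
print(fused.shape, loss.item())  # torch.Size([2, 12, 512])
```

In this sketch each fused region is a convex combination of the input regions, so the number of visual tokens shrinks layer by layer while the contrastive term discourages the fused regions from collapsing onto one another.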
Data Availability
⦁ The VQA-v2 dataset analyzed during the current study is available at https://visualqa.org/download.html.
⦁ The COCO-QA dataset analyzed during the current study is available at http://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/.
⦁ The GQA dataset analyzed during the current study is available at https://cs.stanford.edu/people/dorarad/gqa/download.html.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62106054) and the Major Special Projects of Guangxi Science and Technology (Grant No. AA20302003). We thank all the reviewers for their constructive comments and helpful suggestions.
Ethics declarations
Conflict of Interest
⦁ The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xia, H., Lan, R., Li, H. et al. ST-VQA: shrinkage transformer with accurate alignment for visual question answering. Appl Intell 53, 20967–20978 (2023). https://doi.org/10.1007/s10489-023-04564-x