Abstract
Visual Question Answering (VQA) methods have been widely shown to answer questions in a biased way, owing to distribution differences in answer samples between training and testing, which leads to performance degradation. While numerous efforts have achieved promising results in overcoming language bias, broader implications of the problem (e.g., the trustworthiness of current VQA model predictions) remain unexplored. In this paper, we offer a different viewpoint on the problem from the perspective of model uncertainty. In a series of empirical studies on the VQA-CP v2 dataset, we find that current VQA models are often biased towards giving obviously incorrect answers with high confidence, i.e., being overconfident, which indicates high uncertainty. In light of this observation, we: (1) design a novel metric for monitoring model overconfidence, and (2) propose a model calibration method to address the overconfidence issue, thereby making the model more reliable and better at generalization. The calibration method explicitly imposes constraints on model predictions to make the model less confident during training. It has the advantage of being model-agnostic and computationally efficient. Experiments demonstrate that VQA approaches exhibiting overconfidence are usually negatively impacted in terms of generalization, and that their performance and trustworthiness can be boosted by adopting our calibration method. Code is available at https://github.com/HCI-LMC/VQA-Uncertainty
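The abstract describes the calibration method only at a high level (constraints on model predictions that discourage overconfidence during training, in a model-agnostic and cheap way); the exact formulation is not given here. As an illustration of that general idea only, and not of the paper's method, the following is a minimal sketch of a common confidence-penalty regularizer added to a standard VQA training loss. The function name, the penalty weight, and the choice of an entropy-based penalty are assumptions for this example.

```python
import torch
import torch.nn.functional as F


def calibrated_vqa_loss(logits, targets, penalty_weight=0.1):
    """Hypothetical sketch: standard VQA loss plus a confidence penalty.

    logits:  (batch, num_answers) raw scores from any VQA model
    targets: (batch, num_answers) soft answer scores in [0, 1]
    The entropy term discourages near-one-hot (overconfident) predictions;
    it is model-agnostic and adds negligible compute.
    """
    # Standard multi-label VQA objective (binary cross-entropy on soft answer scores).
    base_loss = F.binary_cross_entropy_with_logits(logits, targets)

    # Confidence penalty: encourage higher prediction entropy by subtracting it from the loss.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return base_loss - penalty_weight * entropy
```

Because such a term only touches the output distribution, it can be bolted onto any backbone without architectural changes, which is consistent with the model-agnostic, low-overhead property claimed above.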







Data availability
Code for this manuscript is available at https://github.com/HCI-LMC/VQA-Uncertainty
Notes
Confidence intervals, confidence levels and confidence bins are hereafter used interchangeably.
Frequent answers are those that appear more often within the same question type in the dataset, while sparse answers are those that appear less frequently [35].
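The notes above refer to confidence bins. For illustration only (the paper's own overconfidence metric is not reproduced here), a standard way to compare binned confidence against accuracy is the expected calibration error; overconfidence appears when a bin's average confidence exceeds its accuracy. The function name, the equal-width binning, and the number of bins below are assumptions for this sketch.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE over equal-width confidence bins (not the paper's metric).

    confidences: (N,) max predicted probability per question
    correct:     (N,) 1 if the predicted answer is correct, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            avg_conf = confidences[in_bin].mean()
            avg_acc = correct[in_bin].mean()
            # Overconfidence shows up as avg_conf exceeding avg_acc within a bin.
            ece += in_bin.mean() * abs(avg_conf - avg_acc)
    return ece
```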
References
Cheng, Y., Huang, F., Zhou, L., Jin, C., Zhang, Y., Zhang, T.: A hierarchical multimodal attention-based neural network for image captioning. In: ACM SIGIR, pp. 889–892 (2017)
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual question answering. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 2425–2433 (2015)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)
Xiao, J., Zhou, P., Yao, A., Li, Y., Hong, R., Yan, S., Chua, T.-S.: Contrastive video question answering via video graph transformer. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
Xiao, J., Yao, A., Li, Y., Chua, T.-S.: Can i trust your answer? visually grounded video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13204–13214 (2024)
Nawaz, H.S., Shi, Z., Gan, Y., Hirpa, A., Dong, J., Zheng, H.: Temporal moment localization via natural language by utilizing video question answers as a special variant and bypassing nlp for corpora. IEEE Trans. Circuits Syst. Video Technol. 32(9), 6174–6185 (2022)
Zhang, J., Shao, J., Cao, R., Gao, L., Xu, X., Shen, H.T.: Action-centric relation transformer network for video question answering. IEEE Trans. Circuits Syst. Video Technol. 32(1), 63–74 (2020)
Dong, J., Li, X., Xu, C., Yang, X., Yang, G., Wang, X., Wang, M.: Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
Yang, X., Dong, J., Cao, Y., Wang, X., Wang, M., Chua, T.-S.: Tree-augmented cross-modal encoding for complex-query video retrieval. In: ACM SIGIR, pp. 1339–1348 (2020)
Yang, X., Feng, F., Ji, W., Wang, M., Chua, T.-S.: Deconfounded video moment retrieval with causal intervention. In: ACM SIGIR, pp. 1–10 (2021)
Yang, X., Wang, S., Dong, J., Dong, J., Wang, M., Chua, T.-S.: Video moment retrieval with cross-modal neural architecture search. IEEE Trans. Image Process. 31, 1204–1216 (2022)
Ben-younes, H., Cadene, R., Cord, M., Thome, N.: Mutan: multimodal tucker fusion for visual question answering. Proc. IEEE Int. Conf. Comput. Vis. (2017). https://doi.org/10.1109/ICCV.2017.285
Zhao, L., Cai, D., Zhang, J., Sheng, L., Xu, D., Zheng, R., Zhao, Y., Wang, L., Fan, X.: Towards explainable 3d grounded visual question answering: a new benchmark and strong baseline. IEEE Trans. Circuits Syst. Video Technol. (2022)
Gupta, V., Li, Z., Kortylewski, A., Zhang, C., Li, Y., Yuille, A.: Swapmix: Diagnosing and regularizing the over-reliance on visual context in visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5078–5088 (2022)
Ding, Y., Yu, J., Liu, B., Hu, Y., Cui, M., Wu, Q.: Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5089–5098 (2022)
Wu, J., Mooney, R.: Self-critical reasoning for robust visual question answering. Adv. Neural Inf. Process Syst. 32 (2019)
Clark, C., Yatskar, M., Zettlemoyer, L.: Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In: Proc. Conf. Empir. Methods Nat. Lang. Process & Joint Conf. Nat. Lang. Process, Hong Kong, China, pp. 4069–4082 (2019). https://doi.org/10.18653/v1/D19-1418
Chen, L., Yan, X., Xiao, J., Zhang, H., Pu, S., Zhuang, Y.: Counterfactual samples synthesizing for robust visual question answering. Proc. IEEE Conf. Comput. Vis. Pattern. Recognit. (2020). https://doi.org/10.1109/CVPR42600.2020.01081
Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don't just assume; look and answer: Overcoming priors for visual question answering. Proc. IEEE Conf. Comput. Vis. Pattern. Recognit. (2018). https://doi.org/10.1109/CVPR.2018.00522
Cadene, R., Dancette, C., Cord, M., Parikh, D., et al.: Rubi: Reducing unimodal biases for visual question answering. Adv. Neural Inf. Process Syst. 32 (2019)
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6077–6086 (2018). https://doi.org/10.1109/CVPR.2018.00636
Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. In: Adv. Neural Inf. Process Syst. NIPS’18, pp. 1548–1558. Curran Associates Inc., Red Hook, NY, USA (2018)
Zhu, X., Mao, Z., Liu, C., Zhang, P., Wang, B., Zhang, Y.: Overcoming language priors with self-supervised learning for visual question answering. In: Proc. Int. Joint Conf. Artif. Intell. (2021)
Han, Y., Nie, L., Yin, J., Wu, J., Yan, Y.: Visual perturbation-aware collaborative learning for overcoming the language prior problem. ArXiv abs/2207.11850 (2022)
Teney, D., Abbasnejad, E., Kafle, K., Shrestha, R., Kanan, C., Van Den Hengel, A.: On the value of out-of-distribution testing: an example of goodhart’s law. Adv. Neural. Inf. Process. Syst. 33, 407–417 (2020)
Basu, A., Addepalli, S., Babu, R.V.: Rmlvqa: A margin loss approach for visual question answering with language biases. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11671–11680 (2023)
Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.-S., Wen, J.-R.: Counterfactual VQA: A cause-effect look at language bias. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2021). https://doi.org/10.1109/CVPR46437.2021.01251
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 21–29 (2016). https://doi.org/10.1109/CVPR.2016.10
Kim, J.-H., Jun, J., Zhang, B.-T.: Bilinear attention networks. Adv. Neural Inf. Process Syst. 31 (2018)
Kervadec, C., Antipov, G., Baccouche, M., Wolf, C.: Roses are red, violets are blue... but should VQA expect them to? In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2776–2785 (2021)
Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 10312–10321 (2019). https://doi.org/10.1109/ICCV.2019.01041
Zhao, J., Zhang, X., Wang, X., Yang, Y., Sun, G.: Overcoming language priors in VQA via adding visual module. Neural Comput. Appl. 34(11), 9015–9023 (2022)
Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Mutant: A training paradigm for out-of-distribution generalization in visual question answering. In: Proc. Conf. Empir. Methods Nat. Lang. Process, pp. 878–892 (2020)
Liang, Z., Jiang, W., Hu, H., Zhu, J.: Learning to contrast the counterfactual samples for robust visual question answering. In: Proc. Conf. Empir. Methods Nat. Lang. Process, pp. 3285–3292 (2020)
Guo, Y., Nie, L., Cheng, Z., Tian, Q., Zhang, M.: Loss re-scaling VQA: revisiting the language prior problem from a class-imbalance view. IEEE Trans. Image Process. 31, 227–238 (2022). https://doi.org/10.1109/TIP.2021.3128322
Shrestha, R., Kafle, K., Kanan, C.: A negative case analysis of visual grounding methods for VQA. In: Proc. Assoc. Comput. Linguist., pp. 8172–8181 (2020). https://doi.org/10.18653/v1/2020.acl-main.727
Cao, R., Li, Z.: Overcoming language priors for visual question answering via loss rebalancing label and global context. In: Uncertainty in Artificial Intelligence, pp. 249–259 (2023). PMLR
Wang, D.-B., Feng, L., Zhang, M.-L.: Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. Adv. Neural. Inf. Process. Syst. 34, 11809–11820 (2021)
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: International Conference on Machine Learning, pp. 1321–1330 (2017). PMLR
Zhang, J., Kailkhura, B., Han, T.Y.-J.: Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In: International Conference on Machine Learning, pp. 11117–11128 (2020). PMLR
Gruber, S.G., Buettner, F.: Better uncertainty calibration via proper scores for classification and beyond. In: Advances in Neural Information Processing Systems (2022)
Ghosh, A., Schaaf, T., Gormley, M.R.: Adafocal: Calibration-aware adaptive focal loss. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022). https://openreview.net/forum?id=kUOm0Fdtvh
Gupta, C., Ramdas, A.: Top-label calibration and multiclass-to-binary reductions. In: International Conference on Learning Representations (2022)
Ren, A.Z., Clark, J., Dixit, A., Itkina, M., Majumdar, A., Sadigh, D.: Explore until confident: Efficient exploration for embodied question answering. In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024
Munir, M.A., Khan, M.H., Khan, S., Khan, F.S.: Bridging precision and confidence: A train-time loss for calibrating object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11474–11483 (2023)
Zhu, Q., Zheng, C., Zhang, Z., Shao, W., Zhang, D.: Dynamic confidence-aware multi-modal emotion recognition. IEEE Trans. Affect. Comput. (2023)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Gawlikowski, J., Tassi, C.R.N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R., et al.: A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342 (2021)
Xie, M., Han, Z., Zhang, C., Bai, Y., Hu, Q.: Exploring and exploiting uncertainty for incomplete multi-view classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19873–19882 (2023)
Zhong, Z., Cui, J., Liu, S., Jia, J.: Improving calibration for long-tailed recognition. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16489–16498 (2021)
Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100–5111 (2019)
Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Adv. Neural. Inf. Process. Syst. 35, 32897–32912 (2022)
Pan, Y., Liu, J., Jin, L., Li, Z.: Unbiased visual question answering by leveraging instrumental variable. IEEE Trans. Multimedia, 1–16 (2024). https://doi.org/10.1109/TMM.2024.3355640
Si, Q., Liu, Y., Meng, F., Lin, Z., Fu, P., Cao, Y., Wang, W., Zhou, J.: Towards robust visual question answering: Making the most of biased samples via contrastive learning. In: Conference on Empirical Methods in Natural Language Processing (2022)
Wu, Y., Zhao, Y., Zhao, S., Zhang, Y., Yuan, X., Zhao, G., Jiang, N.: Overcoming language priors in visual question answering via distinguishing superficially similar instances. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 5721–5729 (2022)
Li, Y., Hu, B., Zhang, F., Yu, Y., Liu, J., Chen, Y., Xu, J.: A multi-modal debiasing model with dynamical constraint for robust visual question answering. In: Annual Meeting of the Association for Computational Linguistics (2023)
Shu, X., Yan, S., Yang, X., Wu, Z., Chen, Z., Lu, Z.: Sc-ml: Self-supervised counterfactual metric learning for debiased visual question answering. ArXiv abs/2304.01647 (2023)
Bi, Y., Jiang, H., Hu, Y., Sun, Y., Yin, B.: See and learn more: dense caption-aware representation for visual question answering. IEEE Trans. Circuits Syst. Video Technol. 34(2), 1135–1146 (2024). https://doi.org/10.1109/TCSVT.2023.3291379
Funding
This work was supported by the National Natural Science Foundation of China under Grants 61932009 and 62202139.
Author information
Contributions
Xuesong Zhang and Jun He contributed equally. Jia Li is the corresponding author. Xuesong Zhang was primarily responsible for the code implementation and contributed to writing the initial draft of the manuscript. Jun He wrote the main text of the manuscript. Jia Zhao designed the methodology. Zhenzhen Hu, Xun Yang, and Jia Li refined the methodology and revised the manuscript content. Richang Hong provided experimental resources, supervised and guided the research, and revised the manuscript. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, X., He, J., Zhao, J. et al. Exploring and exploiting model uncertainty for robust visual question answering. Multimedia Systems 30, 348 (2024). https://doi.org/10.1007/s00530-024-01560-0
DOI: https://doi.org/10.1007/s00530-024-01560-0