Abstract
Visual Question Answering (VQA) methods have been widely shown to answer questions in a biased way, owing to distribution differences in answer samples between training and testing, which leads to performance degradation. While numerous efforts have achieved promising results in overcoming language bias, broader implications of the problem (e.g., the trustworthiness of current VQA model predictions) remain unexplored. In this paper, we offer a different viewpoint on the problem from the perspective of model uncertainty. In a series of empirical studies on the VQA-CP v2 dataset, we find that current VQA models are often biased towards giving obviously incorrect answers with high confidence, i.e., being overconfident, which indicates high uncertainty. In light of this observation, we: (1) design a novel metric for monitoring model overconfidence, and (2) propose a model calibration method to address the overconfidence issue, thereby making the model more reliable and better at generalization. The calibration method explicitly imposes constraints on model predictions to make the model less confident during training. It has the advantage of being model-agnostic and computationally efficient. Experiments demonstrate that VQA approaches exhibiting overconfidence are usually negatively impacted in terms of generalization, and that their performance and trustworthiness can be boosted by adopting our calibration method. Code is available at https://github.com/HCI-LMC/VQA-Uncertainty
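The abstract describes the calibration method only at a high level (constraints on model predictions that discourage overconfidence during training, in a model-agnostic and cheap way); the exact formulation is not given here. As an illustration of that general idea only, and not of the paper's method, the following is a minimal sketch of a common confidence-penalty regularizer added to a standard VQA training loss. The function name, the penalty weight, and the choice of an entropy-based penalty are assumptions for this example.

```python
import torch
import torch.nn.functional as F


def calibrated_vqa_loss(logits, targets, penalty_weight=0.1):
    """Hypothetical sketch: standard VQA loss plus a confidence penalty.

    logits:  (batch, num_answers) raw scores from any VQA model
    targets: (batch, num_answers) soft answer scores in [0, 1]
    The entropy term discourages near-one-hot (overconfident) predictions;
    it is model-agnostic and adds negligible compute.
    """
    # Standard multi-label VQA objective (binary cross-entropy on soft answer scores).
    base_loss = F.binary_cross_entropy_with_logits(logits, targets)

    # Confidence penalty: encourage higher prediction entropy by subtracting it from the loss.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return base_loss - penalty_weight * entropy
```

Because such a term only touches the output distribution, it can be bolted onto any backbone without architectural changes, which is consistent with the model-agnostic, low-overhead property claimed above.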







Data availability
Code for this manuscript is available at https://github.com/HCI-LMC/VQA-Uncertainty
Notes
Confidence intervals, confidence levels and confidence bins are hereafter used interchangeably.
Frequent answers are those that appear more often within the same question type in the dataset, while sparse answers are those that appear less frequently [35].
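The notes above refer to confidence bins. For illustration only (the paper's own overconfidence metric is not reproduced here), a standard way to compare binned confidence against accuracy is the expected calibration error; overconfidence appears when a bin's average confidence exceeds its accuracy. The function name, the equal-width binning, and the number of bins below are assumptions for this sketch.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE over equal-width confidence bins (not the paper's metric).

    confidences: (N,) max predicted probability per question
    correct:     (N,) 1 if the predicted answer is correct, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            avg_conf = confidences[in_bin].mean()
            avg_acc = correct[in_bin].mean()
            # Overconfidence shows up as avg_conf exceeding avg_acc within a bin.
            ece += in_bin.mean() * abs(avg_conf - avg_acc)
    return ece
```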
References
Cheng, Y., Huang, F., Zhou, L., Jin, C., Zhang, Y., Zhang, T.: A hierarchical multimodal attention-based neural network for image captioning. In: ACM SIGIR, pp. 889–892 (2017)
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual question answering. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 2425–2433 (2015)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)
Xiao, J., Zhou, P., Yao, A., Li, Y., Hong, R., Yan, S., Chua, T.-S.: Contrastive video question answering via video graph transformer. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
Xiao, J., Yao, A., Li, Y., Chua, T.-S.: Can i trust your answer? visually grounded video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13204–13214 (2024)
Nawaz, H.S., Shi, Z., Gan, Y., Hirpa, A., Dong, J., Zheng, H.: Temporal moment localization via natural language by utilizing video question answers as a special variant and bypassing nlp for corpora. IEEE Trans. Circuits Syst. Video Technol. 32(9), 6174–6185 (2022)
Zhang, J., Shao, J., Cao, R., Gao, L., Xu, X., Shen, H.T.: Action-centric relation transformer network for video question answering. IEEE Trans. Circuits Syst. Video Technol. 32(1), 63–74 (2020)
Dong, J., Li, X., Xu, C., Yang, X., Yang, G., Wang, X., Wang, M.: Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
Yang, X., Dong, J., Cao, Y., Wang, X., Wang, M., Chua, T.-S.: Tree-augmented cross-modal encoding for complex-query video retrieval. In: ACM SIGIR, pp. 1339–1348 (2020)
Yang, X., Feng, F., Ji, W., Wang, M., Chua, T.-S.: Deconfounded video moment retrieval with causal intervention. In: ACM SIGIR, pp. 1–10 (2021)
Yang, X., Wang, S., Dong, J., Dong, J., Wang, M., Chua, T.-S.: Video moment retrieval with cross-modal neural architecture search. IEEE Trans. Image Process. 31, 1204–1216 (2022)
Ben-younes, H., Cadene, R., Cord, M., Thome, N.: Mutan: multimodal tucker fusion for visual question answering. Proc. IEEE Int. Conf. Comput. Vis. (2017). https://doi.org/10.1109/ICCV.2017.285
Zhao, L., Cai, D., Zhang, J., Sheng, L., Xu, D., Zheng, R., Zhao, Y., Wang, L., Fan, X.: Towards explainable 3d grounded visual question answering: a new benchmark and strong baseline. IEEE Trans. Circuits Syst. Video Technol. (2022)
Gupta, V., Li, Z., Kortylewski, A., Zhang, C., Li, Y., Yuille, A.: Swapmix: Diagnosing and regularizing the over-reliance on visual context in visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5078–5088 (2022)
Ding, Y., Yu, J., Liu, B., Hu, Y., Cui, M., Wu, Q.: Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5089–5098 (2022)
Wu, J., Mooney, R.: Self-critical reasoning for robust visual question answering. Adv. Neural Inf. Process Syst. 32 (2019)
Clark, C., Yatskar, M., Zettlemoyer, L.: Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In: Proc. Conf. Empir. Methods Nat. Lang. Process & Joint Conf. Nat. Lang. Process, Hong Kong, China, pp. 4069–4082 (2019). https://doi.org/10.18653/v1/D19-1418
Chen, L., Yan, X., Xiao, J., Zhang, H., Pu, S., Zhuang, Y.: Counterfactual samples synthesizing for robust visual question answering. Proc. IEEE Conf. Comput. Vis. Pattern. Recognit. (2020). https://doi.org/10.1109/CVPR42600.2020.01081
Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don't just assume; look and answer: Overcoming priors for visual question answering. Proc. IEEE Conf. Comput. Vis. Pattern. Recognit. (2018). https://doi.org/10.1109/CVPR.2018.00522
Cadene, R., Dancette, C., Cord, M., Parikh, D., et al.: Rubi: Reducing unimodal biases for visual question answering. Adv. Neural Inf. Process Syst. 32 (2019)
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6077–6086 (2018). https://doi.org/10.1109/CVPR.2018.00636
Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. In: Adv. Neural Inf. Process Syst. NIPS’18, pp. 1548–1558. Curran Associates Inc., Red Hook, NY, USA (2018)
Zhu, X., Mao, Z., Liu, C., Zhang, P., Wang, B., Zhang, Y.: Overcoming language priors with self-supervised learning for visual question answering. In: Proc. Int. Joint Conf. Artif. Intell. (2021)
Han, Y., Nie, L., Yin, J., Wu, J., Yan, Y.: Visual perturbation-aware collaborative learning for overcoming the language prior problem. ArXiv abs/2207.11850 (2022)
Teney, D., Abbasnejad, E., Kafle, K., Shrestha, R., Kanan, C., Van Den Hengel, A.: On the value of out-of-distribution testing: an example of goodhart’s law. Adv. Neural. Inf. Process. Syst. 33, 407–417 (2020)
Basu, A., Addepalli, S., Babu, R.V.: Rmlvqa: A margin loss approach for visual question answering with language biases. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11671–11680 (2023)
Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.-S., Wen, J.-R.: Counterfactual VQA: A cause-effect look at language bias. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2021). https://doi.org/10.1109/CVPR46437.2021.01251
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 21–29 (2016). https://doi.org/10.1109/CVPR.2016.10
Kim, J.-H., Jun, J., Zhang, B.-T.: Bilinear attention networks. Adv. Neural Inf. Process Syst. 31 (2018)
Kervadec, C., Antipov, G., Baccouche, M., Wolf, C.: Roses are red, violets are blue... but should VQA expect them to? In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2776–2785 (2021)
Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 10312–10321 (2019). https://doi.org/10.1109/ICCV.2019.01041
Zhao, J., Zhang, X., Wang, X., Yang, Y., Sun, G.: Overcoming language priors in VQA via adding visual module. Neural Comput. Appl. 34(11), 9015–9023 (2022)
Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Mutant: A training paradigm for out-of-distribution generalization in visual question answering. In: Proc. Conf. Empir. Methods Nat. Lang. Process, pp. 878–892 (2020)
Liang, Z., Jiang, W., Hu, H., Zhu, J.: Learning to contrast the counterfactual samples for robust visual question answering. In: Proc. Conf. Empir. Methods Nat. Lang. Process, pp. 3285–3292 (2020)
Guo, Y., Nie, L., Cheng, Z., Tian, Q., Zhang, M.: Loss re-scaling VQA: revisiting the language prior problem from a class-imbalance view. IEEE Trans. Image Process. 31, 227–238 (2022). https://doi.org/10.1109/TIP.2021.3128322
Shrestha, R., Kafle, K., Kanan, C.: A negative case analysis of visual grounding methods for VQA. In: Proc. Assoc. Comput. Linguist., pp. 8172–8181 (2020). https://doi.org/10.18653/v1/2020.acl-main.727
Cao, R., Li, Z.: Overcoming language priors for visual question answering via loss rebalancing label and global context. In: Uncertainty in Artificial Intelligence, pp. 249–259 (2023). PMLR
Wang, D.-B., Feng, L., Zhang, M.-L.: Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. Adv. Neural. Inf. Process. Syst. 34, 11809–11820 (2021)
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: International Conference on Machine Learning, pp. 1321–1330 (2017). PMLR
Zhang, J., Kailkhura, B., Han, T.Y.-J.: Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In: International Conference on Machine Learning, pp. 11117–11128 (2020). PMLR
Gruber, S.G., Buettner, F.: Better uncertainty calibration via proper scores for classification and beyond. In: Advances in Neural Information Processing Systems (2022)
Ghosh, A., Schaaf, T., Gormley, M.R.: Adafocal: Calibration-aware adaptive focal loss. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022). https://openreview.net/forum?id=kUOm0Fdtvh
Gupta, C., Ramdas, A.: Top-label calibration and multiclass-to-binary reductions. In: International Conference on Learning Representations (2022)
Ren, A.Z., Clark, J., Dixit, A., Itkina, M., Majumdar, A., Sadigh, D.: Explore until confident: Efficient exploration for embodied question answering. In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024
Munir, M.A., Khan, M.H., Khan, S., Khan, F.S.: Bridging precision and confidence: A train-time loss for calibrating object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11474–11483 (2023)
Zhu, Q., Zheng, C., Zhang, Z., Shao, W., Zhang, D.: Dynamic confidence-aware multi-modal emotion recognition. IEEE Trans. Affect. Comput. (2023)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Gawlikowski, J., Tassi, C.R.N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R., et al.: A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342 (2021)
Xie, M., Han, Z., Zhang, C., Bai, Y., Hu, Q.: Exploring and exploiting uncertainty for incomplete multi-view classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19873–19882 (2023)
Zhong, Z., Cui, J., Liu, S., Jia, J.: Improving calibration for long-tailed recognition. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16489–16498 (2021)
Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100–5111 (2019)
Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Adv. Neural. Inf. Process. Syst. 35, 32897–32912 (2022)
Pan, Y., Liu, J., Jin, L., Li, Z.: Unbiased visual question answering by leveraging instrumental variable. IEEE Trans. Multimedia, 1–16 (2024). https://doi.org/10.1109/TMM.2024.3355640
Si, Q., Liu, Y., Meng, F., Lin, Z., Fu, P., Cao, Y., Wang, W., Zhou, J.: Towards robust visual question answering: Making the most of biased samples via contrastive learning. In: Conference on Empirical Methods in Natural Language Processing (2022)
Wu, Y., Zhao, Y., Zhao, S., Zhang, Y., Yuan, X., Zhao, G., Jiang, N.: Overcoming language priors in visual question answering via distinguishing superficially similar instances. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 5721–5729 (2022)
Li, Y., Hu, B., Zhang, F., Yu, Y., Liu, J., Chen, Y., Xu, J.: A multi-modal debiasing model with dynamical constraint for robust visual question answering. In: Annual Meeting of the Association for Computational Linguistics (2023)
Shu, X., Yan, S., Yang, X., Wu, Z., Chen, Z., Lu, Z.: Sc-ml: Self-supervised counterfactual metric learning for debiased visual question answering. ArXiv abs/2304.01647 (2023)
Bi, Y., Jiang, H., Hu, Y., Sun, Y., Yin, B.: See and learn more: dense caption-aware representation for visual question answering. IEEE Trans. Circuits Syst. Video Technol. 34(2), 1135–1146 (2024). https://doi.org/10.1109/TCSVT.2023.3291379
Funding
This work was supported by the National Natural Science Foundation of China under Grants 61932009 and 62202139.
Author information
Contributions
Xuesong Zhang and Jun He contributed equally. Jia Li is the corresponding author. Xuesong Zhang was primarily responsible for the code implementation and contributed to writing the initial draft of the manuscript. Jun He wrote the main text of the manuscript. Jia Zhao designed the methodology. Zhenzhen Hu, Xun Yang, and Jia Li refined the methodology and revised the manuscript content. Richang Hong provided experimental resources, supervised and guided the research, and revised the manuscript. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, X., He, J., Zhao, J. et al. Exploring and exploiting model uncertainty for robust visual question answering. Multimedia Systems 30, 348 (2024). https://doi.org/10.1007/s00530-024-01560-0
DOI: https://doi.org/10.1007/s00530-024-01560-0