Abstract
Large language models (LLMs) exhibit exceptional reasoning capabilities and play a significant role in knowledge-based visual question answering (VQA) systems. By conditioning on in-context examples and task-specific prompts, they develop a comprehensive understanding of input questions and provide answers relevant to the context. However, this reliance on in-context examples makes LLMs susceptible to inheriting the dataset biases present in context descriptions and the provided examples. New methods are therefore needed to ensure that LLMs deliver unbiased yet contextually relevant responses. To tackle this challenge, we present GRAph-based Contextual DEbiasing (GRACE), a novel graph-based method for debiasing knowledge-based VQA models. The approach consists of two novel and generally applicable components. First, we propose an unsupervised context graph learning method that combats bias by explicitly constructing a balanced context graph under the guidance of fairness constraints. Second, building upon the context graph, we consider both semantic features and reasoning processes to enhance prompting with more relevant and diverse in-context examples. Through extensive experiments on both in-distribution (OK-VQA) and out-of-distribution (VQA-CP, GQA-OOD) datasets, we demonstrate the effectiveness of GRACE in mitigating biases and achieving generalization. Additionally, analyses of model performance across gender groups demonstrate GRACE’s potential impact on social equity. Our source code is publicly available at https://github.com/SuperJohnZhang/ContextGraphKVQA.
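To make the second component concrete, below is a minimal, self-contained Python sketch of fairness-constrained in-context example selection over a candidate pool. It is an illustration under our own assumptions rather than the paper's released implementation: the function select_examples, the per_group_cap balance constraint, and the random embeddings are hypothetical stand-ins for GRACE's learned context graph and semantic features.

# Hypothetical sketch, not the authors' code: greedily pick k in-context
# examples that are relevant to the query yet balanced across answer groups,
# approximating a fairness constraint over a graph of candidate examples.
import numpy as np

def cosine(a, b):
    return float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + 1e-8)

def select_examples(query_emb, cand_embs, cand_groups, k=8,
                    per_group_cap=2, relevance_weight=0.7):
    chosen, group_counts = [], {}
    for _ in range(k):
        best_idx, best_score = None, float("-inf")
        for i, emb in enumerate(cand_embs):
            if i in chosen:
                continue
            group = cand_groups[i]
            if group_counts.get(group, 0) >= per_group_cap:
                continue  # balance constraint: cap examples per answer group
            relevance = cosine(query_emb, emb)
            # Diversity term: penalize similarity to already-chosen examples.
            redundancy = max((cosine(emb, cand_embs[j]) for j in chosen),
                             default=0.0)
            score = (relevance_weight * relevance
                     - (1.0 - relevance_weight) * redundancy)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx is None:
            break  # every remaining group has hit its cap
        chosen.append(best_idx)
        group_counts[cand_groups[best_idx]] = group_counts.get(cand_groups[best_idx], 0) + 1
    return chosen

# Toy usage with random stand-in embeddings and answer-group labels.
rng = np.random.default_rng(0)
candidate_embs = rng.normal(size=(100, 64))
answer_groups = rng.integers(0, 5, size=100).tolist()
print(select_examples(rng.normal(size=64), candidate_embs, answer_groups))

The greedy trade-off between relevance and redundancy is one simple way to realize the "relevant and diverse" objective described above; the actual GRACE method additionally learns the context graph structure itself under fairness constraints.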
Acknowledgment
This work is supported by NSF Grants 2143197 and 2227450.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Y., Jiang, M., Zhao, Q. (2025). GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15075. Springer, Cham. https://doi.org/10.1007/978-3-031-72643-9_11
DOI: https://doi.org/10.1007/978-3-031-72643-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72642-2
Online ISBN: 978-3-031-72643-9
eBook Packages: Computer Science, Computer Science (R0)