Abstract
Vision and Language (VL) models have achieved remarkable performance on a variety of multimodal learning tasks. Their success is attributed to learning a joint, aligned representation space for visual and textual inputs. However, popular recent VL models still struggle with concept understanding beyond a bag of objects in images and texts: they perform poorly at compositional reasoning about relationships between objects, attributes, and word order. To address these issues, we create a synthetic multimodal counterfactual dataset (COCO-CF) and propose a novel contrastive learning framework (COMO). COCO-CF is generated automatically from MS-COCO by injecting concepts with off-the-shelf language models and diffusion models, reducing the bag-of-objects bias. COMO effectively leverages COCO-CF by treating the counterfactual samples as hard negatives and reweighting their importance during contrastive learning. Extensive experiments and ablations show that COMO achieves a significant improvement in VL concept understanding on two benchmarks, VL-Checklist and Winoground, over five strong VL baselines evaluated in the zero-shot setting. The dataset is available at https://github.com/laichengen/COMO.
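As a rough illustration of the idea of treating counterfactual captions as reweighted hard negatives during contrastive learning, the sketch below extends a CLIP-style InfoNCE loss with one extra negative per image. It is a minimal sketch under stated assumptions (a dual encoder producing L2-normalized embeddings); the function name, the log-weight trick, and the hyperparameters (temperature, cf_weight) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumption: a CLIP-style dual encoder with L2-normalized
# embeddings). Each image i comes with its matching caption and one
# counterfactual caption; the latter is appended to the softmax denominator
# as an up-weighted hard negative.
import torch
import torch.nn.functional as F

def counterfactual_infonce(img_emb, txt_emb, cf_txt_emb,
                           temperature=0.07, cf_weight=2.0):
    """img_emb, txt_emb, cf_txt_emb: (B, D) L2-normalized embeddings.

    cf_txt_emb[i] encodes the counterfactual caption generated for image i.
    Adding log(cf_weight) to its logit multiplies its exp-similarity by
    cf_weight in the softmax denominator, i.e. it reweights that hard negative.
    """
    logits = img_emb @ txt_emb.t() / temperature                            # (B, B) in-batch pairs
    cf_logit = (img_emb * cf_txt_emb).sum(-1, keepdim=True) / temperature   # (B, 1) counterfactual pair
    cf_logit = cf_logit + torch.log(torch.full_like(cf_logit, cf_weight))   # up-weight the hard negative
    all_logits = torch.cat([logits, cf_logit], dim=1)                       # (B, B+1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)          # positive is the diagonal
    return F.cross_entropy(all_logits, targets)

# Toy usage with random embeddings (for illustration only).
B, D = 8, 512
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
cf_txt = F.normalize(torch.randn(B, D), dim=-1)
loss = counterfactual_infonce(img, txt, cf_txt)
```

Setting cf_weight above 1 makes the model pay more attention to distinguishing an image's true caption from its minimally edited counterfactual than from random in-batch captions; the actual reweighting scheme used in COMO may differ.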
References
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: Vqa: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Basu, A., Addepalli, S., Babu, R.V.: Rmlvqa: a margin loss approach for visual question answering with language biases. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11671–11680 (2023)
Cascante-Bonilla, P., et al.: Going beyond nouns with vision & language models using synthetic data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20155–20165 (2023)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Chen, Y., et al.: Revisiting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15095–15104 (2023)
Chen, Y., et al.: Vilem: visual-language error modeling for image-text retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11018–11027 (2023)
Chuang, C.Y., Robinson, J., Lin, Y.C., Torralba, A., Jegelka, S.: Debiased contrastive learning. Adv. Neural. Inf. Process. Syst. 33, 8765–8775 (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)
Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. Adv. Neural Inf. Process. Syst. 27 (2014)
Doveh, S., et al.: Dense and aligned captions (dac) promote compositional reasoning in vl models. In: NeurIPS (2023)
Doveh, S., et al.: Teaching structured vision & language concepts to vision & language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2657–2668 (2023)
Geirhos, R., et al.: Shortcut learning in deep neural networks. Nat. Mach. Intell. 2(11), 665–673 (2020)
Goel, S., Bansal, H., Bhatia, S., Rossi, R., Vinay, V., Grover, A.: Cyclip: cyclic contrastive language-image pretraining. Adv. Neural. Inf. Process. Syst. 35, 6704–6719 (2022)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 2, pp. 1735–1742. IEEE (2006)
He, R., et al.: Is synthetic data from generative models ready for image recognition? In: The Eleventh International Conference on Learning Representations (2022)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
Kalantidis, Y., Sariyildiz, M.B., Pion, N., Weinzaepfel, P., Larlus, D.: Hard negative mixing for contrastive learning. Adv. Neural. Inf. Process. Syst. 33, 21798–21809 (2020)
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D.: Supervised contrastive learning. Adv. Neural. Inf. Process. Syst. 33, 18661–18673 (2020)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123, 32–73 (2017)
Le, T., Lal, V., Howard, P.: Coco-counterfactuals: automatically constructed counterfactual examples for image-text pairs. arXiv preprint arXiv:2309.14356 (2023)
Li, C., et al.: Elevater: a benchmark and toolkit for evaluating language-augmented visual models. Adv. Neural. Inf. Process. Syst. 35, 9287–9301 (2022)
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Li, Y.L., et al.: Hake: human activity knowledge engine. arXiv preprint arXiv:1904.06539 (2019)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Miller, G.A.: Wordnet: a lexical database for english. Commun. ACM 38(11), 39–41 (1995)
Moltisanti, D., Keller, F., Bilen, H., Sevilla-Lara, L.: Learning action changes by measuring verb-adverb textual relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23110–23118 (2023)
Momeni, L., Caron, M., Nagrani, A., Zisserman, A., Schmid, C.: Verbs in action: improving verb understanding in video-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15579–15591 (2023)
Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Pham, K., Kafle, K., Lin, Z., Ding, Z., Cohen, S., Tran, Q., Shrivastava, A.: Learning to predict visual attributes in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13018–13028 (2021)
Pratt, S., Yatskar, M., Weihs, L., Farhadi, A., Kembhavi, A.: Grounded situation recognition. In: European Conference on Computer Vision, pp. 314–332 (2020)
Radenovic, F., et al.: Filtering, distillation, and hard negatives for vision-language pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6967–6977 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
Robinson, J.D., Chuang, C.Y., Sra, S., Jegelka, S.: Contrastive learning with hard negative samples. In: International Conference on Learning Representations (2020)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Roth, K., Kim, J.M., Koepke, A.S., Vinyals, O., Schmid, C., Akata, Z.: Waffling around for performance: visual classification with random words and broad concepts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15746–15757 (2023)
Schuhmann, C., et al.: Laion-400m: open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
Shi, C., Yang, S.: Logoprompt: synthetic text images can be good visual prompts for vision-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2932–2941 (2023)
Shi, H., Mao, J., Xiao, T., Jiang, Y., Sun, J.: Learning visually-grounded semantics from contrastive adversarial samples. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 3715–3727 (2018)
Smith, J.S., et al.: Construct-vl: data-free continual structured vl concepts learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14994–15004 (2023)
Thrush, T., et al.: Winoground: probing vision and language models for visio-linguistic compositionality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5238–5248 (2022)
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning? Adv. Neural. Inf. Process. Syst. 33, 6827–6839 (2020)
Trabucco, B., Doherty, K., Gurinas, M., Salakhutdinov, R.: Effective data augmentation with diffusion models. In: ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (2023)
Wang, W., Yang, Z., Xu, B., Li, J., Sun, Y.: Vilta: enhancing vision-language pre-training through textual augmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3158–3169 (2023)
Wang, Z., Gao, Z., Guo, K., Yang, Y., Wang, X., Shen, H.T.: Multilateral semantic relations modeling for image text retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2830–2839 (2023)
Wei, J., Zou, K.: Eda: easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6382–6388 (2019)
Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: encoding visual concepts into textual embeddings for customized text-to-image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15943–15953 (2023)
Wu, H., et al.: Unified visual-semantic embeddings: bridging vision and language with structured meaning representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6609–6618 (2019)
Wu, Y., Wei, Y., Wang, H., Liu, Y., Yang, S., He, X.: Grounded image text matching with mismatched relation reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2976–2987 (2023)
Xie, C.W., Sun, S., Xiong, X., Zheng, Y., Zhao, D., Zhou, J.: Ra-clip: retrieval augmented contrastive language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19265–19274 (2023)
Yang, J., et al.: Vision-language pre-training with triple contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15671–15680 (2022)
Yang, K., et al.: Alip: adaptive language-image pre-training with synthetic caption. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2922–2931 (2023)
Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: The Eleventh International Conference on Learning Representations (2022)
Zhang, X., Zhang, F., Xu, C.: Vqacl: a novel visual question answering continual learning setting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19102–19112 (2023)
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 28 (2015)
Zhao, T., et al.: Vl-checklist: evaluating pre-trained vision-language models with objects, attributes and relations. arXiv preprint arXiv:2207.00221 (2022)
Zhen, L., Hu, P., Wang, X., Peng, D.: Deep supervised cross-modal retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10394–10403 (2019)
Zhong, Y., et al.: Regionclip: region-based language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16793–16803 (2022)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13001–13008 (2020)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 62306220).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lai, C., Song, S., Yan, S., Hu, G. (2025). Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15127. Springer, Cham. https://doi.org/10.1007/978-3-031-72890-7_11