SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Vani, Ankit; Nguyen, Bac; Lavoie, Samuel; Krishna, Ranjay; Courville, Aaron

doi:10.1007/978-3-031-72848-8_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15124)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose Sparo, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using Sparo with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using Sparo, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to \(+14\%\) for ImageNet, \(+4\%\) for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (\(+3\%\) each). We also showcase a powerful ability to intervene and select individual Sparo concepts to further improve downstream task performance (up from \(+4\%\) to \(+9\%\) for SugarCrepe) and use this ability to study the robustness of Sparo’s representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
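
The read-out idea in the abstract can be illustrated with a small sketch. The abstract only states that the encoding is partitioned into slots and that each slot comes from a single attention head. Everything else here is an assumption made for illustration: the per-slot query vector, the key and value projections, the random weights standing in for learned ones, the dimensions, and the names SlotReadout and select_slots. Zeroing unselected slots is likewise just one plausible way to "select" concepts. This is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random projection; stands in for learned weights (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SlotReadout:
    """Hypothetical read-out: one single-head attention per slot."""

    def __init__(self, num_slots, token_dim, head_dim, slot_dim, seed=0):
        rng = random.Random(seed)
        self.head_dim = head_dim
        self.slot_dim = slot_dim
        # Each slot owns its query and projections, so slots attend separately.
        self.heads = [
            {
                "query": [rng.gauss(0.0, 1.0) for _ in range(head_dim)],
                "w_key": random_matrix(head_dim, token_dim, rng),
                "w_value": random_matrix(slot_dim, token_dim, rng),
            }
            for _ in range(num_slots)
        ]

    def __call__(self, tokens):
        """Map backbone tokens to (flat encoding, per-slot attention weights)."""
        encoding, all_weights = [], []
        for head in self.heads:
            keys = [matvec(head["w_key"], t) for t in tokens]
            values = [matvec(head["w_value"], t) for t in tokens]
            # Scaled dot-product scores of this slot's query against every token.
            scores = [
                sum(q * k for q, k in zip(head["query"], key))
                / math.sqrt(self.head_dim)
                for key in keys
            ]
            weights = softmax(scores)
            # The slot is the attention-weighted average of the value vectors.
            slot = [
                sum(w * v[d] for w, v in zip(weights, values))
                for d in range(self.slot_dim)
            ]
            encoding.extend(slot)  # the encoding is the concatenation of slots
            all_weights.append(weights)
        return encoding, all_weights


def select_slots(encoding, slot_dim, keep):
    """Zero every slot whose index is not in `keep` (a simple intervention)."""
    keep = set(keep)
    return [x if (i // slot_dim) in keep else 0.0 for i, x in enumerate(encoding)]


# Usage: 5 tokens of width 8 -> 4 slots of width 3 -> 12-dimensional encoding.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
readout = SlotReadout(num_slots=4, token_dim=8, head_dim=4, slot_dim=3)
encoding, weights = readout(tokens)
masked = select_slots(encoding, slot_dim=3, keep=[0, 2])
```

Because each slot occupies a fixed block of the encoding, keeping or dropping a slot is a simple index operation. In this sketch, that property is what makes per-concept intervention straightforward.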

Notes

1. Source code: https://github.com/ankitkv/sparo-clip.

Acknowledgments

This research was funded by Sony and enabled in part by compute resources provided by the Digital Research Alliance of Canada, Mila, and Sony.

Author information

Authors and Affiliations

Mila, Université de Montréal, Montreal, Canada
Ankit Vani, Samuel Lavoie & Aaron Courville
Sony AI, Stuttgart, Germany
Bac Nguyen
University of Washington, Allen Institute for Artificial Intelligence, Seattle, USA
Ranjay Krishna
CIFAR AI Chair, Montreal, Canada
Aaron Courville

Authors

Ankit Vani
Bac Nguyen
Samuel Lavoie
Ranjay Krishna
Aaron Courville

Corresponding author

Correspondence to Ankit Vani.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7191 KB)

Cite this paper

Vani, A., Nguyen, B., Lavoie, S., Krishna, R., Courville, A. (2025). SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_14

DOI: https://doi.org/10.1007/978-3-031-72848-8_14
Published: 29 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72847-1
Online ISBN: 978-3-031-72848-8
eBook Packages: Computer Science, Computer Science (R0)
