DINOv2: Learning Robust Visual Features without Supervision

@article{Oquab2023DINOv2LR,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Maxime Oquab and Timoth{\'e}e Darcet and Th{\'e}o Moutakanni and Huy Q. Vo and Marc Szafraniec and Vasil Khalidov and Pierre Fernandez and Daniel Haziza and Francisco Massa and Alaaeldin El-Nouby and Mahmoud Assran and Nicolas Ballas and Wojciech Galuba and Russ Howes and Po-Yao (Bernie) Huang and Shang-Wen Li and Ishan Misra and Michael G. Rabbat and Vasu Sharma and Gabriel Synnaeve and Huijiao Xu and Herv{\'e} J{\'e}gou and Julien Mairal and Patrick Labatut and Armand Joulin and Piotr Bojanowski},
  journal={ArXiv},
  year={2023},
  volume={abs/2304.07193},
  url={https://api.semanticscholar.org/CorpusID:258170077}
}
This work revisits existing approaches and combines different techniques to scale pretraining in terms of data and model size, and proposes an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of the uncurated data typically used in the self-supervised literature.
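
For concreteness, the retrieval-style curation idea can be sketched as follows: images from an uncurated pool are kept when their self-supervised embeddings are nearest neighbours of a curated seed set. The function name, top-k value, and random stand-in embeddings below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of retrieval-based data curation (illustrative, not DINOv2's
# actual pipeline): keep uncurated images whose embeddings are close to a
# curated seed set.
import numpy as np

def curate(pool_emb: np.ndarray, seed_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of pool images retrieved as nearest neighbours of seeds.

    pool_emb: (N, d) L2-normalized embeddings of the uncurated pool.
    seed_emb: (M, d) L2-normalized embeddings of the curated seed dataset.
    """
    sims = seed_emb @ pool_emb.T                # (M, N) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]   # top-k pool images per seed image
    return np.unique(nn_idx)                    # deduplicated retrieved indices

# Usage with random stand-in embeddings
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 128)); pool /= np.linalg.norm(pool, axis=1, keepdims=True)
seed = rng.normal(size=(50, 128));   seed /= np.linalg.norm(seed, axis=1, keepdims=True)
kept = curate(pool, seed)
print(f"retained {kept.size} of {pool.shape[0]} pool images")
```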

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

This work successfully trains a CLIP-like model with only a fraction of CLIP's computational cost while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.

SILC: Improving Vision Language Pretraining with Self-Distillation

SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation, significantly improving performance on dense prediction tasks such as detection and segmentation while also providing gains on image-level tasks such as classification and retrieval.

Accessing Vision Foundation Models at ImageNet-level Costs

This work offers a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data, removing the design choices of conventional knowledge distillation that introduce dataset bias.

Learning Vision from Models Rivals Learning Vision from Data

SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, is introduced, competing favorably with other general-purpose visual representation learners such as CLIP and DINOv2 on image classification tasks.

ComFe: Interpretable Image Classifiers With Foundation Models

ComFe is, to the authors' knowledge, the first interpretable approach that can be applied at the scale of datasets such as ImageNet-1K; it provides improved robustness over non-interpretable methods and outperforms previous interpretable approaches on key benchmark datasets.

Application specificity of data for pre-training in computer vision

These findings indicate that front-end embeddings generalize learned image features largely independent of data composition, leaving transfer learning to inject the majority of application-specific understanding into the model; this suggests that the target data is the primary driver of application specificity.

Segment Anything Model is a Good Teacher for Local Feature Learning

This paper designs an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which distills feature relations carrying the category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination.
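
As a rough illustration of relation distillation in this spirit, the sketch below matches pairwise feature relations between a frozen teacher encoder (e.g., a SAM image encoder) and a student local-feature network, weighted by an attention map. The loss form and weighting scheme are assumptions made for illustration, not the exact ASRD formulation.

```python
# Hedged sketch of attention-weighted relation distillation: compare pairwise
# cosine-similarity structure of student and teacher descriptors, weighting
# each relation by the attention of both locations. Illustrative only.
import torch
import torch.nn.functional as F

def relation_distill_loss(student_feat, teacher_feat, attn):
    """student_feat: (N, d_s), teacher_feat: (N, d_t) descriptors at matched
    locations; attn: (N,) non-negative attention weights over the locations."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    rel_s = s @ s.T                       # (N, N) student pairwise relations
    rel_t = t @ t.T                       # (N, N) teacher pairwise relations
    w = attn[:, None] * attn[None, :]     # weight relation (i, j) by both attentions
    return (w * (rel_s - rel_t) ** 2).sum() / w.sum().clamp_min(1e-8)

# Toy usage: relations are N x N, so student and teacher dims may differ
loss = relation_distill_loss(torch.randn(32, 128), torch.randn(32, 256), torch.rand(32))
```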

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

An extensive investigation into the effectiveness of different vision encoders within MLLMs reveals that the shallow-layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding, motivating a simple yet effective feature-merging strategy, named COMM, that enhances the visual capabilities of MLLMs.
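
A feature-merging strategy of this general kind can be sketched, under assumptions, as a learnable weighted combination of token features from several encoder layers projected into the language model's input space; the class below is illustrative, not the exact COMM design.

```python
# Hedged sketch of merging visual token features from multiple encoder layers
# before feeding them to an MLLM. Learnable layer weights and a single linear
# projection are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureMerger(nn.Module):
    def __init__(self, num_layers: int, in_dim: int, out_dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, layer_feats):
        # layer_feats: list of (B, T, in_dim) token features, one per encoder layer
        w = torch.softmax(self.layer_weights, dim=0)
        merged = sum(wi * f for wi, f in zip(w, layer_feats))
        return self.proj(merged)  # (B, T, out_dim) visual tokens for the language model

# Toy usage: merge 4 layers of 1024-d tokens into 4096-d LLM inputs
merger = FeatureMerger(num_layers=4, in_dim=1024, out_dim=4096)
tokens = merger([torch.randn(2, 196, 1024) for _ in range(4)])
```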

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

This work proposes a simple task-oriented knowledge transfer approach that outperforms task-agnostic VFM distillation, ImageNet pretraining, and DINO pretraining, and introduces a retrieval-augmented knowledge transfer strategy that uses web-scale image retrieval to curate effective transfer sets.
...

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
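
A minimal sketch of this contrastive objective: image and text embeddings of matching pairs are pulled together through a symmetric cross-entropy over cosine similarities. Encoders and data loading are omitted, and shapes and the temperature value are illustrative.

```python
# Sketch of contrastive image-text pre-training: predict which caption goes
# with which image via symmetric cross-entropy over a similarity matrix.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature: float = 0.07):
    """image_emb, text_emb: (B, d) embeddings of B aligned (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # matching pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # image -> text classification
    loss_t = F.cross_entropy(logits.T, targets)     # text -> image classification
    return (loss_i + loss_t) / 2

# Toy usage with random stand-in embeddings
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```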

Self-supervised Pretraining of Visual Features in the Wild

This work explores whether self-supervision lives up to its expectations by training large models on random, uncurated images with no supervision, and observes that self-supervised models are good few-shot learners.

Unsupervised Pre-Training of Image Features on Non-Curated Data

This work proposes a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data, and validates the approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks.
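
A minimal sketch of the clustering side of such an approach: features are grouped with k-means and the cluster assignments serve as pseudo-labels for a classification loss. The random stand-in features and k-means settings are illustrative assumptions, not the paper's large-scale setup.

```python
# Hedged sketch of clustering-based pseudo-labels for unsupervised pre-training:
# cluster features, then train a classifier (and, in practice, the encoder)
# against the cluster assignments.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

feats = np.random.randn(512, 128).astype(np.float32)         # stand-in for encoder features
pseudo_labels = KMeans(n_clusters=10, n_init=10).fit_predict(feats)

classifier = torch.nn.Linear(128, 10)                          # trained against pseudo-labels
logits = classifier(torch.from_numpy(feats))
loss = F.cross_entropy(logits, torch.from_numpy(pseudo_labels).long())
```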

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

This study shows that denoising autoencoders, such as BEiT or a variant that is introduced in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings.

A Simple Recipe for Competitive Low-compute Self-supervised Vision Models

The main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model; the method is called Replace one Branch (RoB) because it simply replaces one branch of the joint-embedding training with a large teacher model.
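
A minimal sketch of this idea, assuming a BYOL/DINO-style joint-embedding setup: the second branch is a large frozen teacher, and the small student (plus a projection head) is trained to match it under a cosine loss. The loss form and the linear stand-ins for the encoders are illustrative, not the paper's exact recipe.

```python
# Hedged sketch of "replace one branch": a joint-embedding objective where one
# branch is a frozen self-supervised teacher and the other is a small student.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rob_loss(student: nn.Module, head: nn.Module, teacher: nn.Module,
             view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
    """view_a, view_b: two augmented views of the same batch of images."""
    with torch.no_grad():                            # teacher branch is frozen
        target = F.normalize(teacher(view_b), dim=-1)
    pred = F.normalize(head(student(view_a)), dim=-1)
    return (2 - 2 * (pred * target).sum(dim=-1)).mean()   # cosine distance

# Toy usage with linear stand-ins for the encoders
student, head, teacher = nn.Linear(32, 64), nn.Linear(64, 128), nn.Linear(32, 128)
loss = rob_loss(student, head, teacher, torch.randn(4, 32), torch.randn(4, 32))
```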

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform its training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

Unsupervised Representation Learning by Predicting Image Rotations

This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
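
The pretext task is easy to sketch: generate rotated copies of each image at 0/90/180/270 degrees and train the network to classify which rotation was applied. The tiny CNN below is a stand-in for the ConvNets used in the paper.

```python
# Sketch of rotation prediction as a pretext task: the supervisory signal comes
# entirely from the rotations applied to unlabeled images.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(images: torch.Tensor):
    """images: (B, C, H, W). Returns rotated copies and their rotation labels."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

# Toy stand-in network with a 4-way rotation classification head
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
x = torch.randn(8, 3, 32, 32)
rot_x, rot_y = rotate_batch(x)
loss = F.cross_entropy(net(rot_x), rot_y)
```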

LAION-5B: An open large-scale dataset for training next generation image-text models

This work presents LAION-5B, a dataset of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B have English-language text; it shows successful replication and fine-tuning of foundational models such as CLIP, GLIDE, and Stable Diffusion using the dataset, and discusses further experiments enabled by an openly available dataset of this scale.

Benchmarking Representation Learning for Natural World Image Collections

It is found that features produced by standard supervised methods still outperform those produced by self-supervised approaches such as SimCLR; however, improved self-supervised learning methods are constantly being released, and the iNat2021 and NeWT datasets are a valuable resource for tracking their progress.

DeiT III: Revenge of the ViT

This paper revisits the supervised training of ViTs, building upon and simplifying a recipe introduced for training ResNet-50, and includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning.
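
A hedged sketch of a three-augmentation pipeline in this spirit is shown below, using torchvision; the choice of grayscale/solarization/blur and the magnitudes are illustrative assumptions rather than the paper's tuned recipe.

```python
# Illustrative "three augmentations" pipeline: each image gets one of
# {grayscale, solarization, Gaussian blur}, plus standard crop/flip and color
# jitter. Probabilities and magnitudes are placeholders.
from torchvision import transforms

three_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomChoice([
        transforms.RandomGrayscale(p=1.0),
        transforms.RandomSolarize(threshold=128, p=1.0),
        transforms.GaussianBlur(kernel_size=9),
    ]),
    transforms.ColorJitter(0.3, 0.3, 0.3),
    transforms.ToTensor(),
])
```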
...