DINOv2: Learning Robust Visual Features without Supervision

@article{Oquab2023DINOv2LR,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Maxime Oquab and Timoth{\'e}e Darcet and Th{\'e}o Moutakanni and Huy Q. Vo and Marc Szafraniec and Vasil Khalidov and Pierre Fernandez and Daniel Haziza and Francisco Massa and Alaaeldin El-Nouby and Mahmoud Assran and Nicolas Ballas and Wojciech Galuba and Russ Howes and Po-Yao (Bernie) Huang and Shang-Wen Li and Ishan Misra and Michael G. Rabbat and Vasu Sharma and Gabriel Synnaeve and Huijiao Xu and Herv{\'e} J{\'e}gou and Julien Mairal and Patrick Labatut and Armand Joulin and Piotr Bojanowski},
  journal={ArXiv},
  year={2023},
  volume={abs/2304.07193},
  url={https://api.semanticscholar.org/CorpusID:258170077}
}
This work revisits existing approaches and combines different techniques to scale pretraining in terms of data and model size, and proposes an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of the uncurated data typically used in the self-supervised literature.
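
For concreteness, the retrieval-style curation idea can be sketched as follows: images from an uncurated pool are kept when their self-supervised embeddings are nearest neighbours of a curated seed set. The function name, top-k value, and random stand-in embeddings below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of retrieval-based data curation (illustrative, not DINOv2's
# actual pipeline): keep uncurated images whose embeddings are close to a
# curated seed set.
import numpy as np

def curate(pool_emb: np.ndarray, seed_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of pool images retrieved as nearest neighbours of seeds.

    pool_emb: (N, d) L2-normalized embeddings of the uncurated pool.
    seed_emb: (M, d) L2-normalized embeddings of the curated seed dataset.
    """
    sims = seed_emb @ pool_emb.T                # (M, N) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]   # top-k pool images per seed image
    return np.unique(nn_idx)                    # deduplicated retrieved indices

# Usage with random stand-in embeddings
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 128)); pool /= np.linalg.norm(pool, axis=1, keepdims=True)
seed = rng.normal(size=(50, 128));   seed /= np.linalg.norm(seed, axis=1, keepdims=True)
kept = curate(pool, seed)
print(f"retained {kept.size} of {pool.shape[0]} pool images")
```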

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

This work successfully trains a CLIP-like model with only a fraction of CLIP's computational cost while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.

SILC: Improving Vision Language Pretraining with Self-Distillation

SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation, significantly improving performance on dense prediction tasks such as detection and segmentation while also providing gains on image-level tasks such as classification and retrieval.

Accessing Vision Foundation Models at ImageNet-level Costs

This work offers a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data, removing the design choices of conventional knowledge distillation that introduce dataset bias.

Learning Vision from Models Rivals Learning Vision from Data

SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, is introduced, competing favorably with other general-purpose visual representation learners such as CLIP and DINOv2 on image classification tasks.

ComFe: Interpretable Image Classifiers With Foundation Models

ComFe is, to the authors' knowledge, the first interpretable approach that can be applied at the scale of datasets such as ImageNet-1K; it provides improved robustness over non-interpretable methods and outperforms previous interpretable approaches on key benchmark datasets.

Application specificity of data for pre-training in computer vision

These findings indicate that front-end embeddings generalize learned image features largely independent of data composition, leaving transfer learning to inject the majority of application-specific understanding into the model; this suggests that the target data is the primary driver of application specificity.

Segment Anything Model is a Good Teacher for Local Feature Learning

This paper designs an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which distills feature relations carrying the category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination.
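
As a rough illustration of relation distillation in this spirit, the sketch below matches pairwise feature relations between a frozen teacher encoder (e.g., a SAM image encoder) and a student local-feature network, weighted by an attention map. The loss form and weighting scheme are assumptions made for illustration, not the exact ASRD formulation.

```python
# Hedged sketch of attention-weighted relation distillation: compare pairwise
# cosine-similarity structure of student and teacher descriptors, weighting
# each relation by the attention of both locations. Illustrative only.
import torch
import torch.nn.functional as F

def relation_distill_loss(student_feat, teacher_feat, attn):
    """student_feat: (N, d_s), teacher_feat: (N, d_t) descriptors at matched
    locations; attn: (N,) non-negative attention weights over the locations."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    rel_s = s @ s.T                       # (N, N) student pairwise relations
    rel_t = t @ t.T                       # (N, N) teacher pairwise relations
    w = attn[:, None] * attn[None, :]     # weight relation (i, j) by both attentions
    return (w * (rel_s - rel_t) ** 2).sum() / w.sum().clamp_min(1e-8)

# Toy usage: relations are N x N, so student and teacher dims may differ
loss = relation_distill_loss(torch.randn(32, 128), torch.randn(32, 256), torch.rand(32))
```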

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

An extensive investigation into the effectiveness of different vision encoders within MLLMs reveals that the shallow-layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding, motivating a simple yet effective feature-merging strategy, named COMM, that enhances the visual capabilities of MLLMs.
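
A feature-merging strategy of this general kind can be sketched, under assumptions, as a learnable weighted combination of token features from several encoder layers projected into the language model's input space; the class below is illustrative, not the exact COMM design.

```python
# Hedged sketch of merging visual token features from multiple encoder layers
# before feeding them to an MLLM. Learnable layer weights and a single linear
# projection are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureMerger(nn.Module):
    def __init__(self, num_layers: int, in_dim: int, out_dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, layer_feats):
        # layer_feats: list of (B, T, in_dim) token features, one per encoder layer
        w = torch.softmax(self.layer_weights, dim=0)
        merged = sum(wi * f for wi, f in zip(w, layer_feats))
        return self.proj(merged)  # (B, T, out_dim) visual tokens for the language model

# Toy usage: merge 4 layers of 1024-d tokens into 4096-d LLM inputs
merger = FeatureMerger(num_layers=4, in_dim=1024, out_dim=4096)
tokens = merger([torch.randn(2, 196, 1024) for _ in range(4)])
```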

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

This work proposes a simple task-oriented knowledge transfer approach that outperforms task-agnostic VFM distillation, ImageNet pretraining, and DINO pretraining, and introduces a retrieval-augmented knowledge transfer strategy that uses web-scale image retrieval to curate effective transfer sets.
...

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
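
A minimal sketch of this contrastive objective: image and text embeddings of matching pairs are pulled together through a symmetric cross-entropy over cosine similarities. Encoders and data loading are omitted, and shapes and the temperature value are illustrative.

```python
# Sketch of contrastive image-text pre-training: predict which caption goes
# with which image via symmetric cross-entropy over a similarity matrix.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature: float = 0.07):
    """image_emb, text_emb: (B, d) embeddings of B aligned (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # matching pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # image -> text classification
    loss_t = F.cross_entropy(logits.T, targets)     # text -> image classification
    return (loss_i + loss_t) / 2

# Toy usage with random stand-in embeddings
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```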

Self-supervised Pretraining of Visual Features in the Wild

This work explores whether self-supervision lives up to its expectations by training large models on random, uncurated images with no supervision, and observes that self-supervised models are good few-shot learners.

Unsupervised Pre-Training of Image Features on Non-Curated Data

This work proposes a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data, and validates the approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks.
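
A minimal sketch of the clustering side of such an approach: features are grouped with k-means and the cluster assignments serve as pseudo-labels for a classification loss. The random stand-in features and k-means settings are illustrative assumptions, not the paper's large-scale setup.

```python
# Hedged sketch of clustering-based pseudo-labels for unsupervised pre-training:
# cluster features, then train a classifier (and, in practice, the encoder)
# against the cluster assignments.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

feats = np.random.randn(512, 128).astype(np.float32)         # stand-in for encoder features
pseudo_labels = KMeans(n_clusters=10, n_init=10).fit_predict(feats)

classifier = torch.nn.Linear(128, 10)                          # trained against pseudo-labels
logits = classifier(torch.from_numpy(feats))
loss = F.cross_entropy(logits, torch.from_numpy(pseudo_labels).long())
```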

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

This study shows that denoising autoencoders, such as BEiT or a variant that is introduced in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings.

A Simple Recipe for Competitive Low-compute Self-supervised Vision Models

The main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model; the method is called Replace one Branch (RoB) because it simply replaces one branch of the joint-embedding training with a large teacher model.
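
A minimal sketch of this idea, assuming a BYOL/DINO-style joint-embedding setup: the second branch is a large frozen teacher, and the small student (plus a projection head) is trained to match it under a cosine loss. The loss form and the linear stand-ins for the encoders are illustrative, not the paper's exact recipe.

```python
# Hedged sketch of "replace one branch": a joint-embedding objective where one
# branch is a frozen self-supervised teacher and the other is a small student.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rob_loss(student: nn.Module, head: nn.Module, teacher: nn.Module,
             view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
    """view_a, view_b: two augmented views of the same batch of images."""
    with torch.no_grad():                            # teacher branch is frozen
        target = F.normalize(teacher(view_b), dim=-1)
    pred = F.normalize(head(student(view_a)), dim=-1)
    return (2 - 2 * (pred * target).sum(dim=-1)).mean()   # cosine distance

# Toy usage with linear stand-ins for the encoders
student, head, teacher = nn.Linear(32, 64), nn.Linear(64, 128), nn.Linear(32, 128)
loss = rob_loss(student, head, teacher, torch.randn(4, 32), torch.randn(4, 32))
```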

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform its training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

Unsupervised Representation Learning by Predicting Image Rotations

This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
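
The pretext task is easy to sketch: generate rotated copies of each image at 0/90/180/270 degrees and train the network to classify which rotation was applied. The tiny CNN below is a stand-in for the ConvNets used in the paper.

```python
# Sketch of rotation prediction as a pretext task: the supervisory signal comes
# entirely from the rotations applied to unlabeled images.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(images: torch.Tensor):
    """images: (B, C, H, W). Returns rotated copies and their rotation labels."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

# Toy stand-in network with a 4-way rotation classification head
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
x = torch.randn(8, 3, 32, 32)
rot_x, rot_y = rotate_batch(x)
loss = F.cross_entropy(net(rot_x), rot_y)
```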

LAION-5B: An open large-scale dataset for training next generation image-text models

This work presents LAION-5B, a dataset of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B have English-language text; it shows successful replication and fine-tuning of foundational models such as CLIP, GLIDE, and Stable Diffusion using the dataset, and discusses further experiments enabled by an openly available dataset of this scale.

Benchmarking Representation Learning for Natural World Image Collections

It is found that features produced by standard supervised methods still outperform those produced by self-supervised approaches such as SimCLR; however, improved self-supervised learning methods are constantly being released, and the iNat2021 and NeWT datasets are a valuable resource for tracking their progress.

DeiT III: Revenge of the ViT

This paper revisits the supervised training of ViTs, building upon and simplifying a recipe introduced for training ResNet-50, and includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning.
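
A hedged sketch of a three-augmentation pipeline in this spirit is shown below, using torchvision; the choice of grayscale/solarization/blur and the magnitudes are illustrative assumptions rather than the paper's tuned recipe.

```python
# Illustrative "three augmentations" pipeline: each image gets one of
# {grayscale, solarization, Gaussian blur}, plus standard crop/flip and color
# jitter. Probabilities and magnitudes are placeholders.
from torchvision import transforms

three_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomChoice([
        transforms.RandomGrayscale(p=1.0),
        transforms.RandomSolarize(threshold=128, p=1.0),
        transforms.GaussianBlur(kernel_size=9),
    ]),
    transforms.ColorJitter(0.3, 0.3, 0.3),
    transforms.ToTensor(),
])
```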
...