Momentum Contrast for Unsupervised Visual Representation Learning

@inproceedings{He2019MomentumCF,
  title={Momentum Contrast for Unsupervised Visual Representation Learning},
  author={Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross B. Girshick},
  booktitle={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={9726--9735},
  url={https://api.semanticscholar.org/CorpusID:207930212}
}
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo… 
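The two ingredients the abstract names (a queue that serves as the dictionary of encoded keys, and a key encoder updated as a moving average of the query encoder) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `momentum_update` and `KeyQueue` are invented here, parameters are plain floats rather than network weights, and the momentum coefficient `m` defaults to 0.999 as described in the paper.

```python
from collections import deque

def momentum_update(query_params, key_params, m=0.999):
    """EMA update of the key encoder (illustrative sketch):
    theta_k <- m * theta_k + (1 - m) * theta_q,
    so the key encoder evolves slowly and stays consistent."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """FIFO dictionary of encoded keys: each new mini-batch of keys is
    enqueued and the oldest keys are dequeued, decoupling the dictionary
    size from the mini-batch size."""
    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)  # maxlen drops oldest entries

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

    def __len__(self):
        return len(self.keys)
```

With a large `m`, the key encoder changes only slightly per step, which is what keeps the keys in the queue mutually consistent even though they were produced by different encoder snapshots.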


Unsupervised Learning of Dense Visual Representations

View-Agnostic Dense Representation (VADeR) is proposed for unsupervised learning of dense pixelwise representations, forcing local features to remain constant across different viewing conditions through pixel-level contrastive learning.

When Does Contrastive Visual Representation Learning Work?

Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on…

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps than supervised learning.

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

It is shown that learning additional invariances -- through the use of multi-scale cropping, stronger augmentations and nearest neighbors -- improves the representations and it is observed that MoCo learns spatially structured representations when trained with a multi-crop strategy.

Rethinking Image Mixture for Unsupervised Visual Representation Learning

Despite its conceptual simplicity, it is shown empirically that with a simple solution, image mixture, more robust visual representations can be learned from the transformed input, and the benefits of these representations are inherited by linear classification and downstream tasks.

Self-Supervised Visual Representation Learning from Hierarchical Grouping

We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into…

Efficient Visual Pretraining with Contrastive Detection

This work introduces a new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations, leading to state-of-the-art transfer accuracy on a variety of downstream tasks, while requiring up to 10× less pretraining.

Weakly Supervised Contrastive Learning

A weakly supervised contrastive learning framework (WCL) is proposed, based on two projection heads: one performs the regular instance discrimination task, while the other uses a graph-based method to find similar samples and generate a weak label that pulls similar images closer.

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons, using a swapped prediction mechanism that predicts the cluster assignment of one view from the representation of another view.

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

The pixel-level pretext tasks are found to be effective for pre-training not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods.
...

Unsupervised Visual Representation Learning by Context Prediction

It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.

Unsupervised Representation Learning by Predicting Image Rotations

This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.

Learning Features by Watching Objects Move

Inspired by the human visual system, low-level motion-based grouping cues can be used to learn an effective visual representation that significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

Data-Efficient Image Recognition with Contrastive Predictive Coding

This work revisits and improves Contrastive Predictive Coding, an unsupervised objective for learning representations that make the variability in natural signals more predictable, and produces features that support state-of-the-art linear classification accuracy on the ImageNet dataset.

Revisiting Self-Supervised Visual Representation Learning

This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study, and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.

Local Aggregation for Unsupervised Learning of Visual Embeddings

This work describes a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate.

Representation Learning with Contrastive Predictive Coding

This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.

Multi-task Self-Supervised Visual Learning

The results show that deeper networks work better, and that combining tasks, even via a naïve multi-head architecture, always improves performance.

Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

A novel unsupervised learning approach builds features suitable for object detection and classification; to facilitate the transfer of features to other tasks, the context-free network (CFN), a siamese-ennead convolutional neural network, is introduced.
...