Boosting Visual-Language Models by Exploiting Hard Samples

Wang, Haonan; Huang, Minbin; Huang, Runhui; Hong, Lanqing; Xu, Hang; Hu, Tianyang; Liang, Xiaodan; Li, Zhenguo; Cheng, Hong; Kawaguchi, Kenji

arXiv:2305.05208 (cs)

[Submitted on 9 May 2023 (v1), last revised 10 Mar 2024 (this version, v2)]

Abstract: Contrastive Language-Image Pre-training (CLIP) has become the standard for learning cross-modal representations between images and text. Efforts to improve its capabilities typically demand the collection of additional data and retraining with new loss functions. While effective, the added requirements limit their practical use due to the increased resource and time investments needed. In this work, we present HELIP, a cost-effective strategy tailored to enhance the performance of existing CLIP models without the need for training a model from scratch or collecting additional data. Our method allows for effortless integration with existing models' training pipelines, providing an instant boost by training them with selected challenging text-image pairs from their original training datasets. HELIP treats each text-image pair as a single point in the joint vision-language space, identifying those in close proximity as hard pairs. By incorporating the challenging data, pre-trained CLIP models are refined using both the traditional contrastive loss and the newly introduced hard negative margin loss, ensuring the challenging data is fully utilized. On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance. In particular, it improves the zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M and YFCC15M datasets. The improvements are 3.05%, 4.47%, and 10.1% respectively, achieved within two epochs of training. In addition, across fine-grained classification datasets, HELIP improves the zero-shot performance of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%.
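
The two ideas named in the abstract (hard pairs found by proximity in a joint space, and fine-tuning with a contrastive loss plus a margin loss on hard negatives) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not give the pair representation, the distance measure, or the form of the margin loss, so everything below is an assumption made for illustration: the function names `mine_hard_pairs`, `contrastive_loss`, and `hard_negative_margin_loss` are hypothetical, each pair is represented by concatenating its normalized image and text embeddings, proximity is cosine similarity, and the margin term is a plain hinge with hypothetical parameters `k`, `margin`, and `temperature`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def mine_hard_pairs(img_emb, txt_emb, k=2):
    """Return, for each pair, the indices of its k nearest other pairs.

    Assumption: a pair is one point formed by concatenating its normalized
    image and text embeddings; closeness is cosine similarity.
    """
    pair = l2_normalize(
        np.concatenate([l2_normalize(img_emb), l2_normalize(txt_emb)], axis=1)
    )
    sim = pair @ pair.T
    np.fill_diagonal(sim, -np.inf)  # a pair is never its own hard pair
    return np.argsort(-sim, axis=1)[:, :k]


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style used by CLIP."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T)
    )


def hard_negative_margin_loss(img_emb, txt_emb, hard_idx, margin=0.2):
    """Hinge loss asking each matched pair to beat its hard negatives by `margin`.

    Assumption: the negatives for pair i are the texts (and images) of the
    pairs listed in hard_idx[i]; the source does not specify this form.
    """
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    pos = np.diag(sim)[:, None]                           # matched similarity
    neg_i2t = np.take_along_axis(sim, hard_idx, axis=1)   # image i vs hard texts
    neg_t2i = np.take_along_axis(sim.T, hard_idx, axis=1) # text i vs hard images
    hinge = np.maximum(0.0, margin + neg_i2t - pos) + np.maximum(
        0.0, margin + neg_t2i - pos
    )
    return hinge.mean()
```

In a fine-tuning loop along these lines, the mined indices would be computed once over the original training set, and each batch would then minimize `contrastive_loss(...) + weight * hard_negative_margin_loss(...)`, where `weight` is another hypothetical hyperparameter.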

Comments:	The code is publicly available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.05208 [cs.CV]
	(or arXiv:2305.05208v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.05208

Submission history

From: Haonan Wang
[v1] Tue, 9 May 2023 07:00:17 UTC (11,774 KB)
[v2] Sun, 10 Mar 2024 14:00:53 UTC (20,477 KB)
