Abstract
CLIP bridges vision and language by learning image and text representations jointly. As a large pre-trained vision-language model, CLIP generalises well and has demonstrated excellent few-shot learning capabilities. Numerous studies have adapted CLIP to few-shot learning on downstream visual tasks, with consistently strong results. However, current approaches, such as adapter- and prompt-based methods, still fall short in exploiting CLIP's visual feature extraction: many only fine-tune an adapter after feature extraction, leaving the backbone's representational capacity under-utilised. In addition, fine-tuning with a key-value cache improves performance significantly, but it requires careful, dataset-specific tuning of the model's hyperparameters. To address these issues, we propose a new approach: fine-tuning multi-layer features of the visual backbone with side adapters that adaptively select among layers, so that task-relevant visual features are extracted efficiently at each depth. Additionally, we augment the original features with dynamic feature fusion to reduce reliance on hyperparameter tuning. Extensive experiments on 11 datasets verify the superiority of the proposed method over existing methods.
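To make the idea concrete, below is a minimal sketch, not the authors' released implementation, of how side adapters and dynamic fusion could be wired onto a frozen visual backbone. The module names (`SideAdapter`, `MultiLayerTunedCLIP`), the bottleneck width, and the stand-in backbone blocks are all illustrative assumptions; a real setup would tap the transformer blocks of CLIP's image encoder.

```python
# A minimal sketch of the idea in the abstract: lightweight side adapters tap
# multi-layer features of a frozen visual backbone, and a learned softmax gate
# fuses them with the original feature. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideAdapter(nn.Module):
    """Bottleneck adapter applied to one intermediate layer's features."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.relu(self.down(x)))

class MultiLayerTunedCLIP(nn.Module):
    """Frozen backbone + per-layer side adapters + dynamic feature fusion."""
    def __init__(self, backbone_blocks: nn.ModuleList, dim: int):
        super().__init__()
        self.blocks = backbone_blocks                # frozen visual blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)                  # only adapters are trained
        self.adapters = nn.ModuleList(SideAdapter(dim) for _ in backbone_blocks)
        # One learnable logit per layer; softmax turns them into fusion weights,
        # so layer selection is learned rather than set per dataset by hand.
        self.layer_logits = nn.Parameter(torch.zeros(len(backbone_blocks)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapted = []
        for block, adapter in zip(self.blocks, self.adapters):
            x = block(x)                             # frozen forward pass
            adapted.append(adapter(x))               # cheap side branch per layer
        w = torch.softmax(self.layer_logits, dim=0)
        side = sum(wi * fi for wi, fi in zip(w, adapted))
        return x + side                              # fuse with original feature

# Toy usage with stand-in blocks (real CLIP image-encoder blocks would go here).
blocks = nn.ModuleList(nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(4))
model = MultiLayerTunedCLIP(blocks, dim=512)
feats = model(torch.randn(8, 512))                   # (batch, dim) features
print(feats.shape)                                   # torch.Size([8, 512])
```

The softmax gate over the layer logits is one way to realise "adaptive selection": the fusion weights are trained jointly with the adapters, which is what reduces the dependence on hand-tuned, dataset-specific hyperparameters.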
Acknowledgements
This work is supported by the Ningbo Key Research and Development Program (Grant No. 2023Z057), Ningbo Natural Science Foundation (2023J281), and the Fundamental Research Funds for the Central Universities (226-2024-00058).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, R. et al. (2025). Multi-layer Tuning CLIP for Few-Shot Image Classification. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_12
DOI: https://doi.org/10.1007/978-981-97-8620-6_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8619-0
Online ISBN: 978-981-97-8620-6
eBook Packages: Computer Science; Computer Science (R0)