
Multi-layer Tuning CLIP for Few-Shot Image Classification

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15035)


Abstract

CLIP bridges the gap between vision and language by learning image and text representations jointly. As a large pre-trained vision-language model, it generalises well and has demonstrated excellent few-shot learning capabilities. Numerous studies have applied CLIP to few-shot learning on downstream visual tasks, with strong results. However, current methods, such as adapter- and prompt-based approaches, still fall short in extracting visual features with CLIP: many fine-tune only an adapter placed after feature extraction, failing to fully utilise the feature-extraction potential of the CLIP backbone. In addition, fine-tuning with a key-value cache improves performance significantly, but requires careful tuning of the model's hyperparameters for each specific dataset. Considering these issues, we propose a new approach: fine-tuning multi-layer features with side adapters that adaptively select among layers of the visual backbone, efficiently extracting task-relevant visual features from different depths. We further augment the original features through dynamic feature fusion to reduce the reliance on hyperparameter tuning. Extensive experiments on 11 datasets verify the superiority of the proposed method over existing methods.
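The abstract names two mechanisms: side adapters that adaptively weight features drawn from multiple layers of the frozen visual backbone, and a dynamic fusion that blends the adapted signal with the original CLIP feature in place of a hand-tuned mixing hyperparameter. Below is a minimal PyTorch sketch of how such a design could look; the module and parameter names (SideAdapter, MultiLayerTuner, fusion_gate, layer_logits) and the bottleneck adapter shape are our own illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of the abstract's two ideas: per-layer side adapters
# with adaptive layer selection, plus dynamic fusion with the original
# CLIP feature. Names and shapes are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SideAdapter(nn.Module):
    """Bottleneck adapter applied to one intermediate layer's features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.relu(self.down(x)))


class MultiLayerTuner(nn.Module):
    """Adaptively combines side-adapted multi-layer features, then fuses
    them with the original CLIP feature via a learned, per-sample weight
    (replacing a hand-tuned, dataset-specific mixing hyperparameter)."""

    def __init__(self, dim: int, num_tapped_layers: int):
        super().__init__()
        self.adapters = nn.ModuleList(
            [SideAdapter(dim) for _ in range(num_tapped_layers)]
        )
        # Learnable logits implement adaptive selection over tapped layers.
        self.layer_logits = nn.Parameter(torch.zeros(num_tapped_layers))
        # Dynamic fusion gate: predicts how much adapted signal to mix in.
        self.fusion_gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, layer_feats: list[torch.Tensor],
                clip_feat: torch.Tensor) -> torch.Tensor:
        # layer_feats: pooled (B, dim) features from each tapped layer;
        # clip_feat: the backbone's original (B, dim) output feature.
        weights = torch.softmax(self.layer_logits, dim=0)
        adapted = sum(w * a(f) for w, a, f
                      in zip(weights, self.adapters, layer_feats))
        alpha = self.fusion_gate(torch.cat([clip_feat, adapted], dim=-1))
        fused = alpha * adapted + (1 - alpha) * clip_feat
        return F.normalize(fused, dim=-1)  # cosine-ready, as in CLIP
```

In such a setup, only the adapters, the layer logits, and the fusion gate would be trained on the few-shot data while the CLIP backbone stays frozen, and the fused feature would be compared against CLIP's text embeddings by cosine similarity, as in standard CLIP classification.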




Acknowledgements

This work is supported by the Ningbo Key Research and Development Program (Grant No. 2023Z057), Ningbo Natural Science Foundation (2023J281), and the Fundamental Research Funds for the Central Universities (226-2024-00058).

Author information


Corresponding author

Correspondence to Yijun Bei.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhang, R. et al. (2025). Multi-layer Tuning CLIP for Few-Shot Image Classification. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_12


  • DOI: https://doi.org/10.1007/978-981-97-8620-6_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8619-0

  • Online ISBN: 978-981-97-8620-6

