Abstract
CLIP bridges vision and language by learning image and text representations jointly. As a large pre-trained vision-language model, CLIP generalises well and has demonstrated excellent few-shot learning capabilities. Numerous studies have adapted CLIP to few-shot learning on downstream visual tasks, with consistently strong results. However, current approaches, such as adapter- and prompt-based methods, still fall short in exploiting CLIP's visual feature extraction: many only fine-tune an adapter after feature extraction, leaving the backbone's representational capacity under-utilised. In addition, fine-tuning with a key-value cache improves performance significantly, but it requires careful, dataset-specific tuning of the model's hyperparameters. To address these issues, we propose a new approach: fine-tuning multi-layer features of the visual backbone with side adapters that adaptively select among layers, so that task-relevant visual features are extracted efficiently at each depth. Additionally, we augment the original features with dynamic feature fusion to reduce reliance on hyperparameter tuning. Extensive experiments on 11 datasets verify the superiority of the proposed method over existing methods.
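To make the idea concrete, below is a minimal sketch, not the authors' released implementation, of how side adapters and dynamic fusion could be wired onto a frozen visual backbone. The module names (`SideAdapter`, `MultiLayerTunedCLIP`), the bottleneck width, and the stand-in backbone blocks are all illustrative assumptions; a real setup would tap the transformer blocks of CLIP's image encoder.

```python
# A minimal sketch of the idea in the abstract: lightweight side adapters tap
# multi-layer features of a frozen visual backbone, and a learned softmax gate
# fuses them with the original feature. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideAdapter(nn.Module):
    """Bottleneck adapter applied to one intermediate layer's features."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.relu(self.down(x)))

class MultiLayerTunedCLIP(nn.Module):
    """Frozen backbone + per-layer side adapters + dynamic feature fusion."""
    def __init__(self, backbone_blocks: nn.ModuleList, dim: int):
        super().__init__()
        self.blocks = backbone_blocks                # frozen visual blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)                  # only adapters are trained
        self.adapters = nn.ModuleList(SideAdapter(dim) for _ in backbone_blocks)
        # One learnable logit per layer; softmax turns them into fusion weights,
        # so layer selection is learned rather than set per dataset by hand.
        self.layer_logits = nn.Parameter(torch.zeros(len(backbone_blocks)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapted = []
        for block, adapter in zip(self.blocks, self.adapters):
            x = block(x)                             # frozen forward pass
            adapted.append(adapter(x))               # cheap side branch per layer
        w = torch.softmax(self.layer_logits, dim=0)
        side = sum(wi * fi for wi, fi in zip(w, adapted))
        return x + side                              # fuse with original feature

# Toy usage with stand-in blocks (real CLIP image-encoder blocks would go here).
blocks = nn.ModuleList(nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(4))
model = MultiLayerTunedCLIP(blocks, dim=512)
feats = model(torch.randn(8, 512))                   # (batch, dim) features
print(feats.shape)                                   # torch.Size([8, 512])
```

The softmax gate over the layer logits is one way to realise "adaptive selection": the fusion weights are trained jointly with the adapters, which is what reduces the dependence on hand-tuned, dataset-specific hyperparameters.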
Acknowledgements
This work is supported by the Ningbo Key Research and Development Program (Grant No. 2023Z057), Ningbo Natural Science Foundation (2023J281), and the Fundamental Research Funds for the Central Universities (226-2024-00058).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, R. et al. (2025). Multi-layer Tuning CLIP for Few-Shot Image Classification. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_12
DOI: https://doi.org/10.1007/978-981-97-8620-6_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8619-0
Online ISBN: 978-981-97-8620-6
eBook Packages: Computer Science; Computer Science (R0)