Abstract
Pretrained models have become a commodity and offer strong results on a broad range of tasks. In this work, we focus on classification and seek to learn a unique encoder able to draw from several complementary pretrained models, aiming at even stronger generalization across a variety of classification tasks. We propose to learn such an encoder via multi-teacher distillation. We first thoroughly analyze standard distillation when it is driven by multiple strong teachers with complementary strengths. Guided by this analysis, we gradually improve upon the basic distillation setup. Among these improvements, we enrich the architecture of the encoder with a ladder of expendable projectors, which increases the impact of intermediate features during distillation, and we introduce teacher dropping, a regularization mechanism that better balances the teachers’ influence. Our final distillation strategy yields student models with the same capacity as any single teacher, while matching or improving upon the performance of the best teacher on each task.
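To make the setup above concrete, below is a minimal PyTorch-style sketch of multi-teacher feature distillation with expendable projector heads and teacher dropping. Everything in it is an illustrative assumption rather than the authors’ released implementation: the module names, the cosine-based distillation loss, the drop probability, and the simplification of attaching a single projector per teacher to the final student feature (the paper’s ladder also taps intermediate features).

```python
# Illustrative sketch (not the paper's code): multi-teacher distillation with
# expendable projector heads and teacher dropping, simplified to the final
# student feature only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpendableProjectors(nn.Module):
    """One small projector head per teacher, discarded after distillation.

    The paper's "ladder" also attaches such heads to intermediate student
    features; this sketch keeps only the final feature for brevity.
    """

    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(student_dim, student_dim),
                           nn.GELU(),
                           nn.Linear(student_dim, d))
             for d in teacher_dims]
        )

    def forward(self, z):
        # Map the shared student feature z to each teacher's embedding space.
        return [head(z) for head in self.heads]


def distill_step(student, projectors, teachers, images, drop_prob=0.5):
    """One training step: match every kept teacher's (frozen) features."""
    z = student(images)               # shared student representation
    preds = projectors(z)             # one prediction per teacher

    losses = []
    for pred, teacher in zip(preds, teachers):
        with torch.no_grad():
            target = teacher(images)  # frozen teacher features
        # Cosine distance as an assumed per-teacher distillation loss.
        losses.append(1.0 - F.cosine_similarity(pred, target, dim=-1).mean())

    # Teacher dropping: randomly zero out some teachers' losses so that no
    # single teacher dominates the gradients (one simple way to realize the
    # regularization named in the abstract).
    keep = (torch.rand(len(losses)) > drop_prob).float()
    if keep.sum() == 0:               # always keep at least one teacher
        keep[torch.randint(len(losses), (1,))] = 1.0
    return torch.stack(losses) @ keep / keep.sum()
```

In a full training loop, the student and the projector heads would be optimized jointly while all teachers stay frozen; after distillation the heads are thrown away and only the student encoder is kept (see note 2 below).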
Notes
- 1. The 15 datasets are: 5 ImageNet-CoG levels [49] tailored for concept generalization, 8 small-scale fine-grained datasets (Aircraft, Cars196, DTD, EuroSAT, Flowers, Pets, Food101, SUN397), and 2 long-tail datasets (iNaturalist 2018 and 2019).
- 2. Projector heads are discarded after distillation and linear probes are learned over the encoder outputs \({\boldsymbol{z}}\) (a minimal probing sketch follows these notes).
- 3. We use the dBOT model fine-tuned for ImageNet-1K classification.
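As a companion to note 2, here is a minimal linear-probing sketch over a frozen distilled encoder. The function name, optimizer, and hyper-parameters are assumptions for illustration, not the paper’s evaluation protocol.

```python
# Illustrative sketch (not the paper's protocol): train a linear probe on top
# of a frozen encoder after the projector heads have been discarded.
import torch
import torch.nn as nn


def train_linear_probe(encoder, loader, feat_dim, num_classes, epochs=10):
    encoder.eval()                                    # encoder stays frozen
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                z = encoder(images)                   # encoder outputs z
            loss = criterion(probe(z), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```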
References
Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., Dai, Z.: Variational information distillation for knowledge transfer. In: Proceedings of CVPR (2019)
Asif, U., Tang, J., Harrer, S.: Ensemble knowledge distillation for learning improved and efficient networks. In: Proceedings of ECAI (2020)
Ba, J., Frey, B.: Adaptive dropout for training deep neural networks. In: Proceedings of NeurIPS (2013)
Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of SIGKDD (2006)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of ICCV (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of ICML (2020)
Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of CVPR (2021)
Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In: Proceedings of ICML (2018)
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of CVPR (2014)
Clark, K., Luong, M.T., Khandelwal, U., Manning, C.D., Le, Q.V.: BAM! Born-again multi-task networks for natural language understanding. In: Proceedings of ACL (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of ICLR (2021)
Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., Ramabhadran, B.: Efficient knowledge distillation from an ensemble of teachers. In: Interspeech (2017)
Ghiasi, G., Zoph, B., Cubuk, E.D., Le, Q.V., Lin, T.Y.: Multi-task self-training for learning general representations. In: Proceedings of CVPR (2021)
Hao, Z., et al.: Learning efficient vision transformers via fine-grained manifold distillation. In: Proceedings of NeurIPS (2022)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of CVPR (2022)
Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. JSTAEORS (2019)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: Proceedings of ICCV (2021)
Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: Proceedings of ICCV (2019)
Heo, B., Lee, M., Yun, S., Choi, J.Y.: Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In: Proceedings of AAAI (2019)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: Proceedings of NeurIPS-W (2014)
Hu, H., Dey, D., Hebert, M., Bagnell, J.A.: Learning anytime predictions in neural networks via adaptive loss balancing. In: Proceedings of AAAI (2019)
Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of CVPR (2018)
Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
Krause, J., Deng, J., Stark, M., Li, F.F.: Collecting a large-scale dataset of fine-grained cars. In: Proceedings of CVPR-W (2013)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NeurIPS (2012)
Landgraf, S., Hillemann, M., Kapler, T., Ulrich, M.: Efficient multi-task uncertainties for joint semantic segmentation and monocular depth estimation. arXiv:2402.10580 (2024)
Liu, X., Zhou, J., Kong, T., Lin, X., Ji, R.: Exploring target representations for masked autoencoders. In: Proceedings of ICLR (2022)
Liu, Y., Zhang, W., Wang, J.: Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing (2020)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv:1306.5151 (2013)
Marrie, J., Arbel, M., Mairal, J., Larlus, D.: On good practices for task-specific distillation of large pretrained models. arXiv:2402.11305 (2024)
Matena, M.S., Raffel, C.A.: Merging models with fisher-weighted averaging. In: Proceedings of NeurIPS (2022)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of ICVGIP (2008)
Oquab, M., et al.: DINOv2: Learning robust visual features without supervision. TMLR (2024)
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: Proceedings of CVPR (2012)
Peng, Z., Dong, L., Bao, H., Wei, F., Ye, Q.: A unified view of masked image modeling. TMLR (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of ICML (2021)
Ramé, A., Ahuja, K., Zhang, J., Cord, M., Bottou, L., Lopez-Paz, D.: Model ratatouille: Recycling diverse models for out-of-distribution generalization. In: Proceedings of ICML (2023)
Rame, A., Kirchmeyer, M., Rahier, T., Rakotomamonjy, A., Gallinari, P., Cord, M.: Diverse weight averaging for out-of-distribution generalization. In: Proceedings of NeurIPS (2022)
Ranzinger, M., Heinrich, G., Kautz, J., Molchanov, P.: AM-RADIO: Agglomerative model–reduce all domains into one. In: Proceedings of CVPR (2024)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: Proceedings of ICML (2019)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: hints for thin deep nets. In: Proceedings of ICLR (2015)
Roth, K., Milbich, T., Ommer, B., Cohen, J.P., Ghassemi, M.: Simultaneous similarity-based self-distillation for deep metric learning. In: Proceedings of ICML (2021)
Roth, K., Thede, L., Koepke, A.S., Vinyals, O., Henaff, O.J., Akata, Z.: Fantastic gains and where to find them: on the existence and prospect of general knowledge transfer between any pretrained model. In: Proceedings of ICLR (2024)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015)
Sariyildiz, M.B., Alahari, K., Larlus, D., Kalantidis, Y.: Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. In: Proceedings of CVPR (2023)
Sariyildiz, M.B., Kalantidis, Y., Alahari, K., Larlus, D.: No reason for no supervision: Improved generalization in supervised models. In: Proceedings of ICLR (2023)
Sariyildiz, M.B., Kalantidis, Y., Larlus, D., Alahari, K.: Concept generalization in visual representation learning. In: Proceedings of ICCV (2021)
Shi, B., et al.: Hybrid distillation: Connecting masked autoencoders with contrastive learners. In: Proceedings of ICLR (2024)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1) (2014)
Stoica, G., Bolya, D., Bjorner, J., Ramesh, P., Hearn, T., Hoffman, J.: Zipit! merging models from different tasks without training. In: Proceedings of ICLR (2024)
Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: Proceedings of ICLR (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Proceedings of ICML (2021)
Touvron, H., Cord, M., Jegou, H.: DeiT III: Revenge of the ViT. In: Proceedings of ECCV (2022). https://doi.org/10.1007/978-3-031-20053-3_30
Van Horn, G., et al.: The iNaturalist species classification and detection dataset. In: Proceedings of CVPR (2018)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: Proceedings of NeurIPS (2019)
Wang, H., et al.: SAM-CLIP: merging vision foundation models towards semantic and spatial understanding. In: Proceedings of CVPR-W (2023)
Wang, Y., et al.: Revisiting the transferability of supervised pretraining: an MLP perspective. In: Proceedings of CVPR (2022)
Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: Mvp: multimodality-guided visual pre-training. In: Proceedings of ECCV (2022)
Wortsman, M., et al.: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: Proceedings of ICML (2022)
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: Proceedings of CVPR (2010)
Xu, H., et al.: Demystifying CLIP data. In: Proceedings of ICLR (2024)
Yao, Y., Desai, N., Palaniswami, M.: MOMA: Distill from self-supervised teachers. arXiv:2302.02089 (2023)
Ye, P., et al.: Merging vision transformers from different tasks and domains. arXiv:2312.16240 (2023)
Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In: Proceedings of CVPR (2017)
Yin, D., Han, X., Li, B., Feng, H., Bai, J.: Parameter-efficient is not sufficient: Exploring parameter, memory, and time efficient adapter tuning for dense predictions. arXiv:2306.09729 (2023)
You, S., Xu, C., Xu, C., Tao, D.: Learning from multiple teacher networks. In: Proceedings of SIGKDD (2017)
Ypsilantis, N.A., Chen, K., Araujo, A., Chum, O.: UDON: Universal dynamic online distillation for generic image representations. arXiv:2406.08332 (2024)
Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K., Loy, C.C.: Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively. In: Proceedings of ECCV (2024)
Yun, H., Cho, H.: Achievement-based training progress balancing for multi-task learning. In: Proceedings of ICCV (2023)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: Proceedings of ICLR (2017)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20k dataset. IJCV (2019)
Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. In: Proceedings of ICLR (2022)
Acknowledgements
The authors would like to sincerely thank Myung-Ho Ju, Florent Perronnin, Rafael Sampaio de Rezende, Vassilina Nikoulina and Jean-Marc Andreoli for inspiring discussions and many thoughtful comments.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sarıyıldız, M.B., Weinzaepfel, P., Lucas, T., Larlus, D., Kalantidis, Y. (2025). UNIC: Universal Classification Models via Multi-teacher Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15062. Springer, Cham. https://doi.org/10.1007/978-3-031-73235-5_20
DOI: https://doi.org/10.1007/978-3-031-73235-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73234-8
Online ISBN: 978-3-031-73235-5
eBook Packages: Computer Science, Computer Science (R0)