Abstract
Pretrained models have become a commodity and offer strong results on a broad range of tasks. In this work, we focus on classification and seek to learn a unique encoder able to draw from several complementary pretrained models, aiming at even stronger generalization across a variety of classification tasks. We propose to learn such an encoder via multi-teacher distillation. We first thoroughly analyze standard distillation when it is driven by multiple strong teachers with complementary strengths. Guided by this analysis, we gradually improve upon the basic distillation setup. Among these improvements, we enrich the architecture of the encoder with a ladder of expendable projectors, which increases the impact of intermediate features during distillation, and we introduce teacher dropping, a regularization mechanism that better balances the teachers’ influence. Our final distillation strategy yields student models with the same capacity as any single teacher, while matching or improving upon the performance of the best teacher on each task.
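To make the setup above concrete, below is a minimal PyTorch-style sketch of multi-teacher feature distillation with expendable projector heads and teacher dropping. Everything in it is an illustrative assumption rather than the authors’ released implementation: the module names, the cosine-based distillation loss, the drop probability, and the simplification of attaching a single projector per teacher to the final student feature (the paper’s ladder also taps intermediate features).

```python
# Illustrative sketch (not the paper's code): multi-teacher distillation with
# expendable projector heads and teacher dropping, simplified to the final
# student feature only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpendableProjectors(nn.Module):
    """One small projector head per teacher, discarded after distillation.

    The paper's "ladder" also attaches such heads to intermediate student
    features; this sketch keeps only the final feature for brevity.
    """

    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(student_dim, student_dim),
                           nn.GELU(),
                           nn.Linear(student_dim, d))
             for d in teacher_dims]
        )

    def forward(self, z):
        # Map the shared student feature z to each teacher's embedding space.
        return [head(z) for head in self.heads]


def distill_step(student, projectors, teachers, images, drop_prob=0.5):
    """One training step: match every kept teacher's (frozen) features."""
    z = student(images)               # shared student representation
    preds = projectors(z)             # one prediction per teacher

    losses = []
    for pred, teacher in zip(preds, teachers):
        with torch.no_grad():
            target = teacher(images)  # frozen teacher features
        # Cosine distance as an assumed per-teacher distillation loss.
        losses.append(1.0 - F.cosine_similarity(pred, target, dim=-1).mean())

    # Teacher dropping: randomly zero out some teachers' losses so that no
    # single teacher dominates the gradients (one simple way to realize the
    # regularization named in the abstract).
    keep = (torch.rand(len(losses)) > drop_prob).float()
    if keep.sum() == 0:               # always keep at least one teacher
        keep[torch.randint(len(losses), (1,))] = 1.0
    return torch.stack(losses) @ keep / keep.sum()
```

In a full training loop, the student and the projector heads would be optimized jointly while all teachers stay frozen; after distillation the heads are thrown away and only the student encoder is kept (see note 2 below).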
Notes
- 1. The 15 datasets are: 5 ImageNet-CoG levels [49] tailored for concept generalization, 8 small-scale fine-grained datasets (Aircraft, Cars196, DTD, EuroSAT, Flowers, Pets, Food101, SUN397), and 2 long-tail datasets (iNaturalist 2018 and 2019).
- 2. Projector heads are discarded after distillation and linear probes are learned over the encoder outputs \({\boldsymbol{z}}\) (a minimal probing sketch follows these notes).
- 3. We use the dBOT model fine-tuned for ImageNet-1K classification.
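As a companion to note 2, here is a minimal linear-probing sketch over a frozen distilled encoder. The function name, optimizer, and hyper-parameters are assumptions for illustration, not the paper’s evaluation protocol.

```python
# Illustrative sketch (not the paper's protocol): train a linear probe on top
# of a frozen encoder after the projector heads have been discarded.
import torch
import torch.nn as nn


def train_linear_probe(encoder, loader, feat_dim, num_classes, epochs=10):
    encoder.eval()                                    # encoder stays frozen
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                z = encoder(images)                   # encoder outputs z
            loss = criterion(probe(z), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```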
References
Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., Dai, Z.: Variational information distillation for knowledge transfer. In: Proceedings of CVPR (2019)
Asif, U., Tang, J., Harrer, S.: Ensemble knowledge distillation for learning improved and efficient networks. In: Proceedings of ECAI (2020)
Ba, J., Frey, B.: Adaptive dropout for training deep neural networks. In: Proceedings of NeurIPS (2013)
Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of SIGKDD (2006)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of ICCV (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of ICML (2020)
Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of CVPR (2021)
Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In: Proceedings of ICML (2018)
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of CVPR (2014)
Clark, K., Luong, M.T., Khandelwal, U., Manning, C.D., Le, Q.V.: BAM! Born-again multi-task networks for natural language understanding. In: Proceedings of ACL (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of ICLR (2021)
Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., Ramabhadran, B.: Efficient knowledge distillation from an ensemble of teachers. In: Interspeech (2017)
Ghiasi, G., Zoph, B., Cubuk, E.D., Le, Q.V., Lin, T.Y.: Multi-task self-training for learning general representations. In: Proceedings of CVPR (2021)
Hao, Z., et al.: Learning efficient vision transformers via fine-grained manifold distillation. In: Proceedings of NeurIPS (2022)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of CVPR (2022)
Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. JSTAEORS (2019)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: Proceedings of ICCV (2021)
Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: Proceedings of ICCV (2019)
Heo, B., Lee, M., Yun, S., Choi, J.Y.: Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In: Proceedings of AAAI (2019)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: Proceedings of NeurIPS-W (2014)
Hu, H., Dey, D., Hebert, M., Bagnell, J.A.: Learning anytime predictions in neural networks via adaptive loss balancing. In: Proceedings of AAAI (2019)
Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of CVPR (2018)
Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
Krause, J., Deng, J., Stark, M., Li, F.F.: Collecting a large-scale dataset of fine-grained cars. In: Proceedings of CVPR-W (2013)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NeurIPS (2012)
Landgraf, S., Hillemann, M., Kapler, T., Ulrich, M.: Efficient multi-task uncertainties for joint semantic segmentation and monocular depth estimation. arXiv:2402.10580 (2024)
Liu, X., Zhou, J., Kong, T., Lin, X., Ji, R.: Exploring target representations for masked autoencoders. In: Proceedings of ICLR (2022)
Liu, Y., Zhang, W., Wang, J.: Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing (2020)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv:1306.5151 (2013)
Marrie, J., Arbel, M., Mairal, J., Larlus, D.: On good practices for task-specific distillation of large pretrained models. arXiv:2402.11305 (2024)
Matena, M.S., Raffel, C.A.: Merging models with fisher-weighted averaging. In: Proceedings of NeurIPS (2022)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of ICVGIP (2008)
Oquab, M., et al.: DINOv2: Learning robust visual features without supervision. TMLR (2024)
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: Proceedings of CVPR (2012)
Peng, Z., Dong, L., Bao, H., Wei, F., Ye, Q.: A unified view of masked image modeling. TMLR (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of ICML (2021)
Ramé, A., Ahuja, K., Zhang, J., Cord, M., Bottou, L., Lopez-Paz, D.: Model ratatouille: Recycling diverse models for out-of-distribution generalization. In: Proceedings of ICML (2023)
Rame, A., Kirchmeyer, M., Rahier, T., Rakotomamonjy, A., Gallinari, P., Cord, M.: Diverse weight averaging for out-of-distribution generalization. In: Proceedings of NeurIPS (2022)
Ranzinger, M., Heinrich, G., Kautz, J., Molchanov, P.: AM-RADIO: Agglomerative model–reduce all domains into one. In: Proceedings of CVPR (2024)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: Proceedings of ICML (2019)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: hints for thin deep nets. In: Proceedings of ICLR (2015)
Roth, K., Milbich, T., Ommer, B., Cohen, J.P., Ghassemi, M.: Simultaneous similarity-based self-distillation for deep metric learning. In: Proceedings of ICML (2021)
Roth, K., Thede, L., Koepke, A.S., Vinyals, O., Henaff, O.J., Akata, Z.: Fantastic gains and where to find them: on the existence and prospect of general knowledge transfer between any pretrained model. In: Proceedings of ICLR (2024)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015)
Sariyildiz, M.B., Alahari, K., Larlus, D., Kalantidis, Y.: Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. In: Proceedings of CVPR (2023)
Sariyildiz, M.B., Kalantidis, Y., Alahari, K., Larlus, D.: No reason for no supervision: Improved generalization in supervised models. In: Proceedings of ICLR (2023)
Sariyildiz, M.B., Kalantidis, Y., Larlus, D., Alahari, K.: Concept generalization in visual representation learning. In: Proceedings of ICCV (2021)
Shi, B., et al.: Hybrid distillation: Connecting masked autoencoders with contrastive learners. In: Proceedings of ICLR (2024)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1) (2014)
Stoica, G., Bolya, D., Bjorner, J., Ramesh, P., Hearn, T., Hoffman, J.: Zipit! merging models from different tasks without training. In: Proceedings of ICLR (2024)
Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: Proceedings of ICLR (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Proceedings of ICML (2021)
Touvron, H., Cord, M., Jegou, H.: DeiT III: Revenge of the ViT. In: Proceedings of ECCV (2022). https://doi.org/10.1007/978-3-031-20053-3_30
Van Horn, G., et al.: The iNaturalist species classification and detection dataset. In: Proceedings of CVPR (2018)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: Proceedings of NeurIPS (2019)
Wang, H., et al.: SAM-CLIP: merging vision foundation models towards semantic and spatial understanding. In: Proceedings of CVPR-W (2023)
Wang, Y., et al.: Revisiting the transferability of supervised pretraining: an MLP perspective. In: Proceedings of CVPR (2022)
Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: Mvp: multimodality-guided visual pre-training. In: Proceedings of ECCV (2022)
Wortsman, M., et al.: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: Proceedings of ICML (2022)
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: Proceedings of CVPR (2010)
Xu, H., et al.: Demystifying CLIP data. In: Proceedings of ICLR (2024)
Yao, Y., Desai, N., Palaniswami, M.: MOMA: Distill from self-supervised teachers. arXiv:2302.02089 (2023)
Ye, P., et al.: Merging vision transformers from different tasks and domains. arXiv:2312.16240 (2023)
Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In: Proceedings of CVPR (2017)
Yin, D., Han, X., Li, B., Feng, H., Bai, J.: Parameter-efficient is not sufficient: Exploring parameter, memory, and time efficient adapter tuning for dense predictions. arXiv:2306.09729 (2023)
You, S., Xu, C., Xu, C., Tao, D.: Learning from multiple teacher networks. In: Proceedings of SIGKDD (2017)
Ypsilantis, N.A., Chen, K., Araujo, A., Chum, O.: UDON: Universal dynamic online distillation for generic image representations. arXiv:2406.08332 (2024)
Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K., Loy, C.C.: Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively. In: Proceedings of ECCV (2024)
Yun, H., Cho, H.: Achievement-based training progress balancing for multi-task learning. In: Proceedings of ICCV (2023)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: Proceedings of ICLR (2017)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20k dataset. IJCV (2019)
Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. In: Proceedings of ICLR (2022)
Acknowledgements
The authors would like to sincerely thank Myung-Ho Ju, Florent Perronnin, Rafael Sampaio de Rezende, Vassilina Nikoulina and Jean-Marc Andreoli for inspiring discussions and many thoughtful comments.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sarıyıldız, M.B., Weinzaepfel, P., Lucas, T., Larlus, D., Kalantidis, Y. (2025). UNIC: Universal Classification Models via Multi-teacher Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15062. Springer, Cham. https://doi.org/10.1007/978-3-031-73235-5_20
DOI: https://doi.org/10.1007/978-3-031-73235-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73234-8
Online ISBN: 978-3-031-73235-5
eBook Packages: Computer Science, Computer Science (R0)