UNIC: Universal Classification Models via Multi-teacher Distillation

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15062)


Abstract

Pretrained models have become a commodity and offer strong results on a broad range of tasks. In this work, we focus on classification and seek to learn a unique encoder able to draw from several complementary pretrained models, aiming at even stronger generalization across a variety of classification tasks. We propose to learn such an encoder via multi-teacher distillation. We first thoroughly analyze standard distillation when driven by multiple strong teachers with complementary strengths. Guided by this analysis, we gradually propose improvements to the basic distillation setup. Among those, we enrich the architecture of the encoder with a ladder of expendable projectors, which increases the impact of intermediate features during distillation, and we introduce teacher dropping, a regularization mechanism that better balances the teachers’ influence. Our final distillation strategy leads to student models of the same capacity as any of the teachers, while retaining or improving upon the performance of the best teacher for each task.
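
In code, the recipe sketched above amounts to a single student encoder that is regressed, through one lightweight projector per teacher, onto the features of several frozen pretrained models, with some teachers randomly dropped from the loss at each step. The sketch below is our own minimal PyTorch-style illustration under those assumptions, not the authors’ released implementation: the cosine feature-matching loss, the single final-layer head per teacher (the paper’s ladder of projectors over intermediate features is omitted), and the purely random dropping rule are simplifications, and all names (MultiTeacherDistiller, drop_p, ...) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTeacherDistiller(nn.Module):
    """Minimal sketch of multi-teacher feature distillation with teacher dropping."""

    def __init__(self, student, teachers, student_dim, teacher_dims, drop_p=0.5):
        super().__init__()
        self.student = student                    # trainable encoder
        self.teachers = nn.ModuleList(teachers)   # frozen pretrained teachers
        for t in self.teachers:
            t.requires_grad_(False)
        # One expendable projector per teacher, mapping student features to
        # that teacher's embedding space; these heads are dropped after training.
        self.heads = nn.ModuleList(nn.Linear(student_dim, d) for d in teacher_dims)
        self.drop_p = drop_p

    def forward(self, x):
        z = self.student(x)                       # student features z
        per_teacher = []
        for teacher, head in zip(self.teachers, self.heads):
            with torch.no_grad():
                target = teacher(x)               # distillation target
            sim = F.cosine_similarity(head(z), target, dim=-1).mean()
            per_teacher.append(1.0 - sim)         # cosine feature-matching loss
        per_teacher = torch.stack(per_teacher)

        # Teacher dropping (simplified, random variant): mask some teachers'
        # losses at each step, keeping at least one, so that no single teacher
        # dominates training.
        keep = (torch.rand_like(per_teacher) > self.drop_p).float()
        if keep.sum() == 0:
            keep[torch.randint(len(keep), (1,), device=keep.device)] = 1.0
        return (per_teacher * keep).sum() / keep.sum()
```

In a training loop one would compute loss = distiller(images) and backpropagate into the student and the heads only; after distillation the heads are discarded and just distiller.student is kept, matching the student-has-teacher-capacity setup described in the abstract.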

Notes

  1. The 15 datasets are: 5 ImageNet-CoG levels [49] tailored for concept generalization, 8 small-scale fine-grained datasets (Aircraft, Cars196, DTD, EuroSAT, Flowers, Pets, Food101, SUN397), and 2 long-tail datasets (iNaturalist-2018 and iNaturalist-2019).

  2. Projector heads are discarded after distillation and linear probes are learned over the encoder outputs \(\boldsymbol{z}\); a minimal probing sketch is given after these notes.

  3. We use the dBOT model fine-tuned for ImageNet-1K classification.
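
Note 2 describes the evaluation protocol: the distillation projectors are thrown away and only a linear classifier is trained on top of the frozen encoder features \(\boldsymbol{z}\). The snippet below is a generic linear-probing sketch of that protocol, not the authors’ released code; the function name, the hyperparameter defaults, and the assumption that loader yields (images, labels) batches are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_probe(encoder, loader, feat_dim, num_classes, epochs=10, lr=1e-3):
    """Train a linear classifier on frozen encoder outputs z (projectors discarded)."""
    encoder.eval()                                # encoder stays frozen
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                z = encoder(images)               # frozen features, no gradients
            loss = F.cross_entropy(probe(z), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```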

References

  1. Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., Dai, Z.: Variational information distillation for knowledge transfer. In: Proceedings of CVPR (2019)
  2. Asif, U., Tang, J., Harrer, S.: Ensemble knowledge distillation for learning improved and efficient networks. In: Proceedings of ECAI (2020)
  3. Ba, J., Frey, B.: Adaptive dropout for training deep neural networks. In: Proceedings of NeurIPS (2013)
  4. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
  5. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of SIGKDD (2006)
  6. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of ICCV (2021)
  7. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of ICML (2020)
  8. Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of CVPR (2021)
  9. Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In: Proceedings of ICML (2018)
  10. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of CVPR (2014)
  11. Clark, K., Luong, M.T., Khandelwal, U., Manning, C.D., Le, Q.V.: BAM! Born-again multi-task networks for natural language understanding. In: ACL (2019)
  12. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of ICLR (2021)
  13. Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., Ramabhadran, B.: Efficient knowledge distillation from an ensemble of teachers. In: Interspeech (2017)
  14. Ghiasi, G., Zoph, B., Cubuk, E.D., Le, Q.V., Lin, T.Y.: Multi-task self-training for learning general representations. In: Proceedings of CVPR (2021)
  15. Hao, Z., et al.: Learning efficient vision transformers via fine-grained manifold distillation. In: Proceedings of NeurIPS (2022)
  16. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of CVPR (2022)
  17. Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. JSTAEORS (2019)
  18. Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: Proceedings of ICCV (2021)
  19. Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: Proceedings of ICCV (2019)
  20. Heo, B., Lee, M., Yun, S., Choi, J.Y.: Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In: Proceedings of AAAI (2019)
  21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: Proceedings of NeurIPS-W (2014)
  22. Hu, H., Dey, D., Hebert, M., Bagnell, J.A.: Learning anytime predictions in neural networks via adaptive loss balancing. In: Proceedings of AAAI (2019)
  23. Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
  24. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of CVPR (2018)
  25. Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
  26. Krause, J., Deng, J., Stark, M., Li, F.F.: Collecting a large-scale dataset of fine-grained cars. In: Proceedings of CVPR-W (2013)
  27. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NeurIPS (2012)
  28. Landgraf, S., Hillemann, M., Kapler, T., Ulrich, M.: Efficient multi-task uncertainties for joint semantic segmentation and monocular depth estimation. arXiv:2402.10580 (2024)
  29. Liu, X., Zhou, J., Kong, T., Lin, X., Ji, R.: Exploring target representations for masked autoencoders. In: Proceedings of ICLR (2022)
  30. Liu, Y., Zhang, W., Wang, J.: Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing (2020)
  31. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv:1306.5151 (2013)
  32. Marrie, J., Arbel, M., Mairal, J., Larlus, D.: On good practices for task-specific distillation of large pretrained models. arXiv:2402.11305 (2024)
  33. Matena, M.S., Raffel, C.A.: Merging models with Fisher-weighted averaging. In: Proceedings of NeurIPS (2022)
  34. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of ICVGIP (2008)
  35. Oquab, M., et al.: DINOv2: learning robust visual features without supervision. TMLR (2024)
  36. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: Proceedings of CVPR (2012)
  37. Peng, Z., Dong, L., Bao, H., Wei, F., Ye, Q.: A unified view of masked image modeling. TMLR (2023)
  38. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of ICML (2021)
  39. Ramé, A., Ahuja, K., Zhang, J., Cord, M., Bottou, L., Lopez-Paz, D.: Model ratatouille: recycling diverse models for out-of-distribution generalization. In: Proceedings of ICML (2023)
  40. Ramé, A., Kirchmeyer, M., Rahier, T., Rakotomamonjy, A., Gallinari, P., Cord, M.: Diverse weight averaging for out-of-distribution generalization. In: Proceedings of NeurIPS (2022)
  41. Ranzinger, M., Heinrich, G., Kautz, J., Molchanov, P.: AM-RADIO: agglomerative model–reduce all domains into one. In: Proceedings of CVPR (2024)
  42. Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: Proceedings of ICML (2019)
  43. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. In: Proceedings of ICLR (2015)
  44. Roth, K., Milbich, T., Ommer, B., Cohen, J.P., Ghassemi, M.: Simultaneous similarity-based self-distillation for deep metric learning. In: Proceedings of ICML (2021)
  45. Roth, K., Thede, L., Koepke, A.S., Vinyals, O., Henaff, O.J., Akata, Z.: Fantastic gains and where to find them: on the existence and prospect of general knowledge transfer between any pretrained model. In: Proceedings of ICLR (2024)
  46. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015)
  47. Sariyildiz, M.B., Alahari, K., Larlus, D., Kalantidis, Y.: Fake it till you make it: learning transferable representations from synthetic ImageNet clones. In: Proceedings of CVPR (2023)
  48. Sariyildiz, M.B., Kalantidis, Y., Alahari, K., Larlus, D.: No reason for no supervision: improved generalization in supervised models. In: Proceedings of ICLR (2023)
  49. Sariyildiz, M.B., Kalantidis, Y., Larlus, D., Alahari, K.: Concept generalization in visual representation learning. In: Proceedings of ICCV (2021)
  50. Shi, B., et al.: Hybrid distillation: connecting masked autoencoders with contrastive learners. In: Proceedings of ICLR (2024)
  51. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
  52. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1) (2014)
  53. Stoica, G., Bolya, D., Bjorner, J., Ramesh, P., Hearn, T., Hoffman, J.: ZipIt! Merging models from different tasks without training. In: Proceedings of ICLR (2024)
  54. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: Proceedings of ICLR (2020)
  55. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Proceedings of ICML (2021)
  56. Touvron, H., Cord, M., Jégou, H.: DeiT III: revenge of the ViT. In: Proceedings of ECCV (2022). https://doi.org/10.1007/978-3-031-20053-3_30
  57. Van Horn, G., et al.: The iNaturalist species classification and detection dataset. In: Proceedings of CVPR (2018)
  58. Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: Proceedings of NeurIPS (2019)
  59. Wang, H., et al.: SAM-CLIP: merging vision foundation models towards semantic and spatial understanding. In: Proceedings of CVPR-W (2023)
  60. Wang, Y., et al.: Revisiting the transferability of supervised pretraining: an MLP perspective. In: Proceedings of CVPR (2022)
  61. Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: MVP: multimodality-guided visual pre-training. In: Proceedings of ECCV (2022)
  62. Wortsman, M., et al.: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: Proceedings of ICML (2022)
  63. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: Proceedings of CVPR (2010)
  64. Xu, H., et al.: Demystifying CLIP data. In: Proceedings of ICLR (2024)
  65. Yao, Y., Desai, N., Palaniswami, M.: MOMA: distill from self-supervised teachers. arXiv:2302.02089 (2023)
  66. Ye, P., et al.: Merging vision transformers from different tasks and domains. arXiv:2312.16240 (2023)
  67. Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In: Proceedings of CVPR (2017)
  68. Yin, D., Han, X., Li, B., Feng, H., Bai, J.: Parameter-efficient is not sufficient: exploring parameter, memory, and time efficient adapter tuning for dense predictions. arXiv:2306.09729 (2023)
  69. You, S., Xu, C., Xu, C., Tao, D.: Learning from multiple teacher networks. In: Proceedings of SIGKDD (2017)
  70. Ypsilantis, N.A., Chen, K., Araujo, A., Chum, O.: UDON: universal dynamic online distillation for generic image representations. arXiv:2406.08332 (2024)
  71. Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K., Loy, C.C.: Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively. In: Proceedings of ECCV (2024)
  72. Yun, H., Cho, H.: Achievement-based training progress balancing for multi-task learning. In: Proceedings of ICCV (2023)
  73. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: Proceedings of ICLR (2017)
  74. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. IJCV (2019)
  75. Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. In: Proceedings of ICLR (2022)

Acknowledgements

The authors would like to sincerely thank Myung-Ho Ju, Florent Perronnin, Rafael Sampaio de Rezende, Vassilina Nikoulina and Jean-Marc Andreoli for inspiring discussions and many thoughtful comments.

Author information

Corresponding author

Correspondence to Mert Bülent Sarıyıldız.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 677 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sarıyıldız, M.B., Weinzaepfel, P., Lucas, T., Larlus, D., Kalantidis, Y. (2025). UNIC: Universal Classification Models via Multi-teacher Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15062. Springer, Cham. https://doi.org/10.1007/978-3-031-73235-5_20

  • DOI: https://doi.org/10.1007/978-3-031-73235-5_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73234-8

  • Online ISBN: 978-3-031-73235-5

  • eBook Packages: Computer Science, Computer Science (R0)
