Abstract
This paper aims to accelerate video stream processing, such as object detection and semantic segmentation, by leveraging the temporal redundancies between video frames. Instead of propagating and warping features using motion alignment, such as optical flow, we propose a novel knowledge distillation scheme, coined Delta Distillation. In our proposal, the student learns the variations in the teacher’s intermediate features over time. We demonstrate that these temporal variations can be effectively distilled thanks to the temporal redundancies within video frames. During inference, teacher and student cooperate in providing predictions: the former provides initial representations extracted only on the key-frame, while the latter iteratively estimates and applies deltas for the successive frames. Moreover, we consider various design choices for learning optimal student architectures, including an end-to-end learnable architecture search. Through extensive experiments on a wide range of architectures, including the most efficient ones, we demonstrate that delta distillation sets a new state of the art in the accuracy vs. efficiency trade-off for semantic segmentation and object detection in videos. Finally, we show that, as a by-product, delta distillation improves the temporal consistency of the teacher model.
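The inference scheme described above can be illustrated with a small sketch: the teacher runs once on the key-frame, and the student then updates the features incrementally by adding estimated deltas. The following toy example is an assumption-laden illustration, not the paper's implementation: both `teacher_block` and `student_delta` stand in for the paper's convolutional blocks, and the student here is exact by construction (a trained student would only approximate the delta).

```python
import numpy as np

# Hypothetical linear stand-ins for the teacher block and the lightweight
# delta student; in the paper these are learned convolutional blocks.
rng = np.random.default_rng(0)
W_teacher = rng.standard_normal((8, 8))

def teacher_block(x):
    """Expensive per-frame features (run only on the key-frame at inference)."""
    return W_teacher @ x

def student_delta(x_t, x_prev):
    """Cheap student: estimates the change in teacher features between frames.
    For this linear toy teacher the delta is exact; a trained student would
    approximate it at a fraction of the teacher's cost."""
    return W_teacher @ (x_t - x_prev)

# A short "video": a key-frame followed by slowly varying frames,
# mimicking the temporal redundancy the method exploits.
frames = [rng.standard_normal(8)]
for _ in range(4):
    frames.append(frames[-1] + 0.05 * rng.standard_normal(8))

# Inference: teacher runs once on the key-frame; subsequent features are
# obtained by accumulating the student's estimated deltas.
z = teacher_block(frames[0])
outputs = [z]
for t in range(1, len(frames)):
    z = z + student_delta(frames[t], frames[t - 1])
    outputs.append(z)

# Since the toy student is exact, the propagated features match the
# features the teacher would have produced on every frame.
for t, z_t in enumerate(outputs):
    assert np.allclose(z_t, teacher_block(frames[t]))
```

During training, the student would be supervised to regress the teacher's actual feature differences, e.g. minimizing `||student_delta(x_t, x_prev) - (teacher(x_t) - teacher(x_prev))||` over consecutive frames.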
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Notes
1. As an example, transition layers in HRNets [44].
2. FLOPs denotes the number of multiply-add operations.
3. We limit our comparisons to efficient models with fewer than 100 GFLOPs.
References
Chai, Y.: Patchwork: a patch-wise attention network for efficient object detection and segmentation in video streams. In: ICCV (2019)
Chen, W., Gong, X., Liu, X., Zhang, Q., Li, Y., Wang, Z.: FasterSeg: searching for faster real-time semantic segmentation. In: ICLR (2020)
Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: CVPR (2020)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
Dai, X., et al.: General instance distillation for object detection. In: CVPR (2021)
Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. In: NeurIPS (2013)
Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. In: IJCV (2021)
Guo, Q., et al.: Online knowledge distillation via collaborative learning. In: CVPR (2020)
Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: ICML (2015)
Habibian, A., Abati, D., Cohen, T.S., Bejnordi, B.E.: Skip-convolutions for efficient video processing. In: CVPR (2021)
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: ICCV (2017)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Hong, Y., Pan, H., Sun, W., Jia, Y., et al.: Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085 (2021)
Hu, P., Caba, F., Wang, O., Lin, Z., Sclaroff, S., Perazzi, F.: Temporally distributed networks for fast video semantic segmentation. In: CVPR (2020)
Hu, P., et al.: Real-time semantic segmentation with fast attention. In: ICRA (2020)
Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: CVPR (2018)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: BMVC (2014)
Jain, S., Wang, X., Gonzalez, J.E.: Accel: a corrective fusion network for efficient semantic segmentation on video. In: CVPR (2019)
Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax. In: ICLR (2017)
Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342 (2018)
Lan, X., Zhu, X., Gong, S., et al.: Knowledge distillation by on-the-fly native ensemble. In: NeurIPS (2018)
Lei, C., Xing, Y., Chen, Q.: Blind video temporal consistency via deep video prior. In: NeurIPS (2020)
Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2017)
Li, Y., Shi, J., Lin, D.: Low-latency video semantic segmentation. In: CVPR (2018)
Liu, M., Zhu, M.: Mobile video object detection with temporally-aware feature maps. In: CVPR (2018)
Liu, M., Zhu, M., White, M., Li, Y., Kalenichenko, D.: Looking fast and slow: memory-guided mobile video object detection. arXiv preprint arXiv:1903.10172 (2019)
Liu, Y., Shen, C., Yu, C., Wang, J.: Efficient semantic video segmentation with per-frame inference. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 352–368. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_21
Maddison, C.J., Mnih, A., Teh, Y.W.: The concrete distribution: a continuous relaxation of discrete random variables. In: ICLR (2017)
Mao, H., Zhu, S., Han, S., Dally, W.J.: PatchNet-short-range template matching for efficient video processing. arXiv preprint arXiv:2103.07371 (2021)
Moons, B., et al.: Distilling optimal neural networks: rapid search in diverse spaces. In: ICCV (2021)
Nagel, M., van Baalen, M., Blankevoort, T., Welling, M.: Data-free quantization through weight equalization and bias correction. In: ICCV (2019)
Orsic, M., Kreso, I., Bevandic, P., Segvic, S.: In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In: CVPR (2019)
Rebol, M., Knöbelreiter, P.: Frame-to-frame consistent semantic segmentation. In: Joint Austrian Computer Vision And Robotics Workshop (ACVRW) (2020)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. (2017)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. In: ICLR (2015)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. In: IJCV (2015)
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR (2018)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: CVPR (2016)
Sibechi, R., Booij, O., Baka, N., Bloem, P.: Exploiting temporality for semi-supervised video segmentation. In: ICCV Workshops (2019)
Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tan, M., Pang, R., Le, Q.V.: EfficientDet: scalable and efficient object detection. In: CVPR (2020)
Tao, A., Sapra, K., Catanzaro, B.: Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821 (2020)
Wang, J., et al.: Deep high-resolution representation learning for visual recognition. TPAMI (2019)
Wang, T., Yuan, L., Zhang, X., Feng, J.: Distilling object detectors with fine-grained feature imitation. In: CVPR (2019)
Wang, Y., et al.: LEDNet: a lightweight encoder-decoder network for real-time semantic segmentation. In: ICIP (2019)
Wu, G., Gong, S.: Peer collaborative learning for online knowledge distillation. In: AAAI (2021)
Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., Sang, N.: BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. In: IJCV (2021)
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 334–349. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_20
Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. TPAMI (2016)
Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. In: CVPR (2018)
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 418–434. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_25
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
Zhu, X., Dai, J., Zhu, X., Wei, Y., Yuan, L.: Towards high performance video object detection for mobiles. arXiv preprint arXiv:1804.05830 (2018)
Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: ICCV (2017)
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: CVPR (2017)
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Habibian, A., Ben Yahia, H., Abati, D., Gavves, E., Porikli, F. (2022). Delta Distillation for Efficient Video Processing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_13