Abstract
Channel pruning has recently become a widely used model compression method. However, most existing channel pruning methods prune only to reduce nominal model size, such as the number of parameters or FLOPs, so the reduction in model size does not reliably translate into faster inference. To address this problem, this paper proposes a latency-optimized channel pruning method for CNN inference acceleration on GPU platforms, built on latency stair-step discrimination, two-stage benefit assessment, and latency-sharing channel pruning. Compared with recent state-of-the-art model compression methods, it achieves significant improvements in inference performance at comparable compression rates and model accuracy. The contributions of this paper are as follows. First, a three-point latency stair-step discrimination method is proposed to determine the candidate prunable coordinates with the best latency performance on the current hardware. Second, a two-stage benefit assessment method based on interlayer dependencies is proposed to determine the optimal channel pruning rate for each layer in the network. Finally, a latency-sharing channel pruning framework is proposed to accelerate the model pruning adaptation process. The proposed method significantly reduces model inference latency on multiple types of GPU platforms. To verify its effectiveness, we evaluate the algorithm on three general-purpose GPU platforms and two embedded GPU platforms. The experimental results show that, for recent state-of-the-art CNNs, the proposed method achieves a 22.0–66.6% latency reduction, a 1.3–3.0× inference performance improvement, and a 1.2–4.3× pruning adaptation speedup while maintaining high model accuracy.
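To make the stair-step behavior behind the first contribution concrete, the following minimal sketch (our illustration, not the authors' implementation) profiles a convolution layer's GPU latency over a sweep of output-channel counts and keeps the counts that sit at the end of each flat latency segment. The layer shape, the 8-channel step size, and the 15% jump threshold are illustrative assumptions; the paper's three-point discrimination method determines such coordinates without an exhaustive sweep.

    import torch
    import torch.nn as nn

    def measure_latency_ms(layer, x, warmup=10, runs=50):
        # Average GPU latency in milliseconds, timed with CUDA events so
        # that asynchronous kernel launches are measured correctly.
        for _ in range(warmup):
            layer(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            layer(x)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / runs

    def stair_step_candidates(in_ch=256, max_out=256, hw=56, step=8, jump=1.15):
        # On GPUs, latency is roughly piecewise constant in the channel count:
        # adding channels is nearly free until another wave of thread blocks
        # is required. The last channel count on each flat segment is thus a
        # good candidate pruning coordinate (most channels per unit latency).
        x = torch.randn(1, in_ch, hw, hw, device="cuda")
        curve = []
        with torch.no_grad():
            for c in range(step, max_out + 1, step):
                conv = nn.Conv2d(in_ch, c, 3, padding=1).cuda().eval()
                curve.append((c, measure_latency_ms(conv, x)))
        # Keep counts sitting right before a latency jump of at least 15%.
        return [curve[i][0] for i in range(len(curve) - 1)
                if curve[i + 1][1] / curve[i][1] >= jump]

    if torch.cuda.is_available():
        print(stair_step_candidates())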
Data availability
The public datasets of CIFAR-10/100 [44] and Tiny-ImageNet [45] used in the current study are available at https://www.cs.toronto.edu/~kriz/cifar.html and https://tiny-imagenet.herokuapp.com/, respectively.
References
Wu X, Sahoo D, Hoi SCH (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64
Bell P, Fainberg J, Klejch O et al (2020) Adaptation algorithms for neural network-based speech recognition: an overview. IEEE Open J Signal Process 2:33–66
Minaee S, Boykov YY, Porikli F et al (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542
Hu L, Zhou X, Zhang X et al (2021) A review on key challenges in intelligent vehicles: safety and driver-oriented features. IET Intel Transport Syst 15(9):1093–1105
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Tan M, Le Q (2021) EfficientNetV2: smaller models and faster training. In: International Conference on Machine Learning. PMLR, pp 10096–10106
Zhuang B, Tan M, Liu J et al (2021) Effective training of convolutional neural networks with low-bitwidth weights and activations. IEEE Trans Pattern Anal Mach Intell 44(10):6140–6152
Yang C, Xie L, Su C et al (2019) Snapshot distillation: teacher-student optimization in one generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2859–2868
Lin M, Ji R, Wang Y et al (2020) HRank: filter pruning using high-rank feature map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1529–1538
Lin M, Ji R, Zhang Y et al (2021) Channel pruning via automatic structure search. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp 673–679
Tu CH, Lee JH, Chan YM et al (2020) Pruning depthwise separable convolutions for MobileNet compression. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, pp 1–8
Lubana ES, Dick R (2020) A gradient flow framework for analyzing network pruning. In: International Conference on Learning Representations
Li Y, Gu S, Mayer C et al (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8018–8027
Radu V, Kaszyk K, Wen Y et al (2019) Performance aware convolutional neural network channel pruning for embedded GPUs. In: 2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE, pp 24–34
Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations
Wang C, Zhang G, Grosse R (2020) Picking winning tickets before training by preserving gradient flow. In: International Conference on Learning Representations
Yu J, Huang T (2019) AutoSlim: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728
Li B, Wu B, Su J et al (2020) EagleEye: fast sub-net evaluation for efficient neural network pruning. In: European Conference on Computer Vision. Springer, Cham, pp 639–654
Wu YC, Liu CT, Chen BY et al (2020) Constraint-aware importance estimation for global filter pruning under multiple resource constraints. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 686–687
Tan M, Chen B, Pang R et al (2019) MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2820–2828
Liu J, Sun J, Xu Z et al (2021) Latency-aware automatic CNN channel pruning with GPU runtime analysis. BenchCouncil Trans Benchmarks, Stand Eval 1(1):100009
Dong JD, Cheng AC, Juan DC et al (2018) DPP-Net: device-aware progressive search for Pareto-optimal neural architectures. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 517–531
Dai X, Zhang P, Wu B et al (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11398–11407
Wu B, Dai X, Zhang P et al (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10734–10742
Chen C, Tung F, Vedula N et al (2018) Constraint-aware deep neural network compression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 400–415
Yang TJ, Howard A, Chen B et al (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 285–300
Denton EL, Zaremba W, Bruna J et al (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems, vol 27
Ba J, Caruana R (2014) Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, vol 27
Li H, Kadav A, Durdanovic I et al (2017) Pruning filters for efficient ConvNets. In: International Conference on Learning Representations
Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations
Chen Z, Chen Z, Lin J et al (2020) Deep neural network acceleration based on low-rank approximated channel pruning. IEEE Trans Circuits Syst I Regul Pap 67(4):1232–1244
Liu Z, Li J, Shen Z et al (2017) Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2736–2744
Yu R, Li A, Chen CF et al (2018) NISP: pruning networks using neuron importance score propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9194–9203
He Y, Liu P, Wang Z et al (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4340–4349
Wen W, Wu C, Wang Y et al (2016) Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, vol 29
Louizos C, Welling M, Kingma DP (2018) Learning sparse neural networks through L0 regularization. In: International Conference on Learning Representations
Gamanayake C, Jayasinghe L, Ng BKK et al (2020) Cluster pruning: an efficient filter pruning method for edge AI vision applications. IEEE J Sel Top Signal Process 14(4):802–816
Yu F, Xu Z, Shen T et al (2020) Towards latency-aware DNN optimization with GPU runtime analysis and tail effect elimination. arXiv preprint arXiv:2011.03897
Shen M, Yin H, Molchanov P et al (2021) HALP: hardware-aware latency pruning. arXiv preprint arXiv:2110.10811
Yu F, Han C, Wang P et al (2021) HFP: hardware-aware filter pruning for deep convolutional neural networks acceleration. In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, pp 255–262
Paszke A, Gross S et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol 32
Li G, Ma X, Wang X et al (2022) Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning. J Syst Archit 124:102431
Zhu L (2018) THOP: PyTorch-OpCounter. https://pypi.org/project/thop/
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images
Le Y, Yang X (2015) Tiny ImageNet visual recognition challenge. CS 231N 7(7):3
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments. A preliminary version of parts of this work was accepted for publication at the ISPA 2022 international conference. This work was partially supported by the National Key Research and Development Program of China (No. 2021ZD0110202).
Author information
Contributions
YZ wrote the main part of the manuscript. HJ, YZ, and RZ assisted with the experiments. YC, CZ, and WW assisted with the figure preparation. DD and XL assisted with the proofreading of the manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Jiang, H., Zhu, Y. et al. LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs. J Supercomput 79, 14313–14341 (2023). https://doi.org/10.1007/s11227-023-05212-4