LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs

The Journal of Supercomputing

Abstract

Channel pruning has recently become a widely used model compression method. However, most existing channel pruning methods aim only to reduce model size, measured for example by the number of parameters or FLOPs, and a smaller model does not necessarily translate into better inference performance. To address this problem, this paper proposes a latency-optimized channel pruning method for CNN inference acceleration on GPU platforms, built on three components: latency stair-step discrimination, two-stage benefit assessment, and latency-sharing channel pruning. Compared with recent state-of-the-art model compression methods, it achieves significant improvements in inference performance at comparable compression rates and model accuracy. The contributions of this paper are as follows. First, a three-point latency stair-step discrimination method is proposed to determine the candidate prunable coordinates that give the best latency performance on the current hardware. Second, a two-stage benefit assessment method based on interlayer dependencies is proposed to determine the optimal channel pruning rate for each layer in the network. Finally, a latency-sharing channel pruning framework is proposed to accelerate the model pruning adaptation process. The proposed method significantly reduces model inference latency on multiple types of GPU platforms. To verify its effectiveness, we evaluate the algorithm on three general-purpose GPU platforms and two embedded GPU platforms. The experimental results show that for recent state-of-the-art CNNs, the proposed method achieves a 22.0–6.6% latency reduction, a 1.3–3.0× inference performance improvement, and a 1.2–4.3× pruning adaptation speedup while maintaining high model accuracy.
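The abstract reports results both as a latency reduction (a percentage) and as a performance improvement (a factor). When both describe the same per-inference latency, they are related by speedup = 1 / (1 − reduction): a 22% reduction is about a 1.28× speedup, and a 3.0× speedup is about a 67% reduction. The Python sketch below shows that conversion together with a toy detector for the latency "stair steps" that motivate latency-aware pruning: on many GPUs, latency stays flat over a range of channel counts and then jumps, so the largest channel count on each step is a natural pruning candidate. The three-point rule used here (comparing channel counts c−1, c and c+1), the function names, the 5% jump threshold and the 32-channel synthetic profile are all illustrative assumptions, not the paper's actual discrimination method or measured data.

```python
import math
from typing import Dict, List


def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction r (0.25 means 25%) into a speedup factor.

    Latency drops from T to (1 - r) * T, so the speedup is T / ((1 - r) * T) = 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    return 1.0 / (1.0 - reduction)


def stair_step_edges(latency_ms: Dict[int, float], rel_jump: float = 0.05) -> List[int]:
    """Return channel counts at the right edge of a latency stair step.

    Hypothetical three-point rule (an illustrative assumption, not the paper's rule):
    channel count c is an edge when c-1 and c sit on the same step (latency differs
    by at most rel_jump * latency(c)) and c+1 jumps up by at least rel_jump * latency(c).
    Counts without both neighbours in the profile (e.g. the unpruned width) are skipped.
    """
    edges: List[int] = []
    for c in sorted(latency_ms):
        prev_lat = latency_ms.get(c - 1)
        next_lat = latency_ms.get(c + 1)
        if prev_lat is None or next_lat is None:
            continue
        cur = latency_ms[c]
        same_step = abs(cur - prev_lat) <= rel_jump * cur
        jumps_up = next_lat - cur >= rel_jump * cur
        if same_step and jumps_up:
            edges.append(c)
    return edges


if __name__ == "__main__":
    # Synthetic profile (assumption): latency rises by 0.5 ms every 32 channels,
    # mimicking kernels that process output channels in tiles of 32.
    profile = {c: 0.5 * math.ceil(c / 32) for c in range(1, 129)}
    print(stair_step_edges(profile))  # [32, 64, 96]
    print(round(speedup_from_latency_reduction(0.22), 2))  # 1.28
```

In practice such a profile would come from timing each candidate layer width on the target GPU; here it is synthetic, so the detected edges only demonstrate the rule.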

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Data availability

The public datasets CIFAR-10/100 [44] and Tiny-ImageNet [45] used in this study are available at https://www.cs.toronto.edu/~kriz/cifar.html and https://tiny-imagenet.herokuapp.com/, respectively.

References

  1. Wu X, Sahoo D, Hoi SCH (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64

  2. Bell P, Fainberg J, Klejch O et al (2020) Adaptation algorithms for neural network-based speech recognition: an overview. IEEE Open J Signal Process 2:33–66

  3. Minaee S, Boykov YY, Porikli F et al (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542

  4. Hu L, Zhou X, Zhang X et al (2021) A review on key challenges in intelligent vehicles: safety and driver-oriented features. IET Intel Transport Syst 15(9):1093–1105

  5. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90

  6. Tan M, Le Q (2021) EfficientNetV2: smaller models and faster training. In: International Conference on Machine Learning. PMLR, pp 10096–10106

  7. Zhuang B, Tan M, Liu J et al (2021) Effective training of convolutional neural networks with low-bitwidth weights and activations. IEEE Trans Pattern Anal Mach Intell 44(10):6140–6152

  8. Yang C, Xie L, Su C et al (2019) Snapshot distillation: teacher-student optimization in one generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2859–2868

  9. Lin M, Ji R, Wang Y et al (2020) HRank: filter pruning using high-rank feature map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1529–1538

  10. Lin M, Ji R, Zhang Y et al (2021) Channel pruning via automatic structure search. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp 673–679

  11. Tu CH, Lee JH, Chan YM et al (2020) Pruning depthwise separable convolutions for MobileNet compression. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, pp 1–8

  12. Lubana ES, Dick R (2020) A gradient flow framework for analyzing network pruning. In: International Conference on Learning Representations

  13. Li Y, Gu S, Mayer C et al (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8018–8027

  14. Radu V, Kaszyk K, Wen Y et al (2019) Performance aware convolutional neural network channel pruning for embedded GPUs. In: 2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE, pp 24–34

  15. Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations (ICLR 2017), Conference Track Proceedings

  16. Wang C, Zhang G, Grosse R (2020) Picking winning tickets before training by preserving gradient flow. In: International Conference on Learning Representations (ICLR 2020), Conference Track Proceedings

  17. Yu J, Huang T (2019) AutoSlim: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728

  18. Li B, Wu B, Su J et al (2020) EagleEye: fast sub-net evaluation for efficient neural network pruning. In: European Conference on Computer Vision. Springer, Cham, pp 639–654

  19. Wu YC, Liu CT, Chen BY et al (2020) Constraint-aware importance estimation for global filter pruning under multiple resource constraints. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 686–687

  20. Tan M, Chen B, Pang R et al (2019) MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2820–2828

  21. Liu J, Sun J, Xu Z et al (2021) Latency-aware automatic CNN channel pruning with GPU runtime analysis. BenchCouncil Trans Benchmarks Stand Eval 1(1):100009

  22. Dong JD, Cheng AC, Juan DC et al (2018) DPP-Net: device-aware progressive search for Pareto-optimal neural architectures. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 517–531

  23. Dai X, Zhang P, Wu B et al (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11398–11407

  24. Wu B, Dai X, Zhang P et al (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10734–10742

  25. Chen C, Tung F, Vedula N et al (2018) Constraint-aware deep neural network compression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 400–415

  26. Yang TJ, Howard A, Chen B et al (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 285–300

  27. Denton EL, Zaremba W, Bruna J et al (2014) Exploiting linear structure within convolutional networks for efficient evaluation. Advances in Neural Information Processing Systems 27

  28. Ba J, Caruana R (2014) Do deep nets really need to be deep? Advances in Neural Information Processing Systems 27

  29. Li H, Kadav A, Durdanovic I et al (2017) Pruning filters for efficient ConvNets. In: International Conference on Learning Representations

  30. Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations

  31. Chen Z, Chen Z, Lin J et al (2020) Deep neural network acceleration based on low-rank approximated channel pruning. IEEE Trans Circuits Syst I Regul Pap 67(4):1232–1244

  32. Liu Z, Li J, Shen Z et al (2017) Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2736–2744

  33. Yu R, Li A, Chen CF et al (2018) NISP: pruning networks using neuron importance score propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9194–9203

  34. He Y, Liu P, Wang Z et al (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4340–4349

  35. Wen W, Wu C, Wang Y et al (2016) Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems 29

  36. Louizos C, Welling M, Kingma DP (2018) Learning sparse neural networks through L0 regularization. In: International Conference on Learning Representations

  37. Gamanayake C, Jayasinghe L, Ng BKK et al (2020) Cluster pruning: an efficient filter pruning method for edge AI vision applications. IEEE J Sel Top Signal Process 14(4):802–816

  38. Yu F, Xu Z, Shen T et al (2020) Towards latency-aware DNN optimization with GPU runtime analysis and tail effect elimination. arXiv preprint arXiv:2011.03897

  39. Shen M, Yin H, Molchanov P et al (2021) HALP: hardware-aware latency pruning. arXiv preprint arXiv:2110.10811

  40. Yu F, Han C, Wang P et al (2021) HFP: hardware-aware filter pruning for deep convolutional neural networks acceleration. In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, pp 255–262

  41. Paszke A, Gross S et al (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32

  42. Li G, Ma X, Wang X et al (2022) Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning. J Syst Archit 124:102431

  43. Zhu L (2018) THOP: PyTorch-OpCounter. https://pypi.org/project/thop/

  44. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images

  45. Le Y, Yang X (2015) Tiny ImageNet visual recognition challenge. CS 231N 7(7):3

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments. Parts of this work have been accepted for publication at the ISPA 2022 international conference. This work was partially supported by the National Key Research and Development Program of China (No. 2021ZD0110202).

Author information

Contributions

YZ wrote the main part of the manuscript. HJ, YZ, and RZ assisted with the experiments. YC, CZ, and WW assisted with figure preparation. DD and XL assisted with proofreading the manuscript.

Corresponding authors

Correspondence to Yonghua Zhang or Hongxu Jiang.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Jiang, H., Zhu, Y. et al. LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs. J Supercomput 79, 14313–14341 (2023). https://doi.org/10.1007/s11227-023-05212-4

  • DOI: https://doi.org/10.1007/s11227-023-05212-4