Abstract
Channel pruning has recently become a widely used model compression method. However, most existing channel pruning methods prune only to reduce nominal model size, such as the number of parameters or FLOPs, so the reduction in model size does not reliably translate into faster inference. To address this problem, this paper proposes a latency-optimized channel pruning method for CNN inference acceleration on GPU platforms, built on latency stair-step discrimination, two-stage benefit assessment, and latency-sharing channel pruning. Compared with recent state-of-the-art model compression methods, it achieves significant improvements in inference performance at comparable compression rates and model accuracy. The contributions of this paper are as follows. First, a three-point latency stair-step discrimination method is proposed to determine the candidate prunable coordinates with the best latency performance on the current hardware. Second, a two-stage benefit assessment method based on interlayer dependencies is proposed to determine the optimal channel pruning rate for each layer in the network. Finally, a latency-sharing channel pruning framework is proposed to accelerate the model pruning adaptation process. The proposed method significantly reduces model inference latency on multiple types of GPU platforms. To verify its effectiveness, we evaluate the algorithm on three general-purpose GPU platforms and two embedded GPU platforms. The experimental results show that, for recent state-of-the-art CNNs, the proposed method achieves a 22.0–66.6% latency reduction, a 1.3–3.0× inference performance improvement, and a 1.2–4.3× pruning adaptation speedup while maintaining high model accuracy.
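To make the stair-step behavior behind the first contribution concrete, the following minimal sketch (our illustration, not the authors' implementation) profiles a convolution layer's GPU latency over a sweep of output-channel counts and keeps the counts that sit at the end of each flat latency segment. The layer shape, the 8-channel step size, and the 15% jump threshold are illustrative assumptions; the paper's three-point discrimination method determines such coordinates without an exhaustive sweep.

    import torch
    import torch.nn as nn

    def measure_latency_ms(layer, x, warmup=10, runs=50):
        # Average GPU latency in milliseconds, timed with CUDA events so
        # that asynchronous kernel launches are measured correctly.
        for _ in range(warmup):
            layer(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            layer(x)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / runs

    def stair_step_candidates(in_ch=256, max_out=256, hw=56, step=8, jump=1.15):
        # On GPUs, latency is roughly piecewise constant in the channel count:
        # adding channels is nearly free until another wave of thread blocks
        # is required. The last channel count on each flat segment is thus a
        # good candidate pruning coordinate (most channels per unit latency).
        x = torch.randn(1, in_ch, hw, hw, device="cuda")
        curve = []
        with torch.no_grad():
            for c in range(step, max_out + 1, step):
                conv = nn.Conv2d(in_ch, c, 3, padding=1).cuda().eval()
                curve.append((c, measure_latency_ms(conv, x)))
        # Keep counts sitting right before a latency jump of at least 15%.
        return [curve[i][0] for i in range(len(curve) - 1)
                if curve[i + 1][1] / curve[i][1] >= jump]

    if torch.cuda.is_available():
        print(stair_step_candidates())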
Data availability
The public datasets of CIFAR-10/100 [44] and Tiny-ImageNet [45] used in the current study are available at https://www.cs.toronto.edu/~kriz/cifar.html and https://tiny-imagenet.herokuapp.com/, respectively.
References
Wu X, Sahoo D, Hoi SCH (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64
Bell P, Fainberg J, Klejch O et al (2020) Adaptation algorithms for neural network-based speech recognition: an overview. IEEE Open J Signal Process 2:33–66
Minaee S, Boykov YY, Porikli F et al (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542
Hu L, Zhou X, Zhang X et al (2021) A review on key challenges in intelligent vehicles: safety and driver-oriented features. IET Intel Transport Syst 15(9):1093–1105
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Tan M, Le Q (2021) EfficientNetV2: smaller models and faster training. In: International Conference on Machine Learning. PMLR, pp 10096–10106
Zhuang B, Tan M, Liu J et al (2021) Effective training of convolutional neural networks with low-bitwidth weights and activations. IEEE Trans Pattern Anal Mach Intell 44(10):6140–6152
Yang C, Xie L, Su C et al (2019) Snapshot distillation: teacher-student optimization in one generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2859–2868
Lin M, Ji R, Wang Y et al (2020) HRank: filter pruning using high-rank feature map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1529–1538
Lin M, Ji R, Zhang Y et al (2021) Channel pruning via automatic structure search. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp 673–679
Tu CH, Lee JH, Chan YM et al (2020) Pruning depthwise separable convolutions for MobileNet compression. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, pp 1–8
Lubana ES, Dick R (2020) A gradient flow framework for analyzing network pruning. In: International Conference on Learning Representations
Li Y, Gu S, Mayer C et al (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8018–8027
Radu V, Kaszyk K, Wen Y et al (2019) Performance aware convolutional neural network channel pruning for embedded GPUs. In: 2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE, pp 24–34
Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations
Wang C, Zhang G, Grosse R (2020) Picking winning tickets before training by preserving gradient flow. In: International Conference on Learning Representations
Yu J, Huang T (2019) AutoSlim: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728
Li B, Wu B, Su J et al (2020) EagleEye: fast sub-net evaluation for efficient neural network pruning. In: European Conference on Computer Vision. Springer, Cham, pp 639–654
Wu YC, Liu CT, Chen BY et al (2020) Constraint-aware importance estimation for global filter pruning under multiple resource constraints. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 686–687
Tan M, Chen B, Pang R et al (2019) MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2820–2828
Liu J, Sun J, Xu Z et al (2021) Latency-aware automatic CNN channel pruning with GPU runtime analysis. BenchCouncil Trans Benchmarks, Stand Eval 1(1):100009
Dong JD, Cheng AC, Juan DC et al (2018) DPP-Net: device-aware progressive search for Pareto-optimal neural architectures. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 517–531
Dai X, Zhang P, Wu B et al (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11398–11407
Wu B, Dai X, Zhang P et al (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10734–10742
Chen C, Tung F, Vedula N et al (2018) Constraint-aware deep neural network compression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 400–415
Yang TJ, Howard A, Chen B et al (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 285–300
Denton EL, Zaremba W, Bruna J et al (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems, vol 27
Ba J, Caruana R (2014) Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, vol 27
Li H, Kadav A, Durdanovic I et al (2017) Pruning filters for efficient ConvNets. In: International Conference on Learning Representations
Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations
Chen Z, Chen Z, Lin J et al (2020) Deep neural network acceleration based on low-rank approximated channel pruning. IEEE Trans Circuits Syst I Regul Pap 67(4):1232–1244
Liu Z, Li J, Shen Z et al (2017) Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2736–2744
Yu R, Li A, Chen CF et al (2018) NISP: pruning networks using neuron importance score propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9194–9203
He Y, Liu P, Wang Z et al (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4340–4349
Wen W, Wu C, Wang Y et al (2016) Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, vol 29
Louizos C, Welling M, Kingma DP (2018) Learning sparse neural networks through L0 regularization. In: International Conference on Learning Representations
Gamanayake C, Jayasinghe L, Ng BKK et al (2020) Cluster pruning: an efficient filter pruning method for edge AI vision applications. IEEE J Sel Top Signal Process 14(4):802–816
Yu F, Xu Z, Shen T et al (2020) Towards latency-aware DNN optimization with GPU runtime analysis and tail effect elimination. arXiv preprint arXiv:2011.03897
Shen M, Yin H, Molchanov P et al (2021) HALP: hardware-aware latency pruning. arXiv preprint arXiv:2110.10811
Yu F, Han C, Wang P et al (2021) HFP: hardware-aware filter pruning for deep convolutional neural networks acceleration. In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, pp 255–262
Paszke A, Gross S et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol 32
Li G, Ma X, Wang X et al (2022) Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning. J Syst Archit 124:102431
Zhu L (2018) THOP: PyTorch-OpCounter. https://pypi.org/project/thop/
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images
Le Y, Yang X (2015) Tiny ImageNet visual recognition challenge. CS 231N 7(7):3
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments. A preliminary version of parts of this work was accepted for publication at the ISPA 2022 international conference. This work was partially supported by the National Key Research and Development Program of China (No. 2021ZD0110202).
Author information
Contributions
YZ wrote the main part of the manuscript. HJ, YZ, and RZ assisted with the experiments. YC, CZ, and WW assisted with the figure preparation. DD and XL assisted with the proofreading of the manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Jiang, H., Zhu, Y. et al. LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs. J Supercomput 79, 14313–14341 (2023). https://doi.org/10.1007/s11227-023-05212-4