Abstract
In recent years, convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields with accuracy that was previously unattainable. However, this comes with extensive computational requirements that general-purpose central processing units (CPUs) cannot meet at the desired real-time performance. At the same time, field-programmable gate arrays (FPGAs) have seen a surge of interest for accelerating CNN inference because they enable custom designs with different levels of parallelism. Furthermore, FPGAs provide better performance per watt than other computing technologies such as graphics processing units (GPUs). The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs), each tailored to a subset of layers. However, the growing complexity of CNN architectures makes it increasingly challenging to partition the resources available on the target FPGA device for optimal performance, because the number of design variables that must be considered when implementing a Multi-CLP accelerator grows exponentially with the CNN's complexity. In this paper, we present a CNN accelerator and an accompanying automated design methodology that employs metaheuristics to partition the available FPGA resources into a Multi-CLP accelerator. Specifically, the proposed design tool adopts simulated annealing (SA) and tabu search (TS) algorithms to find the number of CLPs required and their respective configurations that achieve optimal performance on a given target FPGA device. The focus is on the key specifications and hardware resources, namely digital signal processors (DSPs), block random access memories (BRAMs), and off-chip memory bandwidth. Experimental results and comparisons using four well-known benchmark CNNs demonstrate that the proposed acceleration framework is both encouraging and promising. The SA-/TS-based Multi-CLP accelerator achieves 1.31×–2.37× higher throughput than state-of-the-art Single-/Multi-CLP approaches in accelerating the AlexNet, SqueezeNet 1.1, VGGNet, and GoogLeNet architectures on the Xilinx VC707 and VC709 FPGA boards.
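To make the search concrete, the sketch below illustrates, in broad strokes, how a simulated-annealing loop can explore Multi-CLP design points under a DSP budget. It is a minimal illustration only: the layer workloads, the DSP budget, the cost model in `estimate_cycles`, and the `perturb` move set are placeholders invented for this example, not the paper's actual performance model, which also accounts for BRAMs and off-chip memory bandwidth.

```python
import math
import random

# Toy inputs invented for this illustration; the paper's design tool uses the
# actual layer dimensions and a model of DSPs, BRAMs, and memory bandwidth.
LAYER_OPS = [210e6, 448e6, 299e6, 224e6, 150e6]   # MACs per conv layer (made up)
DSP_BUDGET = 2800                                  # e.g. a Virtex-7 class device


def estimate_cycles(assign, dsps):
    """Throughput proxy: CLPs run concurrently, so the slowest CLP bounds performance."""
    per_clp = [0.0] * len(dsps)
    for layer, clp in enumerate(assign):
        per_clp[clp] += LAYER_OPS[layer]
    return max(ops / d for ops, d in zip(per_clp, dsps))


def perturb(assign, dsps):
    """Random move: reassign one layer to another CLP, or shift DSPs between CLPs."""
    assign, dsps = list(assign), list(dsps)
    if random.random() < 0.5:
        assign[random.randrange(len(assign))] = random.randrange(len(dsps))
    else:
        src, dst = random.sample(range(len(dsps)), 2)
        shift = random.randint(1, dsps[src] // 4)
        if dsps[src] - shift >= 16:                # keep every CLP minimally sized
            dsps[src] -= shift
            dsps[dst] += shift
    return assign, dsps


def simulated_annealing(n_clps=2, t0=1e6, alpha=0.95, moves_per_t=100, t_min=1.0):
    assign = [random.randrange(n_clps) for _ in LAYER_OPS]
    dsps = [DSP_BUDGET // n_clps] * n_clps         # split the DSP budget evenly at first
    best = (assign, dsps)
    t = t0
    while t > t_min:
        for _ in range(moves_per_t):
            cand = perturb(assign, dsps)
            delta = estimate_cycles(*cand) - estimate_cycles(assign, dsps)
            # Metropolis criterion: always accept improvements,
            # accept worse moves with probability exp(-delta / t).
            if delta < 0 or random.random() < math.exp(-delta / t):
                assign, dsps = cand
                if estimate_cycles(assign, dsps) < estimate_cycles(*best):
                    best = (assign, dsps)
        t *= alpha                                 # geometric cooling schedule
    return best


if __name__ == "__main__":
    assign, dsps = simulated_annealing()
    print("layer-to-CLP assignment:", assign, "DSPs per CLP:", dsps)
```

A tabu-search version of the same loop would instead keep a short list of recently applied moves and forbid reversing them for a number of iterations, which is the general mechanism tabu search relies on.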
Data availability
The pre-trained models that support the findings of this study are taken from PyTorch Torchvision without any fine-tuning. The model architectures are available at https://pytorch.org/vision/0.8/models.html.
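As a small aid to reproducibility, the snippet below shows one way to obtain these pre-trained models with the Torchvision 0.8 API (hence the `pretrained=True` flag). The choice of VGG-16 as the VGGNet variant is an assumption of this example, not stated here.

```python
# Sketch: load the benchmark models from Torchvision 0.8 without fine-tuning.
# VGG-16 is assumed here as the VGGNet variant.
import torchvision.models as models

nets = {
    "AlexNet": models.alexnet(pretrained=True),
    "SqueezeNet 1.1": models.squeezenet1_1(pretrained=True),
    "VGGNet (VGG-16)": models.vgg16(pretrained=True),
    "GoogLeNet": models.googlenet(pretrained=True),
}

for name, net in nets.items():
    net.eval()                                    # inference only, no training
    params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {params / 1e6:.1f} M parameters")
```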
Acknowledgements
The authors would like to thank King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia, for its support.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest that could have appeared to influence the work reported in this manuscript.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sait, S.M., El-Maleh, A., Altakrouri, M. et al. Optimization of FPGA-based CNN accelerators using metaheuristics. J Supercomput 79, 4493–4533 (2023). https://doi.org/10.1007/s11227-022-04787-8