Optimization of FPGA-based CNN accelerators using metaheuristics

The Journal of Supercomputing

Abstract

In recent years, convolutional neural networks (CNNs) have solved problems in many fields with accuracy that was not previously possible. This accuracy comes with extensive computational requirements, which leave general-purpose central processing units (CPUs) unable to deliver the desired real-time performance. At the same time, field-programmable gate arrays (FPGAs) have seen a surge of interest for accelerating CNN inference, owing to their ability to support custom designs with different levels of parallelism. Furthermore, FPGAs provide better performance per watt than other computing technologies such as graphics processing units (GPUs). The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs), each tailored to a subset of layers. However, the growing complexity of CNN architectures makes it more challenging to partition the resources available on the target FPGA device so as to deliver optimal performance, because the number of design variables that must be considered when implementing a Multi-CLP accelerator grows exponentially with the CNN's complexity. In this paper, we present a CNN accelerator and an accompanying automated design methodology that employs metaheuristics to partition the available FPGA resources for a Multi-CLP accelerator. Specifically, the proposed design tool adopts simulated annealing (SA) and tabu search (TS) algorithms to find the number of CLPs required and their respective configurations to achieve optimal performance on a given target FPGA device. The focus is on the key specifications and hardware resources, including digital signal processors (DSPs), block random access memories (BRAMs), and off-chip memory bandwidth. Experimental results and comparisons on four well-known benchmark CNNs show that the proposed acceleration framework is promising: the SA-/TS-based Multi-CLP achieves 1.31× to 2.37× higher throughput than the state-of-the-art Single-/Multi-CLP approaches in accelerating the AlexNet, SqueezeNet 1.1, VGGNet, and GoogLeNet architectures on the Xilinx VC707 and VC709 FPGA boards.
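To make the idea of searching over CLP configurations concrete, the following is a minimal sketch of simulated annealing on a toy partitioning problem. It is not the authors' design tool. The layer shapes, the DSP budget, the fixed number of CLPs, the ceiling-based cycle model, and all names (`LAYERS`, `clp_cycles`, `anneal`, and so on) are assumptions made for illustration; BRAM and off-chip bandwidth constraints, which the abstract lists as part of the real problem, are left out.

```python
import math
import random

# Hypothetical layers: (input channels N, output channels M, work per channel pair).
LAYERS = [(3, 64, 100), (64, 16, 50), (16, 64, 50), (64, 128, 25), (128, 32, 25)]
DSP_BUDGET = 256  # hypothetical; one DSP per multiply-accumulate unit is assumed
NUM_CLPS = 2      # fixed here for brevity; the abstract's tool also searches this


def clp_cycles(layer, tile):
    """Toy cycle count: a CLP handles tn input and tm output channels per step."""
    n, m, work = layer
    tn, tm = tile
    return math.ceil(n / tn) * math.ceil(m / tm) * work


def cost(assign, tiles):
    """Cycles per image with CLPs running concurrently; inf if over the DSP budget."""
    if sum(tn * tm for tn, tm in tiles) > DSP_BUDGET:
        return math.inf
    totals = [0] * len(tiles)
    for layer, clp in zip(LAYERS, assign):
        totals[clp] += clp_cycles(layer, tiles[clp])
    return max(totals)  # the slowest CLP sets the throughput


def neighbor(assign, tiles, rng):
    """Perturb one decision: move a layer to a random CLP, or resize one tile dimension."""
    assign, tiles = list(assign), list(tiles)
    if rng.random() < 0.5:
        assign[rng.randrange(len(assign))] = rng.randrange(len(tiles))
    else:
        c = rng.randrange(len(tiles))
        tn, tm = tiles[c]
        step = rng.choice((-1, 1))
        if rng.random() < 0.5:
            tn = max(1, tn + step)
        else:
            tm = max(1, tm + step)
        tiles[c] = (tn, tm)
    return assign, tiles


def anneal(seed=0, t0=1.0, alpha=0.9, moves_per_temp=200, t_min=1e-3):
    """Return (initial cost, (best cost, best assignment, best tiles))."""
    rng = random.Random(seed)
    assign = [i % NUM_CLPS for i in range(len(LAYERS))]
    tiles = [(4, 4)] * NUM_CLPS
    cur = cost(assign, tiles)
    start = cur
    best = (cur, assign, tiles)
    t = t0
    while t > t_min:
        for _ in range(moves_per_temp):
            cand_assign, cand_tiles = neighbor(assign, tiles, rng)
            c = cost(cand_assign, cand_tiles)
            # Metropolis criterion on the relative cost change; infeasible
            # candidates have infinite cost and are never accepted.
            if c <= cur or rng.random() < math.exp((cur - c) / (cur * t)):
                assign, tiles, cur = cand_assign, cand_tiles, c
                if cur < best[0]:
                    best = (cur, assign, tiles)
        t *= alpha  # geometric cooling schedule
    return start, best


if __name__ == "__main__":
    start, (best_cost, best_assign, best_tiles) = anneal()
    print("initial cycles:", start)
    print("best cycles:", best_cost, "assignment:", best_assign, "tiles:", best_tiles)
```

The ceiling terms in `clp_cycles` are what make a single shared configuration wasteful in this toy model: a tile size that divides one layer's channel counts evenly may leave units idle on another layer, so giving subsets of layers their own CLP can lower the bottleneck cycle count. A tabu search variant would keep the same `cost` and `neighbor` functions and replace the probabilistic acceptance with a best-admissible-move rule plus a short memory of recent moves.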

Data availability

The pre-trained models that support the findings of this study are taken from PyTorch Torchvision without any fine-tuning. The model architectures are available at https://pytorch.org/vision/0.8/models.html.

Acknowledgements

The authors would like to thank King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia, for its support.

Author information

Corresponding author

Correspondence to Sadiq M. Sait.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest that could have appeared to influence the work reported in this manuscript.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sait, S.M., El-Maleh, A., Altakrouri, M. et al. Optimization of FPGA-based CNN accelerators using metaheuristics. J Supercomput 79, 4493–4533 (2023). https://doi.org/10.1007/s11227-022-04787-8

  • DOI: https://doi.org/10.1007/s11227-022-04787-8
