Abstract
Deep learning (DL) frameworks rely on parallel computing libraries to accelerate model training. The underlying computations of DL operators typically include essential primitives such as reduction and prefix scan, whose efficiency can be greatly improved on parallel acceleration devices. However, acceleration of these computations is mainly supported by collective primitive libraries such as NVIDIA CUB and AMD hipCUB, which are available only on vendor-specific hardware accelerators because of the fragmented software ecosystems of different vendors. To address this issue, we propose oclCUB, an OpenCL parallel computing library that runs on different heterogeneous platforms. oclCUB abstracts the OpenCL execution environment, implements reusable common underlying computations of DL, and provides two types of interfaces that target the heterogeneous acceleration patterns of DL operators, enabling users to design and optimize DL operators efficiently. We evaluate oclCUB on a range of hardware accelerators: NVIDIA Tesla V100s with OpenCL 1.2, AMD Radeon Pro V520 with OpenCL 2.0, MT-3000 with MOCL 3, and Kunpeng 920 with POCL 1.6. Our experiments show that operators built on oclCUB produce correct results on all of these platforms. The results also demonstrate that oclCUB maintains a small, acceptable performance gap relative to CUB and is comparable in performance to hipCUB.
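As a rough illustration of the kind of underlying computation oclCUB accelerates, the sketch below shows a generic OpenCL work-group sum reduction kernel. It is a minimal example of the reduction pattern only, written under our own assumptions; the kernel name and arguments are ours and do not reflect the actual oclCUB interfaces.

    /* Illustrative sketch only: a generic OpenCL work-group sum reduction,
     * similar in spirit to the reduction primitives described above.
     * The kernel and argument names are hypothetical, not the oclCUB API. */
    __kernel void reduce_sum(__global const float *in,
                             __global float *partial_sums,
                             __local float *scratch,
                             const unsigned int n)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        /* Each work-item loads one element (0 for out-of-range items). */
        scratch[lid] = (gid < n) ? in[gid] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction in local memory. */
        for (size_t stride = lsz / 2; stride > 0; stride >>= 1) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* Work-item 0 writes this work-group's partial sum; a second pass
         * (or a host-side loop) combines the partial sums into the result. */
        if (lid == 0)
            partial_sums[get_group_id(0)] = scratch[0];
    }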
Acknowledgements
This research is supported by the National Key R&D Program of China under Grant 2021YFB0300104, and by the Tianjin Research Innovation Project for Postgraduate Students under Grant 2022BKY023.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shi, C., Sun, Y., Sui, Y. et al. oclCUB: an OpenCL parallel computing library for deep learning operators. CCF Trans. HPC 6, 319–329 (2024). https://doi.org/10.1007/s42514-024-00181-3