Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 699-704. doi: 10.11896/jsjkx.201200150
LI Shuang (李爽), ZHAO Rong-cai (赵荣彩), WANG Lei (王磊)
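Abstract: BLAS is the most fundamental math library in high-performance computing and plays an important role in numerical computation, artificial intelligence, and other applications on high-performance computing platforms. The Level 3 BLAS routine GEMM is the core indicator of the performance of the whole BLAS library. At present, no high-performance BLAS library fully exploits the strengths of the Sunway 1621 platform. To address this, GotoBLAS is ported to and optimized for the Sunway 1621 platform. An implementation that optimizes the core code with SIMD vectorization is proposed. To support the vectorized algorithm, it applies data rearrangement, selection of the computation block size, floating-point register allocation, and rewriting with vector instructions. For both SGEMM and DGEMM, the optimal block-selection schemes in the micro-kernel are compared between the cache-line-based variant and the vectorized variant. Experimental results show that, with the best blocking, the optimized single-core SGEMM is on average 52.09 times faster than single-core single-precision GotoBLAS, and the optimized single-core DGEMM is on average 32.75 times faster than single-core double-precision GotoBLAS.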
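The techniques named in the abstract (data rearrangement, block selection, register allocation) can be illustrated with a minimal portable C sketch of a packed GEMM micro-kernel. This is not the authors' code and uses no Sunway 1621 vector instructions: the 4 x 4 register block (`MR`, `NR`), the row-major layout, and the function names `pack_a`, `pack_b`, `micro_kernel`, and `dgemm_blocked` are all assumptions chosen for illustration. The local accumulator array stands in for the floating-point (vector) registers that a hand-tuned kernel would allocate explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical register-block size; the paper's best block sizes are not given here. */
enum { MR = 4, NR = 4 };

/* Data rearrangement: copy an MR x k panel of row-major A (leading dimension lda)
 * so that the MR values needed at each k-step are contiguous. */
static void pack_a(const double *a, size_t lda, size_t k, double *pa) {
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            *pa++ = a[i * lda + p];
}

/* Copy a k x NR panel of row-major B (leading dimension ldb) into contiguous storage. */
static void pack_b(const double *b, size_t ldb, size_t k, double *pb) {
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            *pb++ = b[p * ldb + j];
}

/* Micro-kernel: C[MR x NR] += PA * PB. The acc array plays the role of the
 * register block; a SIMD version would keep each row in a vector register. */
static void micro_kernel(size_t k, const double *pa, const double *pb,
                         double *c, size_t ldc) {
    double acc[MR][NR] = {{0.0}};
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i) {
            double ai = pa[p * MR + i];
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += ai * pb[p * NR + j];
        }
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* C (m x n) += A (m x k) * B (k x n), all row-major and densely stored.
 * Assumes m and n are multiples of MR and NR. Returns 0 on success,
 * -1 if the packing buffers cannot be allocated. */
int dgemm_blocked(size_t m, size_t n, size_t k,
                  const double *a, const double *b, double *c) {
    assert(m % MR == 0 && n % NR == 0);
    if (k == 0) return 0;
    double *pa = malloc(k * MR * sizeof *pa);
    double *pb = malloc(k * NR * sizeof *pb);
    if (!pa || !pb) { free(pa); free(pb); return -1; }
    for (size_t j = 0; j < n; j += NR) {
        pack_b(b + j, n, k, pb);
        for (size_t i = 0; i < m; i += MR) {
            /* For simplicity A is repacked for every column panel;
             * a tuned library would pack each panel once and reuse it. */
            pack_a(a + i * k, k, k, pa);
            micro_kernel(k, pa, pb, c + i * n + j, n);
        }
    }
    free(pa);
    free(pb);
    return 0;
}
```

Packing makes every access inside the inner loop unit-stride, which is what allows the loop body to be rewritten with vector loads and multiply-add instructions. The choice of `MR` and `NR` is bounded by how many accumulators fit in the register file, which is the trade-off that block-size selection explores.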