Abstract
Convolutional Neural Networks (CNNs) are widely used in machine learning applications and are very time-consuming, with most of their execution time spent in convolutional layers. A common approach to implementing convolutions is the FFT-based one, which reduces the arithmetic complexity of convolutions without losing much precision. As the performance of ARMv8 multi-core CPUs improves, they, like Intel X86 CPUs, can be used to run CNNs. In this paper, we present a new parallel FFT-based convolution implementation on ARMv8 multi-core CPUs. The implementation makes efficient use of ARMv8 multi-core CPUs through a series of computation and memory optimizations. Experimental results on two ARMv8 multi-core CPUs demonstrate that our new implementation delivers much better performance than two existing approaches in most cases.
This work was supported by the National Key Research and Development Program of China (No. 2018YFB0204301) and the National Natural Science Foundation of China under grant nos. 61602500, 91530324 and 91430218.
1 Introduction
Convolutional Neural Networks (CNNs) are widely used in machine learning applications such as computer vision [4, 10]. In some specific tasks, such as image classification [6], their performance even exceeds human capabilities. This is mainly due to large-scale training data sets and deep convolutional network structures, which also make CNNs very time-consuming. CNNs usually contain convolutional, pooling, activation, and fully-connected layers, and most of their execution time is spent in the convolutional layers. Therefore, improving the performance of the convolutional layers is particularly important.
Common approaches to performing convolutions include the matrix multiplication-based, Winograd-based and Fast Fourier Transform (FFT)-based ones [7, 9, 16,17,18]. The matrix multiplication-based approach directly transforms a convolution into matrix multiplications, which are carried out by general matrix multiplication (GEMM) routines from a Basic Linear Algebra Subprograms (BLAS) library, and is therefore also labeled a GEMM-based approach. Its main disadvantages are the explosion of memory requirements and the suboptimal performance of BLAS libraries on the produced matrices. The Winograd-based approach reduces the arithmetic complexity of convolutions by means of Winograd minimal filtering algorithms. However, it may introduce a non-negligible loss of accuracy and is mainly applicable to convolutional layers with small filters. The FFT-based approach converts convolutions in the time domain into multiplications in the frequency domain, so the computational cost of convolutions is also reduced while the accuracy loss stays negligible. In terms of performance, FFT-based implementations generally outperform Winograd-based ones [19]. Thus, the FFT-based approach is suitable for more convolutional layers than the Winograd-based one. To further improve the performance of convolutions, it is therefore worthwhile to study efficient parallelization of the FFT-based approach on parallel hardware.
Currently, many efforts have focused on efficient implementations of FFT-based convolutions on various hardware platforms. Mathieu and Vasilache et al. [11, 15] first examined the performance of different implementations of FFT-based convolutions on GPUs. Zlateski et al. [19,20,21] mainly studied high-performance implementations of FFT-based convolutions on Intel many-core CPUs. However, there is relatively little work on the optimization of FFT-based convolutions on ARMv8 multi-core CPUs.
With the performance enhancement of ARMv8 multi-core CPUs [12, 13], they can also be used to run deep neural networks, like Intel X86 CPUs. However, high-performance convolution primitives for the ARMv8 architecture are still lacking. In this paper, we propose a parallel FFT-based convolution implementation on ARMv8 multi-core CPUs. Our implementation consists of four stages: FFT transforms of input feature maps and filters, complex matrix multiplications, and IFFT transforms of output feature maps. All four stages are vectorized and parallelized at the thread level. The transformed data of input feature maps and filters is stored back to memory according to the access order of the optimized complex matrix multiplications, so that unnecessary data movement is avoided. Custom data layouts for the internal tensors are proposed to support this optimization efficiently. Our implementation is tested on Phytium FT-1500A [12] and FT-2000plus [13]. The convolutional layers from Alexnet and VGG are used to compare the performance of an existing FFT-based implementation in NNPACK, a GEMM-based one used in Caffe, and our new one. Compared with the GEMM-based implementation, our implementation achieves speedups of 1.48–16.19 and 3.86–78.08 times on the two CPUs, respectively. Our implementation outperforms the FFT-based implementation of NNPACK in most cases on FT-1500A, and in all test cases on FT-2000plus, with maximum speedups of 2.16 and 7.04 times, respectively.
The structure of this paper is as follows. Section 2 introduces the detailed definition and one naive FFT-based implementation of convolutions. Section 3 describes our algorithm and optimizations on ARMv8 multi-core CPUs. The performance results are analyzed in Sect. 4. Finally, Sect. 5 concludes this paper and gives our future work.
2 Background
2.1 Convolution
A convolution takes input feature maps I and filters F as input and produces output feature maps O. In C code style, input and output feature maps with the BCHW (batch, channel, height, width) layout are written as \(I[B][C][H_i][W_i]\) and \(O[B][C'][H_o][W_o]\), and the corresponding filters are \(F[C'][C][H_f][W_f]\). The convolution in deep learning networks is expressed as

$$O[b][c'][h'][w'] = \sum \limits _{c=0}^{C-1}{\sum \limits _{h_f=0}^{H_f-1}{\sum \limits _{w_f=0}^{W_f-1}{I[b][c][h'\times s + h_f][w'\times s + w_f] \times F[c'][c][h_f][w_f]}}} \quad (1)$$
where \(b \in [0,B)\), \(c'\in [0,C')\), \(h'\in [0,H_o)\), \(w'\in [0,W_o)\), \(c\in [0,C)\), B is the mini-batch size, C and \(C'\) denote the number of input and output channels, \(H_{i/o/f}\) and \(W_{i/o/f}\) represent spatial dimensions of different tensors, and s is the stride size. In the following, we only consider the case where the stride size is 1.
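For reference, the definition above corresponds to the following plain C loop nest for the unit-stride case; the function name and the assumption of no padding (so \(H_o = H_i - H_f + 1\) and \(W_o = W_i - W_f + 1\)) are ours, for illustration only.

```c
/* Direct convolution with unit stride and BCHW layout:
 * O[B][C'][Ho][Wo], I[B][C][Hi][Wi], F[C'][C][Hf][Wf],
 * assuming no padding, i.e. Ho = Hi - Hf + 1 and Wo = Wi - Wf + 1. */
void conv_direct(const float *I, const float *F, float *O,
                 int B, int C, int Cp, int Hi, int Wi, int Hf, int Wf)
{
    int Ho = Hi - Hf + 1, Wo = Wi - Wf + 1;
    for (int b = 0; b < B; b++)
        for (int cp = 0; cp < Cp; cp++)
            for (int ho = 0; ho < Ho; ho++)
                for (int wo = 0; wo < Wo; wo++) {
                    float acc = 0.0f;
                    for (int c = 0; c < C; c++)
                        for (int hf = 0; hf < Hf; hf++)
                            for (int wf = 0; wf < Wf; wf++)
                                acc += I[((b * C + c) * Hi + ho + hf) * Wi + wo + wf]
                                     * F[((cp * C + c) * Hf + hf) * Wf + wf];
                    O[((b * Cp + cp) * Ho + ho) * Wo + wo] = acc;
                }
}
```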
2.2 FFT-Based Convolution
The convolution theorem shows that a convolution in the time domain can be transformed into element-wise multiplications in the frequency domain. Applied to the field of deep learning, it turns Eq. 1 into

$$O_{b,c'} = \sum \limits _{c=0}^{C-1}{IFFT\big (FFT(I_{b,c}) \odot FFT^*(F_{c',c})\big )} \quad (2)$$
where FFT and IFFT are 2D Fast Fourier Transforms and Inverse Fast Fourier Transforms respectively, \(\odot \) denotes element-wise complex multiplication, and \(*\) represents complex conjugation.
In FFT and IFFT, the discrete Fourier basis is chosen to be the largest among the spatial dimensions of the three tensors [15]. When the spatial dimensions of some tensors are smaller than the Fourier basis, they are zero-padded to the same size. However, the spatial dimensions of F are often much smaller than those of the feature map tensors, so the overhead of padding is non-trivial. Thus, the tile-based approach is often used to reduce this overhead. At the same time, the linearity of the Fourier transforms allows the sum in Eq. 2 to be performed before the IFFT. So Eq. 2 is transformed into

$$O_{b,c',\alpha ,\beta } = IFFT\Big (\sum \limits _{c=0}^{C-1}{FFT(I_{b,c,\alpha ,\beta }) \odot FFT^*(F_{c',c})}\Big ) \quad (3)$$
where \(\alpha \) and \(\beta \) denote the spatial coordinates of each tile.
Each component of the element-wise complex multiplication is labeled \((\varphi ,\gamma )\). The sum and the element-wise complex multiplication can then be merged into complex matrix multiplications:

$$Z_{c',b,\alpha ,\beta }^{(\varphi ,\gamma )} = \sum \limits _{c=0}^{C-1}{D_{c,b,\alpha ,\beta }^{(\varphi ,\gamma )} \times G_{c',c}^{(\varphi ,\gamma )}} \quad (4)$$
where \(D_{c,b,\alpha ,\beta }^{(\varphi ,\gamma )} = FFT{({I_{b,c,\alpha ,\beta }})^{(\varphi ,\gamma )}}\) and \(G_{c',c}^{(\varphi ,\gamma )} = FFT^*{({F_{c',c}})^{(\varphi ,\gamma )}}\). Thus, the original implementation of an FFT-based convolution is listed in Algorithm 1. It mainly consists of four procedures: FFT transforms of input feature maps and filters, complex matrix multiplications, and IFFT transforms of output feature maps.
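The complex matrix multiplications of Eq. 4 can be read as one independent matrix product per frequency component; a naive scalar C sketch is given below (it is not Algorithm 1 itself and the function name is illustrative), where n enumerates the \((b,\alpha ,\beta )\) triples.

```c
#include <complex.h>

/* Naive accumulation of Eq. (4) for one frequency component (phi, gamma):
 * Z[cp][n] = sum over c of D[c][n] * G[cp][c], with n running over
 * (b, alpha, beta). Illustrative only; the blocked version is in Sect. 3.4. */
static void cgemm_component(const float complex *D,  /* C  x N */
                            const float complex *G,  /* C' x C */
                            float complex *Z,        /* C' x N */
                            int C, int Cp, int N)
{
    for (int cp = 0; cp < Cp; cp++)
        for (int n = 0; n < N; n++) {
            float complex acc = 0.0f;
            for (int c = 0; c < C; c++)
                acc += D[c * N + n] * G[cp * C + c];
            Z[cp * N + n] = acc;
        }
}
```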

3 Algorithm and Optimizations
This section gives an overview of our FFT-based convolution algorithm, and presents our optimizations.
3.1 Algorithm Overview
FFT and IFFT operations in the FFT-based convolution only involve Fourier transforms between real and complex numbers. For the Fourier transform of real numbers, the Hermitian symmetry shows that only half of the complex numbers need to be stored and the remaining ones can be obtained by complex conjugation [15]. Thus, we can exploit this symmetry to reduce both the memory requirement and the amount of computation in the complex matrix multiplications.
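As a concrete illustration (our own arithmetic, based on the storage counts given in Sect. 3.3), a real \(16 \times 16\) tile has 256 real inputs, and by Hermitian symmetry its 2D transform is fully determined by \(\delta ^2/2-2 = 126\) complex values plus 4 purely real values, i.e. exactly 256 stored reals; the transformed tile thus needs no extra storage, and the frequency-domain multiplications only touch about half of the \(\delta ^2\) complex components.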
In order to call complex general matrix multiplication (CGEMM) routines, the elements of the FFT results would have to be scattered to non-adjacent storage locations. The CGEMM routines contain packing operations, which reorganize the data in the order of access. Both the scattering and the packing operations are often expensive. Thus, we combine these two data movement operations to further reduce memory overhead: the results of the FFTs are directly scattered in the order in which the complex matrix multiplication implementation accesses them.
Algorithm 2 gives an overview of our parallel FFT-based convolution implementation, which still consists of four stages: FFT transforms of input feature maps and filters, complex matrix multiplications, and IFFT transforms of output feature maps. All four stages are vectorized and parallelized by multiple threads. The FFT results of input feature maps and filters are carefully stored in accordance with the access order of the complex matrix multiplications, so that the efficiency of memory access is greatly improved.

3.2 Data Layout
In this paper, we mainly focus on the BCHW data layout. Therefore, the input and output data layouts in our implementation are consistent with those in Algorithm 1, and we only need to consider how the internal tensors of our implementation are stored in memory. There are three main internal tensors, which hold the results of the two Fourier transforms and the complex matrix multiplications: the transformed inputs D, the transformed filters G, and the transformed outputs Z. Their layout is influenced by two primary factors. One is the order in which elements are loaded in the complex matrix multiplications. The other is that the range of memory accesses should be minimized to obtain better spatial locality. Under these two constraints, we store the three internal tensors as \(D[\delta ^2/S][C/C_{l1}][B/B_r][\mathrm X \times \varDelta ][C_{l1}][B_r][S]\), \(G[\delta ^2/S][C/C_{l1}][C'/C'_r][C_{l1}][C'_r][S]\), and \(Z[C'/C'_r][B/B_r][\mathrm X \times \varDelta ][\delta ^2/S][B_r][C'_r][S]\), where S is the granularity of the scattering and gathering operations, and \(C_{l1}\), \(C'_r\) and \(B_r\) are the block sizes of the complex matrix multiplications, which will be explained in Sect. 3.4.
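To make the layout of the transformed inputs D concrete, the following sketch computes the linear offset of element \(D[s_{blk}][c_{blk}][b_{blk}][t][c_{in}][b_{in}][s]\); the helper name and parameter names are ours and purely illustrative.

```c
#include <stddef.h>

/* Linear offset into D[delta^2/S][C/C_l1][B/B_r][X*Delta][C_l1][B_r][S].
 * s_blk, c_blk, b_blk are block indices, t is the tile index, c_in and b_in
 * are positions inside a channel/batch block, and s is the position inside
 * the innermost group of S elements. */
static inline size_t d_offset(size_t s_blk, size_t c_blk, size_t b_blk,
                              size_t t, size_t c_in, size_t b_in, size_t s,
                              size_t C, size_t B, size_t n_tiles,
                              size_t C_l1, size_t B_r, size_t S)
{
    size_t off = s_blk;
    off = off * (C / C_l1) + c_blk;
    off = off * (B / B_r)  + b_blk;
    off = off * n_tiles    + t;
    off = off * C_l1       + c_in;
    off = off * B_r        + b_in;
    off = off * S          + s;
    return off;
}
```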
3.3 Fourier Transformations
In the FFT transforms, D and G are computed from the input feature maps I and the filters F. The spatial dimensions of the input feature maps are subdivided into 2D tiles of size \(\delta \times \delta \), each of which has \(\delta ^2\) elements; there are \(\mathrm X \times \varDelta \) tiles per feature map in total. The discrete Fourier basis is set to \(\delta \). The radix-2 Cooley-Tukey algorithm [2] is applied to the 2D FFT of each tile, and \(\delta \) is chosen to be a power of 2. When the width or height of a tile is not a power of 2, zeros are padded to its boundaries. As the padding and transform overhead grows with the amount of zero padding [9], \(\delta \) cannot be set much larger than the spatial dimensions \(H_f\) and \(W_f\) of F, which are often small. As a result, the FFT transform of a single tile cannot provide sufficient work for thread-level parallelism. In Algorithm 2, we therefore vectorize the FFT transform of each tile and use multiple threads to transform different tiles in parallel, as sketched below. For the transform of I, the thread-level parallelization is performed over the mini-batch and input-channel dimensions. For the transform of F, it is carried out over the input-channel and output-channel dimensions.
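A minimal sketch of this decomposition for the input transform, assuming an OpenMP build, is given below; DELTA and the helpers load_tile, fft2d_tile_vec and scatter_tile are illustrative placeholders for the per-tile gather, vectorized 2D FFT and layout-aware store.

```c
#include <omp.h>

#define DELTA 16   /* tile size delta, e.g. a 16 x 16 tile */

/* Illustrative placeholders for the per-tile steps described in the text. */
void load_tile(const float *I, float *tile, int b, int c, int t);
void fft2d_tile_vec(float *tile);
void scatter_tile(float *D, const float *tile, int b, int c, int t);

/* Thread-level decomposition of the input transform: threads are spread over
 * the mini-batch and input-channel dimensions, while each tile's 2D FFT stays
 * inside one thread and relies only on vector instructions. */
void transform_inputs(const float *I, float *D, int B, int C, int n_tiles)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int b = 0; b < B; b++)
        for (int c = 0; c < C; c++)
            for (int t = 0; t < n_tiles; t++) {
                float tile[DELTA * DELTA];
                load_tile(I, tile, b, c, t);    /* gather and zero-pad a tile */
                fft2d_tile_vec(tile);           /* vectorized 2D FFT          */
                scatter_tile(D, tile, b, c, t); /* store in the blocked layout*/
            }
}
```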
Given a specific data type, the vector register width of ARMv8 CPUs is labeled L. In the detailed implementation of the 2D FFTs, \(\delta \)-point 1D FFTs of every L columns are first carried out in parallel by means of the vector units of ARMv8 CPUs. Owing to the Hermitian symmetry, only \(\delta /2 - 1\) complex numbers and 2 real numbers need to be saved for the \(\delta \)-point 1D FFT of each column, and then \(\delta \)-point 1D FFTs of only \(\delta /2\) rows need to be performed. In order to avoid matrix transpose operations, the vectorization is applied directly to the \(\delta \)-point 1D FFT of each row. Finally, only \(\delta ^2/2-2\) complex numbers and 4 real numbers need to be stored. As \(\delta \) is small, the number of twiddle factors is also small, and their values are hard-coded into the implementation.
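As a minimal illustration of the column-wise vectorization, assuming the NEON intrinsics from <arm_neon.h>, the following unit-twiddle radix-2 butterfly updates L = 4 adjacent columns at once; the complex twiddle stages and the Hermitian packing of the real implementation are omitted.

```c
#include <arm_neon.h>

/* One unit-twiddle radix-2 butterfly applied to four adjacent columns:
 * x and y each hold one value from four independent delta-point 1D FFTs
 * (rows k and k + delta/2), and are updated to x' = x + y, y' = x - y.
 * Each 128-bit vector register covers L = 4 single-precision columns. */
static inline void butterfly4_cols(float *x, float *y)
{
    float32x4_t vx = vld1q_f32(x);
    float32x4_t vy = vld1q_f32(y);
    vst1q_f32(x, vaddq_f32(vx, vy));   /* x' = x + y */
    vst1q_f32(y, vsubq_f32(vx, vy));   /* y' = x - y */
}
```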
In the IFFT transforms, the output feature maps O are computed from the result Z of the complex matrix multiplications. 2D IFFTs are applied to the tiles, each of which is gathered from Z and consists of \(\delta ^2/2-2\) complex numbers and 4 real numbers, and produce tiles of \(\delta \times \delta \) real numbers, which are stored back to the corresponding locations of O. In the data layout of Z, placing the dimension \(\delta ^2/S\) inside \(\mathrm X \times \varDelta \) rather than outside \(C'/C'_r\) reduces the overhead of these gather operations. As in the FFT implementations, we only exploit vector-level parallelism within the 2D IFFT of each tile and apply thread-level parallelization over the mini-batch and output-channel dimensions.
3.4 Complex Matrix Multiplications
As the transforms of the input feature maps and filters have already stored their outputs in the order in which the complex matrix multiplications access them, no packing is needed in this implementation. The mini-batch size and the number of output channels are often small, so vector units are used to compute multiple complex matrix multiplications at once, and blocking along the \(\delta ^2\) dimension provides the vector-level parallelism. In this way, the scattering and gathering overhead in the transforms is reduced by a factor of the block size S/2.
The ARMv8 architecture typically has an on-chip memory hierarchy of at least three levels: registers, level-1 (L1) cache and level-2 (L2) cache. It is essential to improve data reuse at every level by means of blocking techniques [5]. The matrices G, D and Z are subdivided into sub-matrices of size \(C_{l1} \times C'_r \times S\), \(C_{l1} \times B_r \times S\) and \(B_r \times C'_r \times S\), respectively. In one case, the S elements of the innermost dimension consist of four real numbers and \(S/2-2\) complex numbers; in all other cases, they consist of S/2 complex numbers. Each sub-matrix \(z'\) is computed as

$$z'_{i,j} = z'_{i,j} + \sum \limits _{k=0}^{C_{l1}-1}{d'_{k,i} \odot g'_{k,j}}$$
where \(i \in [0, {B_r})\) and \(j \in [0, {C'_r})\). At the register level, the sub-matrix \(z'\) can be reused \(C_{l1}\) times. There are \(C'_{r} \times S/L\) registers for \(g'\), \(B_{r} \times S/L\) registers for \(d'\) and \(B_r \times C'_{r} \times S/L\) registers for \(z'\), so the block sizes \(C'_r\), \(B_r\) and S are constrained by the number of available vector registers \(\varUpsilon \) in the ARMv8 architecture as follows:

$$(C'_r + B_r + B_r \times C'_r) \times S/L \le \varUpsilon $$
At the same time, all three sub-matrices above should fit into the L1 cache, so the block size \(C_{l1}\) is restricted by the size of the L1 cache. In most cases, the ratio \(\varPsi \) between computation and memory access [5] can then be derived from the block sizes above.
The ratio should be as high as possible under the constraints above. The computation of each sub-matrix \(z'\) is called a micro-kernel of the complex matrix multiplications. The outer loops around the micro-kernels are arranged in an order that maximizes data reuse in the L1 and L2 caches. As shown at lines 22–29 in Algorithm 2, we choose to reuse G in the L1 cache and D in the L2 cache. The block size \(C_{l2}\) determines how many times the sub-matrices \(g'\) are reused in the L1 cache and is limited by the size of the L2 cache. The temporal locality of D in the L2 cache depends on the size of \(B/B_r \times \mathrm X \times \varDelta \).
There are thirty-two vector registers in the ARMv8 architecture, and each vector register can hold four single-precision floating-point numbers. For the micro-kernels, we set S, \(B_{r}\), and \(C'_{r}\) to 8, 2 and 4, respectively. All the micro-kernels are implemented in assembly. When the sub-matrices \(g'\) and \(d'\) include real numbers, the data movement among vector registers is minimized by zeroing some registers in advance. Cache prefetch instructions are interleaved with FMA instructions to request data ahead of time. The thread-level parallelism is extracted from the three loops at lines 24–26 in Algorithm 2, which usually provide sufficient parallelism.
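The following plain-C sketch restates the micro-kernel with these block sizes (\(B_r = 2\), \(C'_r = 4\), S = 8 floats, i.e. four packed complex components per innermost group); it ignores the mixed real/complex boundary case and the NEON register allocation of the actual assembly kernel, and all names are illustrative.

```c
#include <complex.h>

/* Scalar restatement of one micro-kernel: z'[i][j] accumulates the
 * element-wise complex products of d'[k][i] and g'[k][j] over k in
 * [0, C_l1). BR = B_r, CR = C'_r and SC = S/2 packed complex lanes. */
enum { BR = 2, CR = 4, SC = 4 };

static void micro_kernel(const float complex *dp,  /* [C_l1][BR][SC] */
                         const float complex *gp,  /* [C_l1][CR][SC] */
                         float complex *zp,        /* [BR][CR][SC]   */
                         int C_l1)
{
    for (int k = 0; k < C_l1; k++)            /* z' is reused C_l1 times     */
        for (int i = 0; i < BR; i++)
            for (int j = 0; j < CR; j++)
                for (int s = 0; s < SC; s++)  /* element-wise over S/2 lanes */
                    zp[(i * CR + j) * SC + s] +=
                        dp[(k * BR + i) * SC + s] * gp[(k * CR + j) * SC + s];
}
```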
4 Experimental Results
This section describes the experimental comparisons between our FFT-based convolution implementation and two existing implementations on two ARMv8-based multi-core CPUs.
4.1 Experimental Setup
The experiments are carried out on Phytium FT-1500A [1, 12] and FT-2000plus [13] processors. The detailed parameters of these two CPUs are listed in Table 1.
Our FFT-based convolution implementation is compared with two existing implementations. One is the GEMM-based approach used in Caffe [7], which converts the convolution of the B samples in a mini-batch into B matrix multiplications; it therefore calls the GEMM routine B times, provided in our experiments by the OpenBLAS library optimized for Phytium FT-1500A and FT-2000plus (a sketch of this baseline is given below). The other is the FFT-based convolution implementation provided by NNPACK [3]. In the following, our implementation and the two existing ones are labeled PFFT-conv, Caffe-conv and NNPACK, respectively. Two tile sizes, 8 \(\times \) 8 and 16 \(\times \) 16, are used in PFFT-conv and NNPACK.
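The structure of the GEMM-based baseline can be summarized as follows, assuming the usual im2col lowering and the standard cblas_sgemm interface; the im2col prototype and the function name conv_gemm are illustrative and do not refer to Caffe's actual code.

```c
#include <stddef.h>
#include <cblas.h>

/* Illustrative im2col: lowers one sample's C x Hi x Wi image to a
 * (C*Hf*Wf) x (Ho*Wo) column matrix for a unit-stride convolution. */
void im2col(const float *img, float *cols,
            int C, int Hi, int Wi, int Hf, int Wf, int Ho, int Wo);

/* GEMM-based convolution: one SGEMM call per sample in the mini-batch,
 * i.e. B calls in total, with M = C', N = Ho*Wo and K = C*Hf*Wf. */
void conv_gemm(const float *I, const float *F, float *O, float *cols,
               int B, int C, int Cp, int Hi, int Wi,
               int Hf, int Wf, int Ho, int Wo)
{
    int M = Cp, N = Ho * Wo, K = C * Hf * Wf;
    for (int b = 0; b < B; b++) {
        im2col(I + (size_t)b * C * Hi * Wi, cols, C, Hi, Wi, Hf, Wf, Ho, Wo);
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, F, K, cols, N,
                    0.0f, O + (size_t)b * Cp * Ho * Wo, N);
    }
}
```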
We adopt the 13 unique convolutional layers with unit stride from Alexnet [8] and VGG [14] in the tests. The configurations of all convolutional layers are listed in Table 2. The convolutional layers from Alexnet start with the letter A, while the ones from VGG are labeled with the letter V. The mini-batch size for all convolutional layers is 128. In addition, every test is run ten times and the median run-time is taken as its performance.
4.2 Results on FT-1500A
The performance of our parallel FFT-based convolution implementation relative to the Caffe-conv and NNPACK implementations on Phytium FT-1500A is shown in Figs. 1 and 2. In the comparison, all three implementations are parallelized on all 16 cores of FT-1500A. The column bars at different horizontal coordinates represent the speedups on the different convolutional layers from Alexnet and VGG.
Compared with Caffe-conv, our approach with the tiles of sizes \(16 \times 16\) and \(8 \times 8\) achieves speedups of 1.87–16.19 and 1.48–13.34 times, respectively. The minimum speedups of both tile sizes are observed on the first convolutional layer of VGG (Vconv1.1) owing to its small number of input channels. Except for Vconv1.1, our approach achieves a speedup of at least 2.78 times. For all the tested convolutional layers, our implementation with the \(16 \times 16\) tile outperforms the one with the \(8 \times 8\) tile.
Compared with the FFT-based implementation with the \(16 \times 16\) and \(8 \times 8\) tiles in NNPACK, our implementation with the same tile sizes obtains speedups of 1.36–1.95 and 1.00–2.16 times, respectively. For the \(16 \times 16\) tile, our implementation surpasses the FFT-based one in NNPACK on all the layers. Except for the second convolutional layer of Alexnet, our approach with the \(8 \times 8\) tile also achieves higher performance than the NNPACK implementation with the same tile size.
4.3 Results on FT-2000plus
The performance comparison between our parallel FFT-based implementation and the two existing implementations (Caffe-conv and NNPACK) on Phytium FT-2000plus is shown in Figs. 3 and 4, respectively. FT-2000plus is a Non-Uniform Memory Access (NUMA) system and includes eight panels, each of which has eight cores. In the comparison, all the tests are parallelized on all 64 cores of FT-2000plus, and the Linux tool numactl is used to interleave memory allocation across all eight panels.
For all the convolutional layers, our implementation is much better than Caffe-conv, as shown in Fig. 3. Compared with Caffe-conv, our implementation with the two tile sizes achieves speedups of 5.35–50.88 and 3.86–78.08 times, which stem from two main factors. One is that the matrices produced by Caffe-conv are too small to provide sufficient parallelism for all 64 cores of FT-2000plus, and the GEMM routines are not optimized for such matrices. The other is that the memory access of Caffe-conv is not efficient enough [17] and its efficiency further deteriorates on the NUMA structure of FT-2000plus. Owing to the influence of the NUMA structure, our implementation with the \(16 \times 16\) tile performs worse than the one with the \(8 \times 8\) tile on most convolutional layers.
As shown in Fig. 4, our implementation with the two tile sizes achieves maximum speedups of 5.91 and 7.04 times over NNPACK with the same tile sizes, respectively. In addition, our approach is better than NNPACK on all the tested convolutional layers.
5 Conclusion and Future Work
In this paper, we have presented a parallel FFT-based convolution implementation on ARMv8 multi-core CPUs, which targets unit-stride convolutional layers with the BCHW data layout. Our implementation does not rely on any external computing libraries and consists of four stages: FFT transforms of input feature maps and filters, complex matrix multiplications, and IFFT transforms of output feature maps. Each of the four stages is vectorized and partitioned across the cores of ARMv8 multi-core CPUs. Part of the data movement operations of the four stages is merged, so that the efficiency of memory access is greatly improved. Our implementation currently supports two tile sizes, \(16 \times 16\) and \(8 \times 8\), and is verified on Phytium FT-1500A and FT-2000plus processors. For all the tested convolutional layers on the two processors, our approach is much better than the GEMM-based one used in Caffe. On FT-1500A, our implementation surpasses the FFT-based one of NNPACK in most cases. On FT-2000plus, our approach is much better than the FFT-based one of NNPACK in all test cases.
In the future, we will focus on the implementation that supports more tile sizes and can automatically determine the optimal tile size.
References
Chen, X., Xie, P., Chi, L., Liu, J., Gong, C.: An efficient SIMD compression format for sparse matrix-vector multiplication. Concurr. Comput.: Pract. Experience 30(23), e4800 (2018)
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
Dukhan, M.: NNPACK (2019). https://github.com/Maratyszcza/NNPACK. Accessed 3 Jan 2019
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Goto, K., van de Geijn, R.A.: Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. (TOMS) 34(3), 12 (2008)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4021 (2016)
Li, S., Dou, Y., Niu, X., Lv, Q., Wang, Q.: A fast and memory saved GPU acceleration algorithm of convolutional neural networks for target detection. Neurocomputing 230, 48–59 (2017)
Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs. In: International Conference on Learning Representations (ICLR 2014), CBLS, April 2014 (2014)
Phytium: FT-1500A/16 (2020). http://www.phytium.com.cn/Product/detail?language=1&product_id=9. Accessed 3 Jan 2020
Phytium: FT-2000plus/64 (2020). http://www.phytium.com.cn/Product/detail?language=1&product_id=7. Accessed 3 Jan 2020
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Vasilache, N., Johnson, J., Mathieu, M., Chintala, S., Piantino, S., LeCun, Y.: Fast convolutional nets with FBFFT: a GPU performance evaluation. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015 (2015)
Wang, Q., Li, D., Mei, S., Lai, Z., Dou, Y.: Optimizing Winograd-based fast convolution algorithm on Phytium multi-core CPUs (in Chinese). J. Comput. Res. Dev. 57(6), 1140–1151 (2020). https://doi.org/10.7544/issn1000-1239.2020.20200107
Wang, Q., Mei, S., Liu, J., Gong, C.: Parallel convolution algorithm using implicit matrix multiplication on multi-core CPUs. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, July 2019. https://doi.org/10.1109/IJCNN.2019.8852012
Zhang, J., Franchetti, F., Low, T.M.: High performance zero-memory overhead direct convolutions. In: International Conference on Machine Learning, pp. 5771–5780 (2018)
Zlateski, A., Jia, Z., Li, K., Durand, F.: FFT convolutions are faster than winograd on modern CPUs, here is why. arXiv preprint arXiv:1809.07851 (2018)
Zlateski, A., Lee, K., Seung, H.S.: ZNN-a fast and scalable algorithm for training 3D convolutional networks on multi-core and many-core shared memory machines. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 801–811. IEEE (2016)
Zlateski, A., Lee, K., Seung, H.S.: ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 73. IEEE Press (2016)