Abstract
Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, such as the IBM Cyclops-64 (C64), belong to a new class of many-core-on-a-chip systems with a software-managed memory hierarchy. New programming and compilation methodologies are required to fully exploit the potential of this class of architectures.
In this paper, we use dense matrix multiplication as a case study to present a general methodology for mapping applications to these kinds of architectures. Our methodology has the following characteristics: (1) balanced distribution of work among threads to fully exploit the available resources; (2) optimal register tiling and an optimal sequence for traversing tiles, derived analytically and parametrized by the register file size of the processor, which minimizes memory transfers and maximizes register usage; (3) architecture-specific optimizations that further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, power consumption measurements show that the C64 is highly power efficient, delivering 530 MFLOPS/W for the problem under consideration.
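To illustrate the register-tiling idea summarized above, the following is a minimal sketch of a register-tiled matrix multiplication inner kernel in C. The tile dimensions TM and TN and the routine name are hypothetical, not the values or code used in the paper; in the methodology they would be chosen analytically so that the C tile plus one column of A and one row of B fit in the register file (roughly TM*TN + TM + TN registers).

```c
/*
 * Sketch of a register-tiled matrix multiplication inner kernel.
 * TM and TN are assumed (hypothetical) tile dimensions; in practice they
 * would be chosen so that TM*TN + TM + TN values fit in the register file.
 */
#include <stddef.h>

#define TM 4   /* rows of the register tile (assumed value)    */
#define TN 4   /* columns of the register tile (assumed value) */

/* Computes C[i][j] += sum_k A[i][k] * B[k][j] for one TM x TN tile of C.
 * Matrices are stored row-major with leading dimensions lda, ldb, ldc. */
static void mxm_register_tile(const double *A, const double *B, double *C,
                              size_t K, size_t lda, size_t ldb, size_t ldc)
{
    double c[TM][TN];

    /* Load the C tile; the compiler is expected to keep c[][] in registers. */
    for (size_t i = 0; i < TM; i++)
        for (size_t j = 0; j < TN; j++)
            c[i][j] = C[i * ldc + j];

    /* Stream A and B through the tile: each loaded element of A is reused
     * TN times and each element of B is reused TM times, which is what
     * reduces the ratio of memory operations to floating-point operations. */
    for (size_t k = 0; k < K; k++) {
        double a[TM], b[TN];
        for (size_t i = 0; i < TM; i++) a[i] = A[i * lda + k];
        for (size_t j = 0; j < TN; j++) b[j] = B[k * ldb + j];
        for (size_t i = 0; i < TM; i++)
            for (size_t j = 0; j < TN; j++)
                c[i][j] += a[i] * b[j];
    }

    /* Store the updated tile back to memory. */
    for (size_t i = 0; i < TM; i++)
        for (size_t j = 0; j < TN; j++)
            C[i * ldc + j] = c[i][j];
}
```

In a full implementation, an outer loop nest would assign disjoint sets of such tiles to threads to balance the work, and the order in which tiles are traversed would be chosen to maximize reuse of data already on chip.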
Keywords
- Memory Hierarchy
- Memory Operation
- Tile Size
- Distributed Processing Symposium
- Matrix Multiplication Algorithm
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Garcia, E., Venetis, I.E., Khan, R., Gao, G.R. (2010). Optimized Dense Matrix Multiplication on a Many-Core Architecture. In: D’Ambra, P., Guarracino, M., Talia, D. (eds) Euro-Par 2010 - Parallel Processing. Euro-Par 2010. Lecture Notes in Computer Science, vol 6272. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15291-7_29
DOI: https://doi.org/10.1007/978-3-642-15291-7_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15290-0
Online ISBN: 978-3-642-15291-7