Abstract
Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, such as the IBM Cyclops-64 (C64), belong to a new class of many-core-on-a-chip systems with a software-managed memory hierarchy. New programming and compilation methodologies are required to fully exploit the potential of this class of architectures.
In this paper, we use dense matrix multiplication as a case study to present a general methodology for mapping applications to these kinds of architectures. Our methodology has the following characteristics: (1) balanced distribution of work among threads to fully exploit the available resources; (2) optimal register tiling and an optimal sequence for traversing tiles, derived analytically and parametrized by the register file size of the processor, which minimizes memory transfers and maximizes register usage; (3) architecture-specific optimizations that further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, power consumption measurements show that the C64 is highly power efficient, delivering 530 MFLOPS/W for the problem under consideration.
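To illustrate the register-tiling idea summarized above, the following is a minimal sketch of a register-tiled matrix multiplication inner kernel in C. The tile dimensions TM and TN and the routine name are hypothetical, not the values or code used in the paper; in the methodology they would be chosen analytically so that the C tile plus one column of A and one row of B fit in the register file (roughly TM*TN + TM + TN registers).

```c
/*
 * Sketch of a register-tiled matrix multiplication inner kernel.
 * TM and TN are assumed (hypothetical) tile dimensions; in practice they
 * would be chosen so that TM*TN + TM + TN values fit in the register file.
 */
#include <stddef.h>

#define TM 4   /* rows of the register tile (assumed value)    */
#define TN 4   /* columns of the register tile (assumed value) */

/* Computes C[i][j] += sum_k A[i][k] * B[k][j] for one TM x TN tile of C.
 * Matrices are stored row-major with leading dimensions lda, ldb, ldc. */
static void mxm_register_tile(const double *A, const double *B, double *C,
                              size_t K, size_t lda, size_t ldb, size_t ldc)
{
    double c[TM][TN];

    /* Load the C tile; the compiler is expected to keep c[][] in registers. */
    for (size_t i = 0; i < TM; i++)
        for (size_t j = 0; j < TN; j++)
            c[i][j] = C[i * ldc + j];

    /* Stream A and B through the tile: each loaded element of A is reused
     * TN times and each element of B is reused TM times, which is what
     * reduces the ratio of memory operations to floating-point operations. */
    for (size_t k = 0; k < K; k++) {
        double a[TM], b[TN];
        for (size_t i = 0; i < TM; i++) a[i] = A[i * lda + k];
        for (size_t j = 0; j < TN; j++) b[j] = B[k * ldb + j];
        for (size_t i = 0; i < TM; i++)
            for (size_t j = 0; j < TN; j++)
                c[i][j] += a[i] * b[j];
    }

    /* Store the updated tile back to memory. */
    for (size_t i = 0; i < TM; i++)
        for (size_t j = 0; j < TN; j++)
            C[i * ldc + j] = c[i][j];
}
```

In a full implementation, an outer loop nest would assign disjoint sets of such tiles to threads to balance the work, and the order in which tiles are traversed would be chosen to maximize reuse of data already on chip.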
Keywords
- Memory Hierarchy
- Memory Operation
- Tile Size
- Distributed Processing Symposium
- Matrix Multiplication Algorithm
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Garcia, E., Venetis, I.E., Khan, R., Gao, G.R. (2010). Optimized Dense Matrix Multiplication on a Many-Core Architecture. In: D’Ambra, P., Guarracino, M., Talia, D. (eds) Euro-Par 2010 - Parallel Processing. Euro-Par 2010. Lecture Notes in Computer Science, vol 6272. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15291-7_29
DOI: https://doi.org/10.1007/978-3-642-15291-7_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15290-0
Online ISBN: 978-3-642-15291-7