Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T Ax

Vuduc, Richard; Gyulassy, Attila; Demmel, James W.; Yelick, Katherine A.

doi:10.1007/3-540-44863-2_69

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2659)

Included in the following conference series:

International Conference on Computational Science

Abstract

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation y = A^T Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
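To illustrate what "bringing A through the memory hierarchy only once" can mean, the sketch below contrasts a two-pass evaluation (t = Ax, then y = A^T t) with a fused one-pass loop that uses the identity y = sum over rows a_i of a_i (a_i^T x). It is a minimal sketch, not the authors' implementation: the compressed sparse row (CSR) layout, the function names, and the small test matrix are all assumptions made for illustration, and no register blocking is applied.

```python
# Illustrative sketch only. The CSR layout (row_ptr, col_idx, vals) and the
# function names are assumptions; they are not taken from the paper.

def ata_x_two_pass(n_cols, row_ptr, col_idx, vals, x):
    """Reference version: t = A x, then y = A^T t. Reads A twice."""
    n_rows = len(row_ptr) - 1
    t = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            t[i] += vals[k] * x[col_idx[k]]
    y = [0.0] * n_cols
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += vals[k] * t[i]
    return y


def ata_x_one_pass(n_cols, row_ptr, col_idx, vals, x):
    """Fused version: each row a_i is used for the dot product t_i = a_i . x
    and immediately reused for the update y += t_i * a_i, so a row that is
    still in cache (or registers) is not fetched from memory a second time."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        t_i = 0.0
        for k in range(start, end):      # t_i = a_i . x
            t_i += vals[k] * x[col_idx[k]]
        for k in range(start, end):      # y += t_i * a_i
            y[col_idx[k]] += t_i * vals[k]
    return y


# Hypothetical 2x3 example matrix:
# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0]

y_fused = ata_x_one_pass(3, row_ptr, col_idx, vals, x)
y_ref = ata_x_two_pass(3, row_ptr, col_idx, vals, x)
```

Both functions perform the same floating-point work; the fused loop only changes the order in which the entries of A are touched, which is why the benefit depends on the memory hierarchy rather than on the operation count.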

Author information

Authors and Affiliations

Computer Science Division, University of California, Berkeley
Richard Vuduc, Attila Gyulassy, James W. Demmel & Katherine A. Yelick

Editor information

Editors and Affiliations

Informatics Institute, Section of Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Peter M. A. Sloot
School of Computer Science and Software Engineering, Monash University, Wellington Road, Clayton, VIC 3800, Australia
David Abramson
Institute for High-Performance Computing and Information Systems, Fontanka emb. 6, St. Petersburg, 191187, Russia
Alexander V. Bogdanov & Yuriy E. Gorbachev
Computer Science Dept., University of Tennessee and Oak Ridge National Laboratory, 1122 Volunteer Blvd., Knoxville, TN, 37996-3450, USA
Jack J. Dongarra
School of Information Technologies, CISCO Systems, The University of Sydney, Madsen Building F09, Sydney, NSW, 2006, Australia
Albert Y. Zomaya

About this paper

Cite this paper

Vuduc, R., Gyulassy, A., Demmel, J.W., Yelick, K.A. (2003). Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T Ax. In: Sloot, P.M.A., Abramson, D., Bogdanov, A.V., Gorbachev, Y.E., Dongarra, J.J., Zomaya, A.Y. (eds) Computational Science — ICCS 2003. ICCS 2003. Lecture Notes in Computer Science, vol 2659. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44863-2_69

DOI: https://doi.org/10.1007/3-540-44863-2_69
Published: 18 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40196-4
Online ISBN: 978-3-540-44863-1
eBook Packages: Springer Book Archive
