Abstract
This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = A T Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper-bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-o. Finally, we propose a hybrid o.-line/run-time heuristic which in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
A.J.C. Bik and H.A.G. Wijsho.. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.
J. Bilmes, K. Asanović, C. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, July 1997.
S. Blackford et al. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. Chapter 3: http://www.netlib.org/blast.
S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable crossplatform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.
J.W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
B.B. Fraguela, R. Doallo, and E.L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), 1999.
W.D. Gropp, D.K. Kasushik, D.E. Keyes, and B.F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241–248, 1999.
G. Heber, A.J. Dolgert, M. Alt, K.A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the intel itanium architecture: A case study. In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX, 2001.
E.-J. Im and K.A. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of ICCS, pages 127–136, May 2001.
J.M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604–632, 1999.
Y. Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. http://www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.
P. Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.
O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing’ 92, 1992.
R. Vuduc, J.W. Demmel, K.A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
R. Vuduc, A. Gyulassy, J.W. Demmel, and K.A. Yelick. Memory hierarchy optimizations and performance bounds for sparse ATAx. Technical Report UCB/CS-03-1232, University of California, Berkeley, February 2003.
R. Vuduc, S. Kamil, J. Hsu, R. Nishtala, J.W. Demmel, and K.A. Yelick. Automatic performance tuning and analysis of sparse triangular solve. In ICS 2002: POHLL Workshop, New York, USA, June 2002.
W. Wang and D.P. O’Leary. Adaptive use of iterative methods in interior point methods for linear programming. Technical Report UMIACS-95-111, University of Maryland at College Park, College Park, MD, USA, 1995.
C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. of Supercomp., Orlando, FL, 1998.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vuduc, R., Gyulassy, A., Demmel, J.W., Yelick, K.A. (2003). Memory Hierarchy Optimizations and Performance Bounds for Sparse A T Ax . In: Sloot, P.M.A., Abramson, D., Bogdanov, A.V., Gorbachev, Y.E., Dongarra, J.J., Zomaya, A.Y. (eds) Computational Science — ICCS 2003. ICCS 2003. Lecture Notes in Computer Science, vol 2659. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44863-2_69
Download citation
DOI: https://doi.org/10.1007/3-540-44863-2_69
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40196-4
Online ISBN: 978-3-540-44863-1
eBook Packages: Springer Book Archive