Abstract
As multi and many core chips steadily increase their core count, we observe a phenomenon we call memory hierarchy capacity per capita inversion. To overcome this inversion while remaining energy-efficient, we present a dynamic tiling scheme which we apply to solve the classic Matrix Multiply algorithm. The tiling scheme follows a Hilbert-Inspired Curve strategy to minimize data movement energy, while still allowing for slack and variance within the computation and memory usage of a chip. Our algorithm is energy-conscious: it uses a machine model which does not require symmetric memory (in size or addressing) anywhere in the hierarchy. It only concerns itself with the energy consumption of all memories. This property makes it very robust to chip variance and allows all possible resources to be utilized, which is necessary for future near-threshold voltage designs. Initial results, obtained on a future many-core simulator targeting the Traleika Glacier architecture, give initial estimates of memory reads and writes to all parts of the chip as well as relative energy consumption.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: the plasma and magma projects. J. Phys.: Conf. Ser. 180(1), 012037 (2009)
Baboulin, M., Donfack, S., Dongarra, J., Grigori, L., Rémy, A., Tomov, S.: A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines. Procedia Comput. Sci. 9, 17–26 (2012). Proceedings of the International Conference on Computational Science, ICCS 2012
Bader, M., Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417(23), 301–313 (2006). Special Issue in Honor of Friedrich Ludwig Bauer
Ballard, G., Demmel, J., Lipshitz, B., Schwartz, O., Toledo, S.: Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout. In: SPAA 2013, Montréal, Québec, Canada. ACM (2013)
Borkar, S.: Role of interconnects in the future of computing. J. Lightwave Technol. 31(24) (2013). ISSN: 0733-8724
Carter, N.P., Agrawal, A., Borkar, S., Cledat, R., David, H., Dunning, D., Fryman, J.B., Ganev, I., Golliver, R.A., Knauerhase, R.C., et al.: Runnemede: an architecture for ubiquitous high-performance computing. In: HPCA (2013)
Chatterjee, S., Lebeck, A.R., Patnala, P.K., Thottethodi, M.: Recursive array layouts and fast parallel matrix multiplication. In: SPAA, Saint Malo, France. ACM (1999)
Chen, G., Anders, M., Kaul, H., Satpathy, S., Mathew, S., Hsu, S., Agarwal, A., Krishnamurthy, R., Borkar, S., De, V.: 16.1 a 340mv-to-0.9v 20.2tb/s source-synchronous hybrid packet/circuit-switched \(16\times 16\) network-on-chip in 22nm tri-gate CMOS. In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (2014)
Chung, K.-L., Huang, Y.-L., Liu, Y.-W.: Efficient algorithms for coding Hilbert curve of arbitrary-sized image and application to window query. Inf. Sci. 177(10), 2130–2151 (2007). Including Special Issue on Hybrid Intelligent Systems
D’alberto, P., Bodrato, M., Nicolau, A.: Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation. ACM Trans. Math. Softw. 38(1) (2011)
Demmel, J.: Communication-avoiding algorithms for linear algebra and beyond. In: IPDPS 2013 (2013)
Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O., Spillinger, O.: Communication-optimal parallel recursive rectangular matrix multiplication. In: IPDPS (2013)
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, Washington, DC, USA. IEEE Computer Society (1999)
Garcia, E., Orozco, D., Khan, R., Venetis, I., Livingston, K., G. Gao.: A dynamic schema to increase performance in many-core architectures through Percolation operations. In: HiPC 2013, Bangalore, India. IEEE Computer Society (2013)
Hungershöfer, J., Wierum, J.-M.: On the quality of partitions based on space-filling curves. In: Sloot, P.M.A., Hoekstra, A.G., Tan, C.J.K., Dongarra, J.J. (eds.) ICCS 2002. LNCS, vol. 2331, pp. 36–45. Springer, Heidelberg (2002). doi:10.1007/3-540-47789-6_4
Intel: Strawman system architecture and evaluation (2004). http://tinyurl.com/j6xxg22. Accessed 10 July 2016
Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
Jaleel, A., Borch, E., Bhandaru, M., Steely Jr., S.C., Emer, J.: Achieving non-inclusive cache performance with inclusive caches: temporal locality aware (TLA) cache management policies. In: MICRO 2010, MICRO ’43, Washington, DC, USA. IEEE Computer Society (2010)
Juega, J., G’omez, J., Tenllado, C., Verdoolaege, S., Cohen, A., Catthoor, F.: Evaluation of state-of-the-art polyhedral tools for automatic code generation on GPUs (2012)
Leung, A., Vasilache, N., Meister, B., Baskaran, M., Wohlford, D., Bastoul, C., Lethin, R.: A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In: GPGPU-3, March 2010
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969)
Verdoolaege, S., Carlos Juega, J., Cohen, A., Ignacio Gómez, J., Tenllado, C., Catthoor, F.: Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9(4) (2013)
Whaley, R.C., Dongarra, J.J.: Automatically tuned linear algebra software. In: SuperComputing 1998, San Jose, CA. IEEE Computer Society (1998)
Zhang, J., Kamata, S., Ueshige, Y.: A pseudo-Hilbert scan algorithm for arbitrarily-sized rectangle region. In: Zheng, N., Jiang, X., Lan, X. (eds.) IWICPAS 2006. LNCS, vol. 4153, pp. 290–299. Springer, Heidelberg (2006). doi:10.1007/11821045_31
Acknowledgments
Authors would like to thank Shekhar Borkar, Joshua Fryman, Romain Cledat, Ivan Ganev, Bala Seshasayee and others on the Intel XStack team for information on memory energy ratios, use of FSim, and computing resources. This material is based upon work supported by the Department of Energy [Office of Science] under Award Number DE-SC0008717. This research is also based upon work supported by the National Science Foundation, under award XPS-1439097.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Livingston, K., Landwehr, A., Monsalve, J., Zuckerman, S., Meister, B., Gao, G.R. (2017). Energy Avoiding Matrix Multiply. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science(), vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_5
Download citation
DOI: https://doi.org/10.1007/978-3-319-52709-3_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52708-6
Online ISBN: 978-3-319-52709-3
eBook Packages: Computer ScienceComputer Science (R0)