Energy Avoiding Matrix Multiply

Livingston, Kelly; Landwehr, Aaron; Monsalve, José; Zuckerman, Stéphane; Meister, Benoît; Gao, Guang R.

doi:10.1007/978-3-319-52709-3_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10136)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

As multi- and many-core chips steadily increase their core counts, we observe a phenomenon we call memory hierarchy capacity per capita inversion. To overcome this inversion while remaining energy-efficient, we present a dynamic tiling scheme, which we apply to the classic matrix multiply algorithm. The tiling scheme follows a Hilbert-inspired curve strategy to minimize data-movement energy while still allowing for slack and variance in the computation and memory usage of a chip. Our algorithm is energy-conscious: it uses a machine model that does not require symmetric memory (in size or addressing) anywhere in the hierarchy and concerns itself only with the energy consumption of all memories. This property makes it very robust to chip variance and allows all possible resources to be utilized, which is necessary for future near-threshold voltage designs. Initial results, obtained on a future many-core simulator targeting the Traleika Glacier architecture, give estimates of memory reads and writes to all parts of the chip as well as relative energy consumption.
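
To make the idea of a curve-ordered tiling concrete, the following is a minimal sketch, not the chapter's algorithm. It assumes a static, square grid of tiles whose side is a power of two and uses the textbook Hilbert index-to-coordinate conversion; the dynamic scheme, the machine model, and the energy accounting described above are not modeled. The names `hilbert_d2xy` and `hilbert_tiled_matmul` and the parameters `tile` and `n_tiles` are hypothetical. Because consecutive tiles on a Hilbert curve are grid neighbors, each step of the traversal shares either a tile row or a tile column with the previous step, so one operand panel (a row panel of A or a column panel of B) is common to both steps.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n-by-n grid.

    n is assumed to be a power of two; this is the standard
    iterative conversion, not a routine taken from the chapter.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_tiled_matmul(A, B, tile, n_tiles):
    """Compute C = A * B, visiting the tiles of C in Hilbert order.

    A and B are square lists of lists with side tile * n_tiles
    (an assumption of this sketch; n_tiles must be a power of two).
    """
    size = tile * n_tiles
    C = [[0] * size for _ in range(size)]
    for d in range(n_tiles * n_tiles):
        ti, tj = hilbert_d2xy(n_tiles, d)  # tile of C to update next
        for tk in range(n_tiles):          # sweep the shared dimension
            for i in range(ti * tile, (ti + 1) * tile):
                for k in range(tk * tile, (tk + 1) * tile):
                    a = A[i][k]
                    for j in range(tj * tile, (tj + 1) * tile):
                        C[i][j] += a * B[k][j]
    return C


if __name__ == "__main__":
    # Show the order in which the 4x4 grid of C tiles would be visited.
    print([hilbert_d2xy(4, d) for d in range(16)])
```

The traversal changes only the order in which tiles of C are produced, so the result is the same as an ordinary triple loop; any benefit would come from how a real memory hierarchy retains the shared panel between neighboring tiles, which this sketch does not measure.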

Acknowledgments

The authors would like to thank Shekhar Borkar, Joshua Fryman, Romain Cledat, Ivan Ganev, Bala Seshasayee, and others on the Intel XStack team for information on memory energy ratios, use of FSim, and computing resources. This material is based upon work supported by the Department of Energy [Office of Science] under Award Number DE-SC0008717. This research is also based upon work supported by the National Science Foundation under award XPS-1439097.

Author information

Authors and Affiliations

Computer Architecture and Parallel Systems Laboratory, Electrical and Computer Engineering Department, University of Delaware, Newark, Delaware, USA
Kelly Livingston, Aaron Landwehr, José Monsalve, Stéphane Zuckerman & Guang R. Gao
Reservoir Labs, 632 Broadway, New York, New York, 10012, USA
Benoît Meister

Corresponding author

Correspondence to Kelly Livingston.

Editor information

Editors and Affiliations

University of Rochester, Rochester, New York, USA
Chen Ding
University of Rochester, Rochester, New York, USA
John Criswell
Huawei Inc., Santa Clara, California, USA
Peng Wu

About this paper

Cite this paper

Livingston, K., Landwehr, A., Monsalve, J., Zuckerman, S., Meister, B., Gao, G.R. (2017). Energy Avoiding Matrix Multiply. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science, vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_5

DOI: https://doi.org/10.1007/978-3-319-52709-3_5
Published: 24 January 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52708-6
Online ISBN: 978-3-319-52709-3
eBook Packages: Computer Science, Computer Science (R0)
