Abstract
This paper explores memory subsystem design through gem5 simulations of a non-uniform memory access (NUMA) architecture in which ARM cores equipped with vector engines are connected to a Network-on-Chip (NoC) that follows the Coherent Hub Interface (CHI) protocol. The study quantifies the benefits of vectorization, prefetching, and multichannel NoC configurations using a benchmark that generates memory patterns and indexed accesses. The outcomes provide insights into improving bus utilization and bandwidth and into reducing stalls in the system. The paper proposes hardware/software (HW/SW) advancements that raise utilization of the HBM device above 80% at the memory controllers in the simulated manycore system.
Acknowledgment
This work has been performed in the context of the European Processor Initiative (EPI) project, which has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement №101036168 (EPI-SGA2).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Portero, A. et al. (2023). COMPESCE: A Co-design Approach for Memory Subsystem Performance Analysis in HPC Many-Cores. In: Goumas, G., Tomforde, S., Brehm, J., Wildermann, S., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2023. Lecture Notes in Computer Science, vol 13949. Springer, Cham. https://doi.org/10.1007/978-3-031-42785-5_8
DOI: https://doi.org/10.1007/978-3-031-42785-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42784-8
Online ISBN: 978-3-031-42785-5
eBook Packages: Computer Science, Computer Science (R0)