Abstract
The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA undermines the effectiveness of replacement policies: banks operate independently of each other, so each replacement decision is restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.
Notes
The experimental methodology is described in Sect. 3.
Cite this article
Lira, J., Molina, C., Rakvic, R.N. et al. Replacement techniques for dynamic NUCA cache designs on CMPs. J Supercomput 64, 548–579 (2013). https://doi.org/10.1007/s11227-012-0859-6
DOI: https://doi.org/10.1007/s11227-012-0859-6