{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,6,20]],"date-time":"2024-06-20T17:53:51Z","timestamp":1718906031540},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"name":"NSF IUCRC","award":["I\/UCRC-1439722"]},{"name":"Hewlett Packard Enterprise"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2022,12,31]]},"abstract":"Application virtual memory footprints are growing rapidly in all systems from servers down to smartphones. To address this growing demand, system integrators are incorporating ever larger amounts of main memory, warranting a rethinking of memory management. In current systems, applications produce page fault exceptions whenever they access virtual memory regions that are not backed by a physical page. As application memory footprints grow, they induce more and more minor page faults. Handling of each minor page fault can take a few thousand CPU cycles and blocks the application until the OS kernel finds a free physical frame. These page faults can be detrimental to performance when their frequency of occurrence is high and spread across application runtime. Specifically, lazy allocation-induced minor page faults are increasingly impacting application performance. Our evaluation of several workloads indicates an overhead due to minor page faults as high as 29% of execution time.\n In this article, we propose to mitigate this problem through a hardware-software co-design approach. Specifically, we first propose to parallelize portions of the kernel page allocation to run ahead of fault time in a separate thread. Then we propose the Minor Fault Offload Engine (MFOE), a per-core hardware accelerator for minor fault handling. 
MFOE is equipped with a pre-allocated page frame table that it uses to service a page fault. On a page fault, MFOE quickly picks a pre-allocated page frame from this table, makes an entry for it in the TLB, and updates the page table entry to satisfy the page fault. The pre-allocation frame tables are periodically refreshed by a background kernel thread, which also updates the data structures in the kernel to account for the handled page faults. We evaluate this system in the gem5 architectural simulator with a modified Linux kernel running on top of simulated hardware containing the MFOE accelerator. Our results show that MFOE improves the average critical path fault handling latency by 33\u00d7 and tail critical path latency by 51\u00d7. Among the evaluated applications, we observed an improvement of runtime by an average of 6.6%.","DOI":"10.1145\/3547142","type":"journal-article","created":{"date-parts":[[2022,7,11]],"date-time":"2022-07-11T11:26:22Z","timestamp":1657538782000},"page":"1-26","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Reducing Minor Page Fault Overheads through Enhanced Page Walker"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"http:\/\/orcid.org\/0000-0001-8333-7896","authenticated-orcid":false,"given":"Chandrahas","family":"Tirumalasetty","sequence":"first","affiliation":[{"name":"Department of Electrical & Computer Engineering, TAMU, College Station, TX"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-3094-6951","authenticated-orcid":false,"given":"Chih Chieh","family":"Chou","sequence":"additional","affiliation":[{"name":"Amazon Lab 126, Sunnyvale, USA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-4625-8819","authenticated-orcid":false,"given":"Narasimha","family":"Reddy","sequence":"additional","affiliation":[{"name":"Department of Electrical & Computer Engineering, TAMU, College Station, 
TX"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-7120-7189","authenticated-orcid":false,"given":"Paul","family":"Gratz","sequence":"additional","affiliation":[{"name":"Department of Electrical & Computer Engineering, TAMU, USA and Department of Computer Science & Engineering, TAMU, USA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-4618-2618","authenticated-orcid":false,"given":"Ayman","family":"Abouelwafa","sequence":"additional","affiliation":[{"name":"Hewlett Packard Enterprise, San Jose, CA"}]}],"member":"320","published-online":{"date-parts":[[2022,9,16]]},"reference":[{"key":"e_1_3_2_2_2","volume-title":"Chapter 6 Physical Page Allocation","unstructured":"[n. d.]. Chapter 6 Physical Page Allocation. https:\/\/www.kernel.org\/doc\/gorman\/html\/understand\/understand009.html."},{"key":"e_1_3_2_3_2","volume-title":"Lockless Ring Buffer Design","unstructured":"[n. d.]. Lockless Ring Buffer Design. https:\/\/www.kernel.org\/doc\/Documentation\/trace\/ring-buffer-design.txt."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378468"},{"key":"e_1_3_2_5_2","article-title":"Revisiting hardware-assisted page walks for virtualized systems","author":"Ahn J.","year":"2012","unstructured":"J. Ahn, S. Jin, and J. Huh. 2012. Revisiting hardware-assisted page walks for virtualized systems. In ACM ISCA Conference.","journal-title":"ACM ISCA Conference"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080209"},{"key":"e_1_3_2_7_2","first-page":"27","volume-title":"2017 USENIX Annual Technical Conference (USENIX ATC\u201917)","author":"Amit Nadav","year":"2017","unstructured":"Nadav Amit. 2017. Optimizing the TLB shootdown algorithm with page access tracking. In 2017 USENIX Annual Technical Conference (USENIX ATC\u201917). USENIX Association, Santa Clara, CA, 27\u201339. 
https:\/\/www.usenix.org\/conference\/atc17\/technical-sessions\/presentation\/amit."},{"key":"e_1_3_2_8_2","unstructured":"Scott Beamer, Krste Asanovi\u0107, and David Patterson. 2015. The GAP Benchmark Suite. arXiv:1508.03619 [cs.DC]"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/1346281.1346286"},{"key":"e_1_3_2_10_2","article-title":"Translation-triggered prefetching","author":"Bhattacharjee Abhishek","year":"2017","unstructured":"Abhishek Bhattacharjee. 2017. Translation-triggered prefetching. In ACM ASPLOS.","journal-title":"ACM ASPLOS"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454128"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_3_2_13_2","article-title":"vNVML: An efficient shared library for virtualizing and sharing non-volatile memories","author":"Chou C.-C.","year":"2019","unstructured":"C.-C. Chou, J. Jung, N. Reddy, P. Gratz, and D. Voight. 2019. vNVML: An efficient shared library for virtualizing and sharing non-volatile memories. In IEEE Mass Storage Symposium.","journal-title":"IEEE Mass Storage Symposium"},{"key":"e_1_3_2_14_2","article-title":"Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook","author":"Cao Z.","year":"2020","unstructured":"Z. Cao, S. Dong, S. Vemuri, and D. Du. 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. USENIX FAST Conference.","journal-title":"USENIX FAST Conference"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/1807128.1807152"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3132402.3132409"},{"key":"e_1_3_2_17_2","article-title":"Efficient memory virtualization: Reducing dimensionality of nested page walks","author":"Gandhi Jayneel","year":"2014","unstructured":"Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. 
Efficient memory virtualization: Reducing dimensionality of nested page walks. In IEEE MICRO Conference.","journal-title":"IEEE MICRO Conference."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2016.10"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507762"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00047"},{"key":"e_1_3_2_21_2","unstructured":"H. Jin and M. A. Frumkin. 2000. The OpenMP implementation of NAS parallel benchmarks and its performance."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132770"},{"key":"e_1_3_2_23_2","article-title":"Coordinated and efficient huge page management with ingens","author":"Kwon Y.","year":"2016","unstructured":"Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel. 2016. Coordinated and efficient huge page management with ingens. In USENIX OSDI.","journal-title":"USENIX OSDI"},{"key":"e_1_3_2_24_2","article-title":"A case for hardware-based demand paging","author":"Lee Gyusun","year":"2020","unstructured":"Gyusun Lee, Wenjing Jin, Wonsuk Song, Jeonghun Gong, Jonghyun Bae, Tae Jun Ham, Jae W. Lee, and Jinkyu Jeong. 2020. A case for hardware-based demand paging. In ACM SIGARCH Architecture Conference.","journal-title":"ACM SIGARCH Architecture Conference"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037710"},{"key":"e_1_3_2_26_2","article-title":"Breeze: User-level access to non-volatile main memories for legacy software","author":"Memaripour A.","year":"2018","unstructured":"A. Memaripour and S. Swanson. 2018. Breeze: User-level access to non-volatile main memories for legacy software. 
IEEE ICDC Conference.","journal-title":"IEEE ICDC Conference."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/844128.844138"},{"key":"e_1_3_2_28_2","article-title":"HawkEye: Efficient finegrained OS support for huge pages","author":"Panwar A.","year":"2019","unstructured":"A. Panwar, S. Bansal, and K. Gopinath. 2019. HawkEye: Efficient finegrained OS support for huge pages. In ACM ASPLOS.","journal-title":"ACM ASPLOS"},{"key":"e_1_3_2_29_2","article-title":"Perforated page: Supporting fragmented memory allocation for large pages","author":"Park C. H.","year":"2020","unstructured":"C. H. Park, S. Cha, B. Kim, Y. Kwon, D. Black-Schaffer, and J. Huh. 2020. Perforated page: Supporting fragmented memory allocation for large pages. In ACM SIGARCH Conference.","journal-title":"ACM SIGARCH Conference."},{"key":"e_1_3_2_30_2","volume-title":"Enabling Efficient and Transparent Remote Memory Access in Disaggregated Datacenters","author":"Pemberton Nathan","unstructured":"Nathan Pemberton, John D. Kubiatowicz, and Randy H. Katz. [n. d.]. Enabling Efficient and Transparent Remote Memory Access in Disaggregated Datacenters. Ph.D. Dissertation."},{"key":"e_1_3_2_31_2","article-title":"Increasing TLB reach by exploiting clustering in page translations","author":"Pham Binh","year":"2014","unstructured":"Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In IEEE HPCA Conf.","journal-title":"IEEE HPCA Conf."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358296"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378472"},{"key":"e_1_3_2_34_2","volume-title":"The Role of Reactor Physics toward a Sustainable Future (PHYSOR\u201914)","author":"Tramm John R.","year":"2014","unstructured":"John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. 2014. 
XSBench - The development and verification of a performance abstraction for Monte Carlo reactor analysis. In The Role of Reactor Physics toward a Sustainable Future (PHYSOR\u201914). https:\/\/www.mcs.anl.gov\/papers\/P5064-0114.pdf."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.5555\/1060289.1060307"},{"key":"e_1_3_2_36_2","first-page":"24","volume-title":"ISCA\u201995","author":"Woo Steven Cameron","year":"1995","unstructured":"Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In ISCA\u201995. ACM, 24\u201336."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063436"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3053277.3053279"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3547142","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,2]],"date-time":"2023-01-02T08:20:20Z","timestamp":1672647620000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3547142"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,16]]},"references-count":37,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,12,31]]}},"alternative-id":["10.1145\/3547142"],"URL":"https:\/\/doi.org\/10.1145\/3547142","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,9,16]]},"assertion":[{"value":"2021-12-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2022-06-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-09-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}