{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,20]],"date-time":"2024-08-20T00:04:50Z","timestamp":1724112290307},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2021YFF0704000"],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["CCF Trans. HPC"],"published-print":{"date-parts":[[2024,8]]},"abstract":"Since specific hardware characteristics and low-level programming models are adapted to both NVIDIA GPU and the new generation Sunway architecture, automatically translating mature CUDA kernels to Sunway ATHREAD kernels is realistic but challenging work. To address this issue, swCUDA, an auto parallel code translation framework, is proposed. To that end, we create a scale affine translation to transform the CUDA thread hierarchy to the Sunway index; a directive-based memory hierarchy and data redirection optimization to assign optimal memory usage and data stride strategy; and a directive-based grouping-calculation-asynchronous-reduction (GCAR) algorithm to provide a general solution for the random access issue. 
swCUDA utilizes the code generator ANTLR as a compiler frontend to parse CUDA kernels and integrates novel algorithms in the nodes of the abstract syntax tree (AST) depending on directives. Automatic translation is performed on the entire Polybench suite and the NBody simulation benchmark. We get an average 40x speedup compared with the baseline on the Sunway architecture, an average speedup of 15x compared to an x86 CPU, and an average of 27 percent higher than the NVIDIA GPU. Further, swCUDA is applied to translate major kernels of the real-world application Gromacs. The translated version achieves up to 17x speedup.","DOI":"10.1007\/s42514-023-00159-7","type":"journal-article","created":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T08:02:14Z","timestamp":1704960134000},"page":"439-458","update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway 
supercomputer"],"prefix":"10.1007","volume":"6","author":[{"given":"Maoxue","family":"Yu","sequence":"first","affiliation":[]},{"given":"Guanghao","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Zhuoya","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Shuai","family":"Tang","sequence":"additional","affiliation":[]},{"given":"Yuhu","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Yucheng","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Yuanyuan","family":"Liu","sequence":"additional","affiliation":[]},{"ORCID":"http:\/\/orcid.org\/0000-0001-5805-4931","authenticated-orcid":false,"given":"Dongning","family":"Jia","sequence":"additional","affiliation":[]},{"given":"Zhiqiang","family":"Wei","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,1,11]]},"reference":[{"issue":"1","key":"159_CR1","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1007\/s11390-020-9826-z","volume":"36","author":"JS Chen","year":"2021","unstructured":"Chen, J.S., An, H., Han, W.T., et al.: Towards efficient short-range pair interaction on sunway many-core architecture. J. Comput. Sci. Technol. 36(1), 123\u2013139 (2021). https:\/\/doi.org\/10.1007\/s11390-020-9826-z","journal-title":"J. Comput. Sci. Technol."},{"key":"159_CR2","volume-title":"Professional CUDA C Programming","author":"J Cheng","year":"2014","unstructured":"Cheng, J., Grossman, M., McKercher, T.: Professional CUDA C Programming, 1st edn. Wrox Press Ltd (2014)","edition":"1"},{"key":"159_CR3","doi-asserted-by":"crossref","unstructured":"Chu, G., Li, Y., Zhao, R.: et\u00a0al Md simulation of hundred-billion-metal-atom cascade collision on sunway taihulight. 
ArXiv (2021) https:\/\/arxiv.org\/abs\/2107.07866","DOI":"10.1016\/j.cpc.2021.108128"},{"key":"159_CR4","doi-asserted-by":"publisher","unstructured":"Dong, W., Kang, L., Quan, Z.: et\u00a0al Implementing molecular dynamics simulation on sunway taihulight system. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications. In: IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS), pp 443\u2013450, https:\/\/doi.org\/10.1109\/HPCC-SmartCity-DSS.2016.0070 (2016)","DOI":"10.1109\/HPCC-SmartCity-DSS.2016.0070"},{"key":"159_CR5","doi-asserted-by":"publisher","unstructured":"Duan, X., Gao, P., Zhang, T.: et al. Redesigning lammps for peta-scale and hundred-billion-atom simulation on sunway taihulight. In: SC18: International Conference for High Performance Computing Networking, Storage and Analysis Doi: https:\/\/doi.org\/10.1109\/SC.2018.00015(2018)","DOI":"10.1109\/SC.2018.00015"},{"key":"159_CR6","doi-asserted-by":"publisher","first-page":"8577","DOI":"10.1063\/1.470117","volume":"103","author":"U Essmann","year":"1995","unstructured":"Essmann, U., Perera, L., Berkowitz, M., et al.: A smooth particle mesh ewald method. J. Chem. Phys. 103, 8577 (1995). https:\/\/doi.org\/10.1063\/1.470117","journal-title":"J. Chem. Phys."},{"key":"159_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s11432-016-5588-7","volume":"59","author":"H Fu","year":"2016","unstructured":"Fu, H., Liao, J., Yang, J., et al.: The sunway taihulight supercomputer: system and applications. Sci. China Informat. Sci. 59, 1\u201316 (2016). https:\/\/doi.org\/10.1007\/s11432-016-5588-7","journal-title":"Sci. China Informat. 
Sci."},{"issue":"4","key":"159_CR8","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1109\/MM.2008.57","volume":"28","author":"M Garland","year":"2008","unstructured":"Garland, M., Le Grand, S., Nickolls, J., et al.: Parallel computing experiences with cuda. IEEE Micro 28(4), 13\u201327 (2008). https:\/\/doi.org\/10.1109\/MM.2008.57","journal-title":"IEEE Micro"},{"key":"159_CR9","doi-asserted-by":"publisher","unstructured":"Grauer-Gray, S., Xu, L., Searles, R. et\u00a0al.: Auto-tuning a high-level language targeted to gpu codes. In: 2012 Innovative Parallel Computing (InPar), pp 1\u201310, https:\/\/doi.org\/10.1109\/InPar.2012.6339595 (2012)","DOI":"10.1109\/InPar.2012.6339595"},{"issue":"1","key":"159_CR10","doi-asserted-by":"publisher","first-page":"78","DOI":"10.1109\/TPDS.2010.62","volume":"22","author":"TD Han","year":"2011","unstructured":"Han, T.D., Abdelrahman, T.S.: hicuda: High-level gpgpu programming. IEEE Transact. Parall. Distribut. Syst. 22(1), 78\u201390 (2011). https:\/\/doi.org\/10.1109\/TPDS.2010.62","journal-title":"IEEE Transact. Parall. Distribut. Syst."},{"key":"159_CR11","doi-asserted-by":"publisher","DOI":"10.1021\/ct900275y","author":"M Harvey","year":"2009","unstructured":"Harvey, M., De Fabritiis, G.: An implementation of the smooth particle mesh ewald method on gpu hardware. J. Chem. Theory Comput. (2009). https:\/\/doi.org\/10.1021\/ct900275y","journal-title":"J. Chem. Theory Comput."},{"issue":"3","key":"159_CR12","doi-asserted-by":"publisher","first-page":"435","DOI":"10.1021\/ct700301q","volume":"4","author":"B Hess","year":"2008","unstructured":"Hess, B., Kutzner, C., van der Spoel, D., et al.: Gromacs 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3), 435\u2013447 (2008). https:\/\/doi.org\/10.1021\/ct700301q","journal-title":"J. Chem. 
Theory Comput."},{"key":"159_CR13","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jctc.0c00744","author":"S Jing","year":"2012","unstructured":"Jing, S., Li, X., Liu, Z., et al.: Gpu-enabled implementations of particle-mesh-ewald method. Comp. Appl. Chem. (2012). https:\/\/doi.org\/10.1021\/acs.jctc.0c00744","journal-title":"Comp. Appl. Chem."},{"key":"159_CR14","unstructured":"Kutzner, C.: Improving pme on distributed computer systems. (2008) https:\/\/www.mpinat.mpg.de\/632110\/kutzner08talk-workshop.pdf"},{"key":"159_CR15","doi-asserted-by":"publisher","first-page":"2418","DOI":"10.48550\/arXiv.1903.05918","volume":"40","author":"C Kutzner","year":"2019","unstructured":"Kutzner, C., P\u00e1ll, S., Fechner, M.: More bang for your buck Improved use of gpu nodes for gromacs 2018. J. Comput. Chem. 40, 2418\u20132431 (2019). https:\/\/doi.org\/10.48550\/arXiv.1903.05918","journal-title":"J. Comput. Chem."},{"key":"159_CR16","doi-asserted-by":"publisher","unstructured":"Lee, J., Kim, J., Seo, S et\u00a0al.: (2010) An opencl framework for heterogeneous multicores with local memory. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. Association for Computing Machinery, New York, NY, USA, PACT \u201910, p 193-204, (2010) https:\/\/doi.org\/10.1145\/1854273.1854301","DOI":"10.1145\/1854273.1854301"},{"issue":"3","key":"159_CR17","doi-asserted-by":"publisher","first-page":"228","DOI":"10.1109\/2945.620490","volume":"3","author":"S Lee","year":"1997","unstructured":"Lee, S., Wolberg, G., Shin, S.: Scattered data interpolation with multilevel b-splines. IEEE Transact. Visualizat. Comp. Graph. 3(3), 228\u2013244 (1997). https:\/\/doi.org\/10.1109\/2945.620490","journal-title":"IEEE Transact. Visualizat. Comp. Graph."},{"key":"159_CR18","doi-asserted-by":"publisher","unstructured":"Li, M., Pang, J., Yue, F. et\u00a0al.: Openmp automatic translation framework for sunway taihulight. 
In: 2021 International Conference on Communications, Information System and Computer Engineering (CISCE) (2021) Doi: https:\/\/doi.org\/10.1109\/CISCE52179.2021.9445916","DOI":"10.1109\/CISCE52179.2021.9445916"},{"key":"159_CR19","doi-asserted-by":"publisher","unstructured":"Liu, F., Ma, W., Zhao, Yea.: xmath2.0: a high-performance extended math library for sw26010-pro many-core processor. CCF Transactions on High Performance Computing pp 2524\u20134930. (2022) https:\/\/doi.org\/10.1007\/s42514-022-00126-8","DOI":"10.1007\/s42514-022-00126-8"},{"key":"159_CR20","doi-asserted-by":"publisher","unstructured":"Liu, Y., Liu, X., Li, F. et\u00a0al.: Closing the \"quantum supremacy\" gap: Achieving real-time simulation of a random quantum circuit using a new sunway supercomputer. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA, SC \u201921, (2021) https:\/\/doi.org\/10.1145\/3458817.3487399","DOI":"10.1145\/3458817.3487399"},{"key":"159_CR21","doi-asserted-by":"publisher","unstructured":"Martinez, G., Gardner, M., Feng, Wc.: Cu2cl: A cuda-to-opencl translator for multi- and many-core architectures. In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pp 300\u2013307, (2011) https:\/\/doi.org\/10.1109\/ICPADS.2011.48","DOI":"10.1109\/ICPADS.2011.48"},{"issue":"1","key":"159_CR22","doi-asserted-by":"publisher","first-page":"210","DOI":"10.1109\/TPDS.2015.2394802","volume":"27","author":"R Membarth","year":"2016","unstructured":"Membarth, R., Reiche, O., Hannig, F., et al.: Hipacc: a domain-specific language and compiler for image processing. IEEE Transact. Parall. Distribut. Syst. 27(1), 210\u2013224 (2016). https:\/\/doi.org\/10.1109\/TPDS.2015.2394802","journal-title":"IEEE Transact. Parall. Distribut. 
Syst."},{"key":"159_CR23","doi-asserted-by":"publisher","DOI":"10.1145\/3084540","author":"G Mendon\u00e7a","year":"2017","unstructured":"Mendon\u00e7a, G., Guimar\u00e3es, B.: Dawncc: Automatic annotation for data parallelism and offloading. ACM Trans. Archit. Code Optim. (2017). https:\/\/doi.org\/10.1145\/3084540","journal-title":"ACM Trans. Archit. Code Optim."},{"key":"159_CR24","unstructured":"Milakov, M.: Gpu pro tip: Fast dynamic indexing of private arrays in cuda. https:\/\/developer.nvidia.com\/blog\/fast-dynamic-indexing-private-arrays-cuda\/ (2015)"},{"key":"159_CR25","unstructured":"Nvidia, C.: Gpu-accelerated applications. (2018) https:\/\/www.nvidia.cn\/content\/gpu-applications\/PDF\/gpu-applications-catalog.pdf"},{"key":"159_CR26","unstructured":"Nvidia, C.: Nvidia v100 tensor core gpu. (2020) https:\/\/images.nvidia.cn\/content\/technologies\/volta\/pdf\/volta-v100-datasheet-update-us-1165301-r5.pdf"},{"key":"159_CR27","unstructured":"Nvidia, C.: Cuda c++ programming guide. (2023) https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},{"key":"159_CR28","unstructured":"Parr, T.: The definitive antlr 4 reference. The Definitive ANTLR 4 Reference pp 1\u2013326 (2013)"},{"issue":"6","key":"159_CR29","doi-asserted-by":"publisher","first-page":"425","DOI":"10.1145\/1993316.1993548","volume":"46","author":"T Parr","year":"2011","unstructured":"Parr, T., Fisher, K.: Ll(*): The foundation of the antlr parser generator. SIGPLAN Not 46(6), 425\u2013436 (2011). https:\/\/doi.org\/10.1145\/1993316.1993548","journal-title":"SIGPLAN Not"},{"issue":"10","key":"159_CR30","doi-asserted-by":"publisher","first-page":"579","DOI":"10.1145\/2714064.2660202","volume":"49","author":"T Parr","year":"2014","unstructured":"Parr, T., Harwell, S., Fisher, K.: Adaptive ll(*) parsing: the power of dynamic analysis. SIGPLAN Not 49(10), 579\u2013598 (2014). 
https:\/\/doi.org\/10.1145\/2714064.2660202","journal-title":"SIGPLAN Not"},{"key":"159_CR31","doi-asserted-by":"publisher","unstructured":"Shang, H., Li, F., Zhang, Y. et\u00a0al.: Extreme-scale ab initio quantum raman spectra simulations on the leadership hpc system in china. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA, SC \u201921, (2021) https:\/\/doi.org\/10.1145\/3458817.3487402","DOI":"10.1145\/3458817.3487402"},{"key":"159_CR32","unstructured":"Strohmaier, E., Dongarra, J., Simon, H. et\u00a0al.: Top 500 supercomputer lists. https:\/\/top500.org\/ (2022)"},{"issue":"1145\/2400682","key":"159_CR33","first-page":"2400713","volume":"10","author":"S Verdoolaege","year":"2013","unstructured":"Verdoolaege, S., Carlos Juega, J., Cohen, A.: Polyhedral parallel code generation for cuda. ACM Trans. Archit. Code Optim. 10(1145\/2400682), 2400713 (2013)","journal-title":"ACM Trans. Archit. Code Optim."},{"key":"159_CR34","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1007\/978-3-319-65482-9_2","volume-title":"Algorithms and Architectures for Parallel Processing","author":"Y Yu","year":"2017","unstructured":"Yu, Y., An, H., Chen, J. et al.: Pipelining computation and optimization strategies for scaling gromacs on the sunway many-core processor. In: Ibrahim, S., Choo, K.K.R., Yan, Z., et al. (eds.) Algorithms and Architectures for Parallel Processing, pp. 18\u201332. Springer International Publishing, Cham (2017)"},{"issue":"1","key":"159_CR35","first-page":"9","volume":"42","author":"L Zeng","year":"2021","unstructured":"Zeng, L., Zheng, W., Hong, A.: Porting and optimizing pme algorithm on sunway taihulight system. J. Chin. Comp. Syst. 42(1), 9 (2021)","journal-title":"J. Chin. Comp. 
Syst."},{"issue":"1","key":"159_CR36","first-page":"9","volume":"42","author":"LIN Zeng","year":"2021","unstructured":"Zeng, L.I.N., Zheng, A.H.W.U., Jun-shi, C.: Porting and optimizing pme algorithm on sunway taihulight system. J. Chin. Comp. Syst. 42(1), 9 (2021)","journal-title":"J. Chin. Comp. Syst."},{"key":"159_CR37","doi-asserted-by":"publisher","unstructured":"Zhang, T., Li, Y., Gao, P. et\u00a0al.: Sw_gromacs: Accelerate gromacs on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA, SC \u201919, (2019) https:\/\/doi.org\/10.1145\/3295500.3356190","DOI":"10.1145\/3295500.3356190"},{"key":"159_CR38","doi-asserted-by":"publisher","unstructured":"Zhu, Q., Luo, H., Yang, C. et\u00a0al.: Enabling and scaling the hpcg benchmark on the newest generation sunway supercomputer with 42 million heterogeneous cores. In: SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1\u201313, (2021) https:\/\/doi.org\/10.1145\/3458817.3476158","DOI":"10.1145\/3458817.3476158"}],"container-title":["CCF Transactions on High Performance 
Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-023-00159-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42514-023-00159-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-023-00159-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,19]],"date-time":"2024-08-19T13:05:52Z","timestamp":1724072752000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42514-023-00159-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,11]]},"references-count":38,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["159"],"URL":"https:\/\/doi.org\/10.1007\/s42514-023-00159-7","relation":{},"ISSN":["2524-4922","2524-4930"],"issn-type":[{"type":"print","value":"2524-4922"},{"type":"electronic","value":"2524-4930"}],"subject":[],"published":{"date-parts":[[2024,1,11]]},"assertion":[{"value":"21 March 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 June 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 January 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the corresponding author states that there is no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}