{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:48:48Z","timestamp":1740185328407,"version":"3.37.3"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2024,7,12]]},"abstract":"Descriptive function names give reverse engineers valuable insights, but they are absent from publicly released binaries. Recent advances in binary function name prediction using data-driven machine learning show promise. However, existing approaches have difficulty capturing function semantics across diversely optimized binaries and fail to preserve the meaning of labels in function names. We propose Epitome, a framework that enhances function name prediction using votes-based name tokenization and multi-task learning, specifically tailored to binaries compiled at different optimization levels. Epitome learns comprehensive function semantics with a pre-trained assembly language model and a graph neural network, and incorporates a function semantics similarity prediction task to maximize the similarity of function semantics across compilation optimization levels. In addition, we present two data preprocessing methods to improve the comprehensibility of function names. We evaluate Epitome on 2,597,346 functions extracted from binaries compiled with 5 optimization levels (O0-Os) for 4 architectures (x64, x86, ARM, and MIPS). 
Epitome outperforms the state-of-the-art function name prediction tool by up to 44.34%, 64.16%, and 54.44% in precision, recall, and F1 score, while also exhibiting superior generalizability.","DOI":"10.1145\/3660782","type":"journal-article","created":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T14:22:09Z","timestamp":1720794129000},"page":"1679-1702","source":"Crossref","is-referenced-by-count":0,"title":["Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-task Learning"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-0906-1817","authenticated-orcid":false,"given":"Xiaoling","family":"Zhang","sequence":"first","affiliation":[{"name":"Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China \/ University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8390-7518","authenticated-orcid":false,"given":"Zhengzi","family":"Xu","sequence":"additional","affiliation":[{"name":"Nanyang Technological University, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4385-8261","authenticated-orcid":false,"given":"Shouguo","family":"Yang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China \/ University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7071-2976","authenticated-orcid":false,"given":"Zhi","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China \/ University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6168-8003","authenticated-orcid":false,"given":"Zhiqiang","family":"Shi","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China \/ University of Chinese Academy of 
Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2745-7521","authenticated-orcid":false,"given":"Limin","family":"Sun","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China \/ University of Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,7,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359591.3359735"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1808.01400"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1912.07946"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPC.2017.27"},{"key":"e_1_2_1_5_1","unstructured":"Kent Beck. 2007. Implementation patterns. Pearson Education."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/S10579-010-9124-X"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_2_1_8_1","unstructured":"Buildroot. 2023. Buildroot: making embedded linux easy. https:\/\/buildroot.org"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPC.2017.2"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3548606.3559367"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2019.2940439"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/52.43044"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3428293"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2022.102846"},{"key":"e_1_2_1_15_1","unstructured":"Anderson Derek and Randal Scott. 2023. Word ninja. 
https:\/\/github.com\/keredson\/wordninja"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2019.00003"},{"key":"e_1_2_1_18_1","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Du Yufei","year":"2022","unstructured":"Yufei Du, Kevin Snow, and Fabian Monrose. 2022. Automatic Recovery of Fine-grained Compiler Artifacts at the Binary Level. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA. 853\u2013868. isbn:978-1-939133-29-49 https:\/\/www.usenix.org\/conference\/atc22\/presentation\/du"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2020.2976920"},{"key":"e_1_2_1_20_1","volume-title":"29th USENIX Security Symposium (USENIX Security 20)","author":"Feng Bo","year":"2020","unstructured":"Bo Feng, Alejandro Mera, and Long Lu. 2020. P2IM: Scalable and Hardware-independent Firmware Testing via Automatic Peripheral Interface Modeling. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1237\u20131254. isbn:978-1-939133-17-5 https:\/\/www.usenix.org\/conference\/usenixsecurity20\/presentation\/feng"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1811.01824"},{"key":"e_1_2_1_22_1","unstructured":"Free Software Foundation. 2023. Coreutils - GNU core utilities. https:\/\/www.gnu.org\/software\/coreutils\/"},{"key":"e_1_2_1_23_1","unstructured":"Free Software Foundation. 2023. GNU Binutils. https:\/\/www.gnu.org\/software\/binutils\/"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3460319.3464804"},{"key":"e_1_2_1_25_1","unstructured":"GCC. 2023. Options That Control Optimization. https:\/\/gcc.gnu.org\/onlinedocs\/gcc\/Optimize-Options.html"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","unstructured":"Jiatao Gu Zhengdong Lu Hang Li and Victor OK Li. 2016. 
Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393 https:\/\/doi.org\/10.48550\/arXiv.1603.06393 10.48550\/arXiv.1603.06393","DOI":"10.48550\/arXiv.1603.06393"},{"key":"e_1_2_1_27_1","volume-title":"Mar","author":"Guyon Isabelle","year":"2003","unstructured":"Isabelle Guyon and Andr\u00e9 Elisseeff. 2003. An introduction to variable and feature selection. Journal of machine learning research, 3, Mar (2003), 1157\u20131182."},{"key":"e_1_2_1_28_1","unstructured":"Antti Haapala. 2023. Python- Levenshtein. https:\/\/github.com\/ztane\/python-Levenshtein"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3243734.3243866"},{"key":"e_1_2_1_30_1","unstructured":"Hex-Rays. 2023. IDA Pro. https:\/\/hex-rays.com\/ida-pro\/"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2902362"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03013-0_14"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3196321.3196334"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","unstructured":"Hamel Husain Ho-Hsiang Wu Tiferet Gazit Miltiadis Allamanis and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 https:\/\/doi.org\/10.48550\/arXiv.1909.09436 10.48550\/arXiv.1909.09436","DOI":"10.48550\/arXiv.1909.09436"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-0716-1418-1"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3548606.3560612"},{"key":"e_1_2_1_39_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. 
In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). arxiv:1412.6980"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","unstructured":"Anton Kolonin and Vignav Ramesh. 2022. Unsupervised Tokenization Learning. 3649\u20133664. https:\/\/doi.org\/10.18653\/V1\/2022.EMNLP-MAIN.239 10.18653\/V1\/2022.EMNLP-MAIN.239","DOI":"10.18653\/V1"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-57868-4_57"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2019.00064"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387904.3389268"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2019.00087"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","unstructured":"Mike Lewis Yinhan Liu Naman Goyal Marjan Ghazvininejad Abdelrahman Mohamed Omer Levy Veselin Stoyanov and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation Translation and Comprehension. 7871\u20137880. https:\/\/doi.org\/10.18653\/V1\/2020.ACL-MAIN.703 10.18653\/V1\/2020.ACL-MAIN.703","DOI":"10.18653\/V1"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3460120.3484587"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00060"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-30164-8_306"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133908"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","unstructured":"Tomas Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. 
arXiv preprint arXiv:1301.3781 https:\/\/doi.org\/10.48550\/arXiv.1301.3781 10.48550\/arXiv.1301.3781","DOI":"10.48550\/arXiv.1301.3781"},{"volume-title":"Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML\u201910)","author":"Nair Vinod","key":"e_1_2_1_51_1","unstructured":"Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML\u201910). Omnipress, Madison, WI, USA. 807\u2013814. isbn:9781605589077"},{"key":"e_1_2_1_52_1","unstructured":"OpenAI. 2023. GPT-4. https:\/\/platform.openai.com\/docs\/models\/gpt-4"},{"key":"e_1_2_1_53_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, and Luca Antiga. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32 (2019)."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3427228.3427265"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3231621"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11431-020-1647-3"},{"key":"e_1_2_1_57_1","first-page":"2","article-title":"Gensim\u2013python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno","volume":"3","author":"Rehurek Radim","year":"2011","unstructured":"Radim Rehurek and Petr Sojka. 2011. Gensim\u2013python framework for vector space modelling. 
NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3, 2 (2011), 2.","journal-title":"Czech Republic"},{"key":"e_1_2_1_58_1","unstructured":"Wind River. 2023. VxWorks: The Leading RTOS for the Intelligent Edge. https:\/\/www.windriver.com\/products\/vxworks"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1025667309714"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3104029"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_2_1_62_1","unstructured":"CEA IT Security. 2023. Miasm. https:\/\/github.com\/cea-sec\/miasm"},{"key":"e_1_2_1_63_1","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association","author":"Seeha Suteera","year":"2020","unstructured":"Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, and Hinrich Sch\u00fctze. 2020. ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France. 6947\u20136957. isbn:979-10-95546-34-4 https:\/\/aclanthology.org\/2020.lrec-1.858"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-2836(81)90087-5"},{"key":"e_1_2_1_65_1","unstructured":"Synopsys. 2022. Synopsys 2022 open source security and risk analysis report. https:\/\/www.synopsys.com\/software-integrity\/resources\/analyst-reports\/open-source-security-risk-analysis.html"},{"key":"e_1_2_1_66_1","unstructured":"Princeton University. 2023. WordNet A Lexical Database for English. https:\/\/wordnet.princeton.edu\/"},{"key":"e_1_2_1_67_1","volume-title":"29th USENIX Security Symposium (USENIX Security 20)","author":"Votipka Daniel","year":"2020","unstructured":"Daniel Votipka, Seth Rabin, Kristopher Micinski, Jeffrey S. 
Foster, and Michelle L. Mazurek. 2020. An Observational Investigation of Reverse Engineers\u2019 Processes. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1875\u20131892. isbn:978-1-939133-17-5 https:\/\/www.usenix.org\/conference\/usenixsecurity20\/presentation\/votipka-observational"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2018.00003"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2978386"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","unstructured":"Xiangzhe Xu Zhuo Zhang Shiwei Feng Yapeng Ye Zian Su Nan Jiang Siyuan Cheng Lin Tan and Xiangyu Zhang. 2023. LmPa: Improving Decompilation by Synergy of Large Language Model and Program Analysis. arXiv preprint arXiv:2306.02546 https:\/\/doi.org\/10.48550\/ARXIV.2306.02546 10.48550\/ARXIV.2306.02546","DOI":"10.48550\/ARXIV.2306.02546"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2016.18"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compeleceng.2021.107354"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN48987.2021.00036"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i01.5466"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/SMC52423.2021.9658619"}],"container-title":["Proceedings of the ACM on Software 
Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3660782","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,19]],"date-time":"2024-08-19T18:30:02Z","timestamp":1724092202000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3660782"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,12]]},"references-count":76,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2024,7,12]]}},"alternative-id":["10.1145\/3660782"],"URL":"https:\/\/doi.org\/10.1145\/3660782","relation":{},"ISSN":["2994-970X"],"issn-type":[{"type":"electronic","value":"2994-970X"}],"subject":[],"published":{"date-parts":[[2024,7,12]]}}}