Abstract
Computer software is now an essential part of our lives and is used in many fields. While software brings convenience, it also causes many problems. Research is needed to defend against software plagiarism, attacks that use malware or other software, and similar threats. Analysis techniques for binary executable files can be applied to investigate and defend against these problems. However, binary executable files are relatively hard to analyze without source code, because an executable retains only the information needed for execution; semantic information is discarded during compilation. In this paper, we propose a similarity calculation method for binary executable files based on function matching. Attributes are extracted from each function and used to match the functions of two binary files. The function matching process consists of three steps: function name matching, N-tuple matching, and a final n-gram-based matching. After function matching, the overall similarity is calculated from the similarities of the matched functions. Experimental results show that the accuracy of our binary-based similarity calculation method is comparable to that of MOSS, a well-known source-code-based method.
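The three matching steps and the final aggregation can be illustrated with a minimal sketch. The abstract does not specify the function attributes, the n-gram size, the threshold, or the aggregation formula, so everything below is an assumption made for illustration: the `Function` record, the use of opcode 3-grams with the Jaccard index, the 0.5 threshold, greedy pairing, and a Dice-style average for the overall score are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    name: str       # symbol name; "" if stripped (assumption)
    attrs: tuple    # hypothetical N-tuple, e.g. (blocks, calls, instructions)
    opcodes: tuple  # opcode mnemonics in address order


def ngrams(seq, n=3):
    """Return the set of contiguous n-grams of a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def ngram_similarity(f, g, n=3):
    """Jaccard index of the opcode n-gram sets of two functions."""
    a, b = ngrams(f.opcodes, n), ngrams(g.opcodes, n)
    if not a and not b:
        # Both sequences are shorter than n: fall back to exact comparison.
        return 1.0 if f.opcodes == g.opcodes else 0.0
    return len(a & b) / len(a | b)


def match_functions(funcs_a, funcs_b, threshold=0.5):
    """Pair functions in three steps; return (f, g, similarity) triples."""
    matches = []
    left, right = list(funcs_a), list(funcs_b)

    def pair_up(is_match):
        # Greedily pair each unmatched f with the first g that qualifies.
        for f in list(left):
            for g in right:
                if is_match(f, g):
                    matches.append((f, g, ngram_similarity(f, g)))
                    left.remove(f)
                    right.remove(g)
                    break

    # Step 1: identical, non-empty function names.
    pair_up(lambda f, g: f.name != "" and f.name == g.name)
    # Step 2: identical attribute tuples.
    pair_up(lambda f, g: f.attrs == g.attrs)
    # Step 3: best remaining n-gram match at or above the threshold.
    for f in list(left):
        best = max(right, key=lambda g: ngram_similarity(f, g), default=None)
        if best is not None and ngram_similarity(f, best) >= threshold:
            matches.append((f, best, ngram_similarity(f, best)))
            left.remove(f)
            right.remove(best)
    return matches


def overall_similarity(funcs_a, funcs_b, threshold=0.5):
    """Dice-style score: matched similarity mass over total function count."""
    total = len(funcs_a) + len(funcs_b)
    if total == 0:
        return 1.0
    matched = match_functions(funcs_a, funcs_b, threshold)
    return 2 * sum(s for _, _, s in matched) / total
```

Under these assumptions, two identical function lists score 1.0, and unmatched functions lower the score because they add to the denominator without contributing any similarity.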
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIP) (No. NRF-2016R1A2B4015254).
Cite this article
Kim, T., Lee, Y.R., Kang, B. et al. Binary executable file similarity calculation using function matching. J Supercomput 75, 607–622 (2019). https://doi.org/10.1007/s11227-016-1941-2
DOI: https://doi.org/10.1007/s11227-016-1941-2