Abstract
The surge in malware poses a serious threat to cyberspace security. Existing machine-learning-based malware analysis methods rely mainly on feature engineering: they must extract many handcrafted features from malware to improve accuracy, which increases the complexity of the analysis. To address this problem, this paper proposes GEMAL, a new malware analysis method based on the function call graph (FCG) and a graph embedding network. The FCG captures the structural information of a binary file and has been used in a variety of malware analysis studies. Inspired by natural language processing, we treat instructions as words and functions as sentences, so that semantic features can be extracted automatically with natural language processing methods. An attention-based graph embedding network then combines the structural and semantic features to generate embedding vectors of malware for automatic and efficient malware analysis. We evaluate GEMAL on two datasets. The first is a self-created dataset named WUFCG, which contains 70,188 real-world samples. The second is the public dataset of the Microsoft Malware Classification Challenge, which contains 10,868 samples. Experimental results show that GEMAL detects real-world malware with 99.16% accuracy and classifies malware families with a best accuracy of 99.81%.
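The pipeline outlined in the abstract can be illustrated with a small, purely hypothetical sketch. It is not the GEMAL architecture: the hash-based `token_vector` stands in for a trained word-embedding model, mean pooling stands in for whatever sentence encoder the paper uses, and the dot-product attention round is only one simple way to mix a function's vector with those of its callees. All function names, the toy instructions, and the parameters `DIM` and `rounds` are assumptions made for illustration.

```python
import hashlib
import math

DIM = 8  # embedding size (illustrative assumption, not from the paper)


def token_vector(token, dim=DIM):
    """Deterministic pseudo-embedding for one instruction ("word").

    Stand-in for a trained word-embedding model; values lie in [-0.5, 0.5].
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:dim]]


def function_vector(instructions):
    """Embed a function ("sentence") as the mean of its instruction vectors."""
    vecs = [token_vector(ins) for ins in instructions]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_round(node_vecs, edges):
    """Update every node as an attention-weighted mix of itself and its callees."""
    updated = {}
    for node, vec in node_vecs.items():
        neighbours = [node] + sorted(edges.get(node, []))
        weights = softmax([dot(vec, node_vecs[n]) for n in neighbours])
        updated[node] = [
            sum(w * node_vecs[n][i] for w, n in zip(weights, neighbours))
            for i in range(len(vec))
        ]
    return updated


def graph_embedding(functions, edges, rounds=2):
    """Produce one vector for the whole function call graph (mean readout)."""
    node_vecs = {name: function_vector(body) for name, body in functions.items()}
    for _ in range(rounds):
        node_vecs = attention_round(node_vecs, edges)
    return [sum(col) / len(node_vecs) for col in zip(*node_vecs.values())]


# Toy FCG: each function maps to its instruction list; edges are caller -> callees.
functions = {
    "main": ["push ebp", "mov ebp, esp", "call decrypt", "call connect"],
    "decrypt": ["xor eax, ecx", "inc ecx", "jnz loop"],
    "connect": ["push 0x50", "call socket", "ret"],
}
edges = {"main": ["decrypt", "connect"]}

embedding = graph_embedding(functions, edges)
print(len(embedding))  # prints 8
```

In a setup like this, the resulting fixed-length vector would be passed to an ordinary classifier for detection or family classification; the sketch stops at the embedding because the abstract does not describe the downstream classifier.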
Ethics declarations
Conflict of interest
We declare that we have no commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Cite this article
Wu, XW., Wang, Y., Fang, Y. et al. Embedding vector generation based on function call graph for effective malware detection and classification. Neural Comput & Applic 34, 8643–8656 (2022). https://doi.org/10.1007/s00521-021-06808-8
DOI: https://doi.org/10.1007/s00521-021-06808-8