Embedding vector generation based on function call graph for effective malware detection and classification | Neural Computing and Applications


  • Original Article
Neural Computing and Applications

Abstract

The surge of malware poses a serious threat to cyberspace security. Existing machine-learning-based malware analysis methods rely mainly on feature engineering: they must extract many handcrafted features from the malware to improve accuracy, which increases the complexity of the analysis. To address this problem, this paper proposes GEMAL, a new malware analysis method based on the function call graph (FCG) and a graph embedding network. An FCG captures the structural information of a binary file and has been used in a variety of malware analysis studies. Inspired by natural language processing (NLP) tasks, we treat instructions as words and functions as sentences, so that semantic features can be extracted automatically with NLP methods. An attention-based graph embedding network then combines the structural and semantic features to generate embedding vectors of malware for automatic and efficient malware analysis. We evaluate GEMAL on two datasets. One is a self-created dataset named WUFCG, which contains 70,188 real-world samples; the other is the public dataset of the Microsoft Malware Classification Challenge, which contains 10,868 samples. Experimental results show that GEMAL detects real-world malware with 99.16% accuracy and classifies malware families with a best accuracy of 99.81%.
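The pipeline described in the abstract (instructions as words, functions as sentences, attention over the call graph) can be illustrated with a toy sketch. This is not GEMAL's architecture, which the abstract does not specify in detail. Every name here (`token_vector`, `function_vector`, `attention_step`, `graph_embedding`, `DIM`) is hypothetical, the hash-based token vectors stand in for a learned word embedding, and the dot-product attention with mean pooling is an assumed, untrained simplification.

```python
import hashlib
import math

DIM = 8  # toy embedding size (assumption, not from the paper)


def token_vector(token, dim=DIM):
    """Deterministic stand-in for a learned instruction ("word") embedding."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Map each of the first `dim` bytes (0..255) to a value in [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:dim]]


def function_vector(instructions, dim=DIM):
    """Treat a function as a sentence: average its instruction vectors."""
    if not instructions:
        return [0.0] * dim
    vecs = [token_vector(tok, dim) for tok in instructions]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(node_vecs, edges):
    """One round of attention-weighted aggregation over each node's callees."""
    updated = {}
    for name, vec in node_vecs.items():
        callees = [c for c in edges.get(name, []) if c in node_vecs]
        agg = [0.0] * len(vec)
        if callees:
            # Attention score = dot product between caller and callee vectors.
            scores = [sum(a * b for a, b in zip(vec, node_vecs[c])) for c in callees]
            weights = softmax(scores)
            for w, c in zip(weights, callees):
                agg = [a + w * x for a, x in zip(agg, node_vecs[c])]
        # Combine the node's own (semantic) vector with its neighbourhood.
        updated[name] = [math.tanh(v + a) for v, a in zip(vec, agg)]
    return updated


def graph_embedding(functions, edges, rounds=2, dim=DIM):
    """Embed a whole function call graph as the mean of its node vectors."""
    node_vecs = {name: function_vector(ins, dim) for name, ins in functions.items()}
    if not node_vecs:
        return [0.0] * dim
    for _ in range(rounds):
        node_vecs = attention_step(node_vecs, edges)
    n = len(node_vecs)
    return [sum(v[i] for v in node_vecs.values()) / n for i in range(dim)]


# Hypothetical three-function binary: main calls decrypt and send.
functions = {
    "main": ["push", "mov", "call", "call", "ret"],
    "decrypt": ["xor", "mov", "loop", "ret"],
    "send": ["push", "call", "ret"],
}
edges = {"main": ["decrypt", "send"], "decrypt": [], "send": []}

emb = graph_embedding(functions, edges)
print(len(emb), emb)
```

In a trained system the token vectors and attention weights would be learned from data, and the resulting graph-level vector would feed a detector or family classifier; the sketch only shows how semantic (instruction-level) and structural (call-edge) information can end up in one fixed-length vector.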





Author information

Corresponding author

Correspondence to Peng Jia.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wu, XW., Wang, Y., Fang, Y. et al. Embedding vector generation based on function call graph for effective malware detection and classification. Neural Comput & Applic 34, 8643–8656 (2022). https://doi.org/10.1007/s00521-021-06808-8


  • DOI: https://doi.org/10.1007/s00521-021-06808-8
