Abstract
Identifying network misuse takes days or even weeks, and network administrators usually overlook zero-day threats until a large number of malicious users exploit them. Moreover, security applications such as anomaly detection and attack mitigation systems must monitor traffic in real time to reduce the impact of security incidents. Information processing time should therefore be as short as possible to enable an effective defense against attacks. In this paper, we present a fast preprocessing method for network traffic classification based on feature correlation and feature normalization. The proposed method couples a normalization algorithm with a feature selection algorithm. We evaluate the proposed algorithms on three different datasets with eight different machine learning classification algorithms. The proposed normalization algorithm reduces the classification error rate when compared with traditional methods. The feature selection algorithm chooses an optimized subset of features, improving accuracy by more than 11% with a 100-fold reduction in processing time when compared to traditional feature selection and feature reduction algorithms. The preprocessing method works on both batch and streaming data and is able to detect concept drift.
Notes
Features refer to the original set of attributes that describe the data. Variables refer to the input of the machine learning algorithms applied over the data. If no preprocessing method handles the original data, the set of variables and the set of features are the same.
Anonymized data can be requested by emailing the authors.
Acknowledgments
The authors would like to thank Antonio Lobato, Igor Alvarenga, and Igor Sanz for their significant contributions to obtaining the results.
Funding
This research is supported by CNPq, CAPES, FAPERJ, and FAPESP (2015/24514-9, 2015/24485-9, and 2014/50937-1).
Cite this article
Andreoni Lopez, M., Mattos, D.M.F., Duarte, O.C.M.B. et al. A fast unsupervised preprocessing method for network monitoring. Ann. Telecommun. 74, 139–155 (2019). https://doi.org/10.1007/s12243-018-0663-2
DOI: https://doi.org/10.1007/s12243-018-0663-2