Abstract
Machine learning mechanisms for network intrusion detection systems lack accurate evaluation, comparison, and deployment due to the scarcity of well-constructed datasets. In this paper, we present a statistical analysis of the features contained in four widely used security datasets. We conclude that the analyzed datasets should not be used as benchmarks for creating novel anomaly-based mechanisms for intrusion detection systems. The analyzed datasets introduce a biased classification because their features are over-correlated and most of the features can, on their own, completely distinguish normal flows from attack flows. Our proposed methodology analyzes the correlation among features instead of checking for redundant values or data imbalance. The results align with the performance of three machine learning techniques. We show that biased classification occurs due to a significant difference between attack and normal data. The synthetically generated features are statistically different between the normal and attack classes, which implies overfitting in the machine learning approaches.
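The two symptoms named in the abstract, over-correlated features and single features that fully separate the classes, can be illustrated with a minimal sketch. This is not the authors' methodology: the abstract does not state which statistics were used, so the Pearson coefficient, the 0.9 threshold, the range-overlap check, and the feature names and values below are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant feature: correlation undefined, treated as 0 here
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose |Pearson r| meets the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]


def separating_features(features, labels):
    """Return features whose value ranges do not overlap between the two classes."""
    result = []
    for name in sorted(features):
        normal = [v for v, lab in zip(features[name], labels) if lab == 0]
        attack = [v for v, lab in zip(features[name], labels) if lab == 1]
        if max(normal) < min(attack) or max(attack) < min(normal):
            result.append(name)
    return result


# Toy flows with hypothetical feature names and values: 0 = normal, 1 = attack.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "duration": [1.0, 1.2, 0.9, 9.8, 10.1, 10.4],
    "total_bytes": [100, 120, 90, 980, 1010, 1040],
    "dst_port_entropy": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],
}

pairs = correlated_pairs(features)
separable = separating_features(features, labels)
print(pairs)      # feature pairs flagged as over-correlated
print(separable)  # features that alone split normal from attack
```

In this toy data, a classifier could reach perfect accuracy by thresholding a single feature, which is the kind of dataset-induced bias the abstract warns about.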
Notes
Available at https://www.unb.ca/cic/datasets/nsl.html.
Available at https://tcpreplay.appneta.com/.
Available at https://github.com/DanielArndt/flowtbag.
Funding
This research was made possible by funding from CNPq, CAPES, FAPERJ, FAPESP (2018/23062-5), RNP, and the Niterói City Hall (PDPA PMN/UFF/FEC).
Cite this article
Silva, J.V.V., de Oliveira, N.R., Medeiros, D.S.V. et al. A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms. Ann. Telecommun. 77, 555–571 (2022). https://doi.org/10.1007/s12243-021-00904-5
DOI: https://doi.org/10.1007/s12243-021-00904-5