Abstract
Real-world datasets in many domains, such as medicine, intrusion detection, fraudulent transactions and bioinformatics, are highly imbalanced. In classification problems, imbalanced datasets negatively affect the accuracy of class predictions. This skew can be handled either by oversampling minority-class examples or by undersampling the majority class. In this work, popular methods from both categories are evaluated for their ability to improve the imbalance ratio of five highly imbalanced datasets from different application domains. The effect of balancing on classification results is also investigated. The adaptive synthetic oversampling approach was observed to improve the imbalance ratio and the classification results the most. However, undersampling approaches gave better overall performance on all datasets.
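The two balancing strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's experimental code: it uses plain random oversampling and random undersampling as the simplest stand-ins for the methods the chapter evaluates, and it assumes the imbalance ratio is defined as the majority-class count divided by the minority-class count. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: largest class count divided by smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): 95 majority examples (label 0), 5 minority examples (label 1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)

print(imbalance_ratio(labels))        # ratio before balancing
print(imbalance_ratio(over_labels))   # ratio after oversampling
print(imbalance_ratio(under_labels))  # ratio after undersampling
```

Oversampling keeps every original example but grows the dataset, while undersampling shrinks it and discards majority-class examples; synthetic methods such as the adaptive approach mentioned above generate new minority points rather than duplicating existing ones.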
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tyagi, S., Mittal, S. (2020). Sampling Approaches for Imbalanced Data Classification Problem in Machine Learning. In: Singh, P., Kar, A., Singh, Y., Kolekar, M., Tanwar, S. (eds) Proceedings of ICRIC 2019. Lecture Notes in Electrical Engineering, vol 597. Springer, Cham. https://doi.org/10.1007/978-3-030-29407-6_17
DOI: https://doi.org/10.1007/978-3-030-29407-6_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29406-9
Online ISBN: 978-3-030-29407-6
eBook Packages: Engineering, Engineering (R0)