Sampling Approaches for Imbalanced Data Classification Problem in Machine Learning

  • Conference paper
Proceedings of ICRIC 2019

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 597)


Abstract

Real-world datasets in many domains, such as medicine, intrusion detection, fraudulent transactions and bioinformatics, are highly imbalanced. In classification problems, imbalanced datasets negatively affect the accuracy of class predictions. This skew can be handled either by oversampling minority class examples or by undersampling the majority class. In this work, popular methods from both categories are evaluated for their ability to improve the imbalance ratio of five highly imbalanced datasets from different application domains. The effect of balancing on classification results is also investigated. The adaptive synthetic oversampling approach was observed to best improve the imbalance ratio as well as the classification results. However, undersampling approaches gave better overall performance on all datasets.
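
The two families of methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce the adaptive synthetic approach; it only shows plain random oversampling and random undersampling on a toy dataset. The function names, the toy data, and the definition of the imbalance ratio as majority count divided by minority count are assumptions made for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: size of the largest class / size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


# Hypothetical toy data: 95 majority examples (label 0), 5 minority examples (label 1).
rows = [[i] for i in range(100)]
labels = [0] * 95 + [1] * 5

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)

print(imbalance_ratio(labels))        # ratio before balancing
print(len(over_rows), imbalance_ratio(over_labels))
print(len(under_rows), imbalance_ratio(under_labels))
```

Both functions bring the ratio to 1, but at different costs: oversampling grows the dataset by repeating minority rows, while undersampling discards majority rows. Synthetic methods such as the adaptive approach mentioned in the abstract generate new minority points instead of repeating existing ones.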

Author information

Corresponding author

Correspondence to Shivani Tyagi.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Tyagi, S., Mittal, S. (2020). Sampling Approaches for Imbalanced Data Classification Problem in Machine Learning. In: Singh, P., Kar, A., Singh, Y., Kolekar, M., Tanwar, S. (eds) Proceedings of ICRIC 2019. Lecture Notes in Electrical Engineering, vol 597. Springer, Cham. https://doi.org/10.1007/978-3-030-29407-6_17

  • DOI: https://doi.org/10.1007/978-3-030-29407-6_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29406-9

  • Online ISBN: 978-3-030-29407-6

  • eBook Packages: Engineering, Engineering (R0)
