Minimizing Classification Errors in Imbalanced Dataset Using Means of Sampling

  • Conference paper
Advances in Visual Informatics (IVIC 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13051)

Abstract

Classification, a significant application of machine learning, assigns each instance of a dataset to one of a set of predefined classes. Problems occur when the number of instances per class is not uniform. An exceptionally uneven class distribution gives rise to the class imbalance problem, which tends to degrade the overall performance of the classifier. A set of data-level algorithms is available to adjust the class distribution. Class imbalance emerges frequently in datasets from educational domains, where the number of students with unsatisfactory performance is generally low compared to the number of students with satisfactory outcomes. This paper applies a set of data-level sampling algorithms to a dataset taken from an educational domain and underlines the consequences of classification with an imbalanced dataset. The research confirms that a classification model that achieves high accuracy may still be ineffective at correctly identifying instances of the minority class. Classification with an imbalanced dataset may produce low recall, precision and F-Measure for classes with fewer instances. The performance of the classification model improves when data-level algorithms are applied; however, the results highlight the superiority of the oversampling algorithm over the undersampling algorithms.
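The gap between accuracy and minority-class recall described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 95/5 "pass"/"fail" toy labels, the helper names, and the use of plain random oversampling and undersampling as stand-ins for data-level algorithms. It does not reproduce the dataset, the classifiers, or the specific sampling algorithms evaluated in the paper.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen instances until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 95 satisfactory ("pass") and 5 unsatisfactory ("fail").
labels = ["pass"] * 95 + ["fail"] * 5
rows = [[i] for i in range(100)]

# A trivial model that always predicts the majority class.
majority_pred = ["pass"] * 100
acc = accuracy(labels, majority_pred)
fail_metrics = per_class_metrics(labels, majority_pred, "fail")

# Class counts after each resampling strategy.
over_counts = Counter(random_oversample(rows, labels)[1])
under_counts = Counter(random_undersample(rows, labels)[1])

print("accuracy:", acc)
print("fail precision/recall/F-measure:", fail_metrics)
print("after oversampling:", dict(over_counts))
print("after undersampling:", dict(under_counts))
```

In this toy setting the always-"pass" model scores high accuracy while its recall, precision and F-measure for the "fail" class are all zero. Oversampling balances the classes without discarding any majority instances, whereas undersampling balances them by throwing most of the majority class away.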

Author information

Corresponding author

Correspondence to Ijaz Khan.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Khan, I., Ahmad, A.R., Jabeur, N., Mahdi, M.N. (2021). Minimizing Classification Errors in Imbalanced Dataset Using Means of Sampling. In: Badioze Zaman, H., et al. Advances in Visual Informatics. IVIC 2021. Lecture Notes in Computer Science, vol 13051. Springer, Cham. https://doi.org/10.1007/978-3-030-90235-3_38

  • DOI: https://doi.org/10.1007/978-3-030-90235-3_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90234-6

  • Online ISBN: 978-3-030-90235-3

  • eBook Packages: Computer Science, Computer Science (R0)
