Abstract
Classification, a significant application of machine learning, assigns each instance of a dataset to one of a set of predefined classes. Problems arise when the number of instances per class is not uniform. A highly uneven class distribution gives rise to the class imbalance problem, which tends to degrade the overall performance of the classifier. A set of data-level algorithms is available for adjusting the class distribution. Class imbalance emerges frequently in datasets from educational domains, where students with unsatisfactory performance generally appear in lower numbers than students with satisfactory outcomes. This paper applies a set of data-level sampling algorithms to a dataset taken from an educational domain and underlines the consequences of classification with an imbalanced dataset. The research confirms that a classification model achieving high accuracy may still be ineffective at correctly identifying instances of the minority class. Classification with an imbalanced dataset may produce low recall, precision and F-Measure for classes with fewer instances. The performance of the classification model improves when data-level algorithms are applied. Moreover, the results highlight the superiority of oversampling algorithms over undersampling algorithms.
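The gap between accuracy and minority-class metrics described in the abstract can be illustrated with a minimal sketch. It rests on assumptions throughout: the 95/5 "pass"/"fail" split is hypothetical toy data, not the paper's dataset, and plain random oversampling and undersampling stand in for the data-level algorithms, which the abstract does not name. The helper names `minority_metrics`, `random_oversample` and `random_undersample` are invented for illustration.

```python
import random
from collections import Counter


def minority_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F-measure for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all match the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 95 satisfactory ("pass") and 5 unsatisfactory ("fail") students.
labels = ["pass"] * 95 + ["fail"] * 5
rows = [[i] for i in range(100)]

# A degenerate classifier that always predicts the majority class.
always_pass = ["pass"] * len(labels)
metrics = minority_metrics(labels, always_pass, positive="fail")
print(metrics)  # accuracy is 0.95, yet recall, precision and F-measure for "fail" are 0.0

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

In this sketch oversampling keeps all 100 original rows and adds 90 duplicated minority rows, while undersampling discards 90 majority rows and leaves only 10. That loss of majority-class information is one commonly cited reason undersampling can underperform, although the sketch itself does not reproduce the paper's experimental comparison.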
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Khan, I., Ahmad, A.R., Jabeur, N., Mahdi, M.N. (2021). Minimizing Classification Errors in Imbalanced Dataset Using Means of Sampling. In: Badioze Zaman, H., et al. Advances in Visual Informatics. IVIC 2021. Lecture Notes in Computer Science, vol 13051. Springer, Cham. https://doi.org/10.1007/978-3-030-90235-3_38
DOI: https://doi.org/10.1007/978-3-030-90235-3_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90234-6
Online ISBN: 978-3-030-90235-3
eBook Packages: Computer Science (R0)