Abstract
Class imbalance is a well-known challenge in machine learning. It occurs when one class greatly outnumbers the other in terms of the number of observations. Because of this imbalance, conventional classifiers often fail to classify the minority class correctly. The challenge becomes even more severe when the classes in imbalanced data also overlap. Although the literature offers methods that deal with class imbalance and class overlap sequentially, these methods are quite complex and not very efficient. In this paper, we propose an overlap-sensitive artificial neural network that handles class overlap and class imbalance simultaneously, along with noisy and outlying observations. The strength of this method lies in identifying the overlapping observations rather than the overlapping region, and in not using multiple classifiers, unlike other existing methods. The key idea of the proposed method is to weight the observations according to their locations in the feature space before training the neural network. The performance of the proposed method is evaluated on 12 simulated data sets and 23 real-life data sets and compared with other well-known methods. The results clearly indicate the strength and ability of the proposed method across a wide range of imbalance ratios and levels of overlap. It is also shown that the proposed method is statistically superior to the other methods in terms of different performance measures.
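The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of location-based observation weights, not the authors' method. It assumes a k-nearest-neighbour agreement score combined with an inverse class frequency factor; the function names `overlap_weights` and `weighted_log_loss`, the parameter `k`, and the weighting rule itself are all hypothetical choices made for illustration.

```python
import numpy as np


def overlap_weights(X, y, k=5):
    """Hypothetical weights: share of same-class points among the k nearest
    neighbours, scaled by inverse class frequency (an assumed rule)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    # Pairwise squared Euclidean distances; a point is not its own neighbour.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    # Fraction of neighbours sharing the label: close to 1 in clean regions,
    # close to 0 for overlapping, noisy, or outlying observations.
    agreement = (y[nn] == y[:, None]).mean(axis=1)
    # Inverse class frequency, so the minority class is not drowned out.
    classes, counts = np.unique(y, return_counts=True)
    balance = dict(zip(classes, n / (len(classes) * counts)))
    class_factor = np.array([balance[label] for label in y])
    return agreement * class_factor


def weighted_log_loss(p, y, w, eps=1e-12):
    """Weighted binary cross-entropy for predicted probabilities p of class 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float((w * losses).sum() / w.sum())
```

In this sketch the weights are computed once, before training, and would then multiply the per-observation loss of any neural network. A minority observation buried inside the majority cluster receives weight zero, while clean minority observations are weighted up by the class factor.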





Additional information
Responsible editor: Pierre Baldi.
About this article
Cite this article
Shahee, S.A., Ananthakumar, U. An overlap sensitive neural network for class imbalanced data. Data Min Knowl Disc 35, 1654–1687 (2021). https://doi.org/10.1007/s10618-021-00766-4
DOI: https://doi.org/10.1007/s10618-021-00766-4