Abstract
The class imbalance problem poses a difficulty for learning algorithms in pattern classification. Oversampling is one of the most widely used techniques for addressing this problem, but most oversampling methods use the sample size ratio as the imbalance criterion. This paper proposes a fuzzy representativeness difference-based oversampling technique using affinity propagation and the chromosome theory of inheritance (FRDOAC). The fuzzy representativeness difference (FRD) is adopted as a new imbalance metric that focuses on the importance of samples rather than their number. FRDOAC first finds the representative samples of each class using affinity propagation. Second, the fuzzy representativeness of every sample is calculated using the Mahalanobis distance. Finally, synthetic positive samples are generated according to the chromosome theory of inheritance until the fuzzy representativeness difference between the two classes is small. A thorough experimental study on 16 benchmark datasets shows that the proposed method outperforms other advanced imbalanced classification algorithms on various evaluation metrics.
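The three steps outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so the score `exp(-d)` (with `d` the Mahalanobis distance to the nearest representative), the per-feature crossover, the tolerance `tol`, and all function names are assumptions made for illustration. The representative samples are passed in directly rather than computed by affinity propagation.

```python
import numpy as np

def inverse_covariance(X):
    # Pseudo-inverse, so a singular covariance matrix does not raise.
    return np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))

def fuzzy_representativeness(X, representatives, cov_inv):
    # ASSUMPTION: score = exp(-d), where d is the Mahalanobis distance
    # from each sample to its nearest representative sample.
    diffs = X[:, None, :] - representatives[None, :, :]        # shape (n, k, d)
    sq = np.einsum("nkd,de,nke->nk", diffs, cov_inv, diffs)    # squared distances
    dist = np.sqrt(np.clip(sq, 0.0, None)).min(axis=1)
    return np.exp(-dist)

def crossover(parent_a, parent_b, rng):
    # ASSUMPTION: each feature of the child is inherited from one of two parents.
    mask = rng.random(parent_a.shape[0]) < 0.5
    return np.where(mask, parent_a, parent_b)

def oversample(X_pos, X_neg, reps_pos, reps_neg, tol=0.1, max_new=1000, seed=0):
    # Generate synthetic positive samples until the (assumed) FRD, the gap
    # between the summed scores of the two classes, drops to tol or below.
    rng = np.random.default_rng(seed)
    inv_pos = inverse_covariance(X_pos)
    inv_neg = inverse_covariance(X_neg)
    frd = (fuzzy_representativeness(X_neg, reps_neg, inv_neg).sum()
           - fuzzy_representativeness(X_pos, reps_pos, inv_pos).sum())
    synthetic = []
    while frd > tol and len(synthetic) < max_new:   # max_new guards termination
        i, j = rng.choice(len(X_pos), size=2, replace=False)
        child = crossover(X_pos[i], X_pos[j], rng)
        frd -= fuzzy_representativeness(child[None, :], reps_pos, inv_pos)[0]
        synthetic.append(child)
    return np.array(synthetic).reshape(-1, X_pos.shape[1]), frd
```

In this sketch a sample that coincides with a representative scores exactly 1, and scores decay toward 0 with distance, so a class with many samples far from its representatives contributes less to the sum than its size alone would suggest.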
Acknowledgements
This research was supported by the National Natural Science Foundation of China (61573266).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ren, R., Yang, Y. & Sun, L. Oversampling technique based on fuzzy representativeness difference for classifying imbalanced data. Appl Intell 50, 2465–2487 (2020). https://doi.org/10.1007/s10489-020-01644-0
DOI: https://doi.org/10.1007/s10489-020-01644-0