Abstract
Imbalanced data classification remains a research hotspot and a challenging problem in machine learning. The difficulty of imbalanced learning lies not only in class imbalance but also in class overlapping, which is itself complex. However, most existing algorithms focus mainly on the former, and this limitation holds back their performance. To address it, this paper proposes an ensemble algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS) that handles both class imbalance and class overlapping. DCSHS has three main parts: a projection clustering combination framework (PCC), stage-wise hybrid sampling (SHS), and an envelope clustering transfer mapping mechanism (CTM). PCC creates multiple subsets through projective clustering. SHS identifies the overlapping region of each subset and conducts hybrid sampling. CTM extracts more information from the samples in each subset by combining clustering and transfer learning. First, we design a PCC framework guided by the Davies-Bouldin clustering effectiveness index (DBI), which is used to obtain high-quality clusters and combine them into a set of cross-complete subsets (CCS) with low overlap. Second, according to the class characteristics of each subset, an SHS algorithm is designed to de-overlap and balance the subsets. Finally, a CTM is constructed for all processed subsets by means of transfer learning, thereby reducing class overlapping and exploring the structural information of the samples. Weak classifiers are trained on the balanced subsets and fused, as in other imbalanced ensemble algorithms.
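To make the DBI-guided step concrete, the following is a minimal sketch of how a Davies-Bouldin score can be used to choose among candidate clusterings. It is not the authors' PCC implementation: the plain k-means routine, the farthest-point initialisation, the helper names (`centroid`, `kmeans`, `davies_bouldin`, `best_k`) and the toy data are all assumptions made for illustration, and the projection and subset-combination steps are not covered.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation.

    Illustrative only: assumes distinct points; empty clusters are dropped.
    """
    cents = [points[0]]
    while len(cents) < k:
        # next seed: the point farthest from its nearest existing seed
        cents.append(max(points, key=lambda p: min(math.dist(p, c) for c in cents)))
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, cents[i]))
            clusters[idx].append(p)
        cents = [centroid(c) if c else cents[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition; lower means tighter, better-separated clusters."""
    cents = [centroid(c) for c in clusters]
    # scatter: mean distance of each cluster's members to its centroid
    scatter = [sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k

def best_k(points, candidates=(2, 3, 4)):
    """Score each candidate cluster count by DBI and return the lowest-scoring one."""
    scores = {k: davies_bouldin(kmeans(points, k)) for k in candidates}
    return min(scores, key=scores.get), scores

# Hypothetical toy data: three well-separated groups in 2D.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
chosen_k, dbi_scores = best_k(points)
print(chosen_k, dbi_scores)
```

On this toy data the three-cluster partition receives the lowest DBI, so it would be the one kept. How DCSHS projects the data and combines the resulting clusters into CCS is described in the full paper, not in this sketch.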
The major advantage of our algorithm is that it can exploit the intersectionality of the CCS to realize the soft elimination of overlapping majority samples and learn as much information from overlapping samples as possible, thereby mitigating class overlapping while balancing the classes. In the experimental section, more than 30 public datasets and over ten representative algorithms are used for verification. The experimental results show that DCSHS performs significantly better than the compared algorithms in terms of anti-overlapping, Recall, F1-M, G-M, AUC, and diversity.
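As a reminder of what some of the reported measures capture, here is a small sketch that computes Recall, F1 and G-mean from binary predictions. It assumes that "F1-M" denotes the F1-measure and "G-M" the geometric mean of sensitivity and specificity, and that the minority class is the positive class. The abstract does not define these, so both are assumptions, as are the function name `imbalance_metrics` and the example labels.

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """Recall, precision, F1 and G-mean for a binary problem.

    Assumes `positive` marks the minority class (an assumption of this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity on the minority class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # accuracy on the majority class
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "precision": precision, "f1": f1, "g_mean": g_mean}

# Hypothetical example: 4 minority and 16 majority samples.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 1, 1, 0] + [1, 1] + [0] * 14
m = imbalance_metrics(y_true, y_pred)
print(m)
```

Unlike plain accuracy, G-mean drops to zero when either class is entirely misclassified, which is why it is commonly reported alongside Recall and F1 for imbalanced data.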
Data availability
The data and code are available at https://pan.baidu.com/s/1M0N39gEIc4bK2qwg9EYTMQ (extraction code: 1111).
Acknowledgements
We are grateful for the support of the National Natural Science Foundation of China (NSFC; Nos. U21A20448 and 61771080); the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0100, cstc2020jscx-gksb0010, cstc2020jscx-msxm0369); the Basic and Advanced Research Project in Chongqing (cstc2020jscx-fyzx0212, cstc2020jscx-msxm0369, cstc2020jcyj-msxmX0523); the Chongqing Social Science Planning Project (2018YBYY133); and the Special Project of Improving Scientific and Technological Innovation Ability of the Army Medical University (2019XLC3055).
Ethics declarations
Consent for publication
Not applicable.
Conflict of interest
None.
Appendix
See Table 7.
About this article
Cite this article
Li, F., Wang, B., Wang, P. et al. An imbalanced ensemble learning method based on dual clustering and stage-wise hybrid sampling. Appl Intell 53, 21167–21191 (2023). https://doi.org/10.1007/s10489-023-04650-0
DOI: https://doi.org/10.1007/s10489-023-04650-0