An intra-class distribution-focused generative adversarial network approach for imbalanced tabular data learning

Chen, Qiuling; Ye, Ayong; Zhang, Yuexin; Chen, Jianwei; Huang, Chuan

doi:10.1007/s13042-023-02048-5

Original Article
Published: 03 January 2024

Volume 15, pages 2551–2572 (2024)

International Journal of Machine Learning and Cybernetics

Qiuling Chen^1,2,
Ayong Ye ORCID: orcid.org/0000-0002-2606-5406^1,2,
Yuexin Zhang^1,2,
Jianwei Chen^1,2 &
…
Chuan Huang^1,2

Abstract

Data imbalance is a critical factor that adversely affects the performance of machine learning algorithms. It shifts decision boundaries, which biases predictions towards the majority class and causes the minority class to be misclassified. Although oversampling the minority class with deep generative models is a popular strategy, many existing methods focus solely on enhancing minority-class data and overlook the distribution relationships within and between classes. We therefore propose an oversampling method that combines unsupervised clustering with a generative adversarial network (GAN) to facilitate learning from imbalanced tabular data. First, we cluster the original data as a preprocessing step, remove clusters that do not require sampling, and generate more samples for sparsely distributed minority-class clusters so that samples are balanced within the minority class. Second, we design a CTGAN-based auxiliary classifier GAN (ACCTGAN) to generate minority-class samples. It enhances the semantic integrity of the synthetic data and avoids generating noisy samples. We conducted validation experiments comparing our approach with 7 typical methods on 12 real tabular datasets. Our method performs strongly on F1-measure and area under the curve (AUC), achieving the best result in 19 and 20 cases, respectively, across three classifiers. It significantly improves classification results and demonstrates good robustness and stability.
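The preprocessing idea in the abstract (drop clusters that do not need sampling, then give sparse minority clusters a larger share of the synthetic samples) can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in the abstract. The `Cluster` fields, the `min_minority_ratio` filter, and the inverse-density weighting are all assumptions made for illustration; cluster assignments are assumed to come from some earlier clustering step, and the generator that would actually produce the samples is omitted.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Summary of one cluster (all fields are hypothetical illustration inputs)."""
    n_minority: int  # minority-class samples in the cluster
    n_majority: int  # majority-class samples in the cluster
    density: float   # assumed minority density measure, must be > 0


def allocate_synthetic(clusters, n_to_generate, min_minority_ratio=0.5):
    """Split a budget of synthetic minority samples across clusters.

    Assumed rule: skip clusters whose minority share is below
    min_minority_ratio, then weight the remaining clusters by
    1 / density so that sparser clusters receive more samples.
    Returns a dict mapping cluster index to sample count.
    """
    kept = [
        i for i, c in enumerate(clusters)
        if c.n_minority > 0
        and c.n_minority / (c.n_minority + c.n_majority) >= min_minority_ratio
    ]
    if not kept:
        return {}

    weights = {i: 1.0 / clusters[i].density for i in kept}
    total = sum(weights.values())

    # Largest-remainder rounding so the counts sum exactly to the budget.
    raw = {i: n_to_generate * w / total for i, w in weights.items()}
    alloc = {i: int(r) for i, r in raw.items()}
    leftover = n_to_generate - sum(alloc.values())
    by_remainder = sorted(kept, key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    clusters = [
        Cluster(n_minority=10, n_majority=90, density=5.0),  # majority-dominated
        Cluster(n_minority=20, n_majority=5, density=4.0),   # dense minority cluster
        Cluster(n_minority=5, n_majority=1, density=1.0),    # sparse minority cluster
    ]
    print(allocate_synthetic(clusters, 100))  # {1: 20, 2: 80}
```

In this toy example the majority-dominated cluster is skipped, and the sparse minority cluster receives four times as many synthetic samples as the dense one because its assumed density is four times lower.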

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China [61972096, 61771140, 61872088, 61872090, 61902289] and the University-Industry Cooperation of Fujian Province [2022H6025].

Funding

Not Applicable

Author information

Authors and Affiliations

College of Computer and Cyber Security, Fujian Normal University, Fuzhou, 350007, China
Qiuling Chen, Ayong Ye, Yuexin Zhang, Jianwei Chen & Chuan Huang
Fujian Provincial Key Laboratory of Network Security and Cryptology, Fuzhou, 350007, China
Qiuling Chen, Ayong Ye, Yuexin Zhang, Jianwei Chen & Chuan Huang

Contributions

Qiuling Chen: Methodology, Formal analysis, and Writing-Original Draft; Ayong Ye: Conceptualization, Resources, Writing-Review & Editing and Funding acquisition; Yuexin Zhang: Writing-Reviewing and Editing; Jianwei Chen: Writing-Reviewing and Editing; Chuan Huang: Supervision.

Corresponding author

Correspondence to Ayong Ye.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of the article.

Ethical approval

Not Applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Q., Ye, A., Zhang, Y. et al. An intra-class distribution-focused generative adversarial network approach for imbalanced tabular data learning. Int. J. Mach. Learn. & Cyber. 15, 2551–2572 (2024). https://doi.org/10.1007/s13042-023-02048-5

Received: 18 July 2023
Accepted: 13 November 2023
Published: 03 January 2024
Issue Date: July 2024
DOI: https://doi.org/10.1007/s13042-023-02048-5
