RACOG and wRACOG: Two Probabilistic Oversampling Techniques

Barnan Das et al. IEEE Trans Knowl Data Eng. 2015 Jan 1;27(1):222-234. doi: 10.1109/TKDE.2014.2324567. Epub 2014 May 16.

Abstract

As machine learning techniques mature and are used to tackle complex scientific problems, challenges such as the imbalanced class distribution problem arise, where one of the target class labels is under-represented in comparison with the other classes. Existing oversampling approaches to this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not well represented, which leads to high misclassification error. We introduce two Gibbs sampling-based oversampling approaches, RACOG and wRACOG, that synthetically generate and strategically select new minority class samples. The Gibbs sampler uses the joint probability distribution of the data attributes to generate new minority class samples in the form of a Markov chain. While RACOG selects samples from the Markov chain based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach on five UCI datasets that were carefully modified to exhibit class imbalance and on one new application-domain dataset with inherently extreme class imbalance. In addition, we compare the classification performance of the proposed methods with that of three existing resampling techniques.

Keywords: Gibbs sampling; Imbalanced class distribution; Markov chain Monte Carlo (MCMC); oversampling.
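
For intuition, the following sketch (a minimal Python rendering, not the authors' implementation) outlines the two selection strategies the abstract describes: a Gibbs sampler sweeps over the minority-class attributes, RACOG keeps every lag-th state of the resulting Markov chain after a burn-in, and wRACOG keeps states that the current wrapper classifier is likely to misclassify. The conditionals functions stand in for the paper's Chow-Liu-tree factorization, and the predict_proba interface (as in scikit-learn) stands in for the wrapper model; function names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_chain(x0, conditionals, n_steps):
        # Resample each discrete attribute in turn from its full
        # conditional; each full sweep yields one state of the chain.
        x = np.array(x0)
        for _ in range(n_steps):
            for i in range(len(x)):
                p = conditionals[i](x)          # P(x_i | rest of x)
                x[i] = rng.choice(len(p), p=p)
            yield x.copy()

    def racog_select(chain, burn_in, lag):
        # RACOG-style selection: discard a burn-in, then keep every
        # lag-th state to reduce autocorrelation between samples.
        return list(chain)[burn_in::lag]

    def wracog_select(chain, model, threshold=0.5):
        # wRACOG-style selection: keep states the current model would
        # misclassify as majority (class 0 here); the paper also
        # retrains the wrapper between rounds, omitted in this sketch.
        keep = []
        for x in chain:
            p_majority = model.predict_proba(x.reshape(1, -1))[0, 0]
            if p_majority >= threshold:
                keep.append(x)
        return keep

With a Chow-Liu tree, each full conditional reduces to a function of the attribute's tree neighbors, keeping the cost of a Gibbs sweep linear in the number of attributes.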


Figures

Fig. 1. A Markov chain
Fig. 2. Gibbs sampling
Fig. 3. Chow-Liu dependence tree construction
Fig. 4. Trees for the (left) abalone and (right) car datasets
Fig. 5. The RACOG algorithm
Fig. 6. The wRACOG algorithm
Fig. 7. Sensitivity for the C4.5 decision tree
Fig. 8. G-mean for the C4.5 decision tree
Fig. 9. ROC curves produced by the C4.5 decision tree
Fig. 10. i-stat values for RACOG and wRACOG
Fig. 11. Comparison of the total number of iterations required by different methods to achieve a given i-stat
Fig. 12. Comparison of log(number of instances added) by different methods
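
Fig. 3 refers to Chow-Liu dependence tree construction, the step that approximates the joint distribution of the minority class: compute mutual information between every pair of attributes, then keep a maximum spanning tree over those scores. A minimal sketch under those assumptions (empirical mutual information plus Kruskal's algorithm with a union-find; names are illustrative):

    import numpy as np
    from itertools import combinations

    def mutual_information(a, b):
        # Empirical mutual information between two discrete columns.
        a, b = np.asarray(a), np.asarray(b)
        mi = 0.0
        for x in np.unique(a):
            for y in np.unique(b):
                pxy = np.mean((a == x) & (b == y))
                if pxy > 0:
                    px, py = np.mean(a == x), np.mean(b == y)
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    def chow_liu_edges(data):
        # Maximum spanning tree over pairwise mutual information.
        d = data.shape[1]
        scored = sorted(
            ((mutual_information(data[:, i], data[:, j]), i, j)
             for i, j in combinations(range(d), 2)),
            reverse=True)
        parent = list(range(d))

        def find(u):
            while parent[u] != u:
                parent[u] = parent[parent[u]]
                u = parent[u]
            return u

        edges = []
        for _, i, j in scored:
            ri, rj = find(i), find(j)
            if ri != rj:                 # joining two components
                parent[ri] = rj
                edges.append((i, j))
        return edges

Rooting the resulting tree at an arbitrary attribute and directing edges away from the root yields the product of conditionals P(x_i | x_parent(i)) from which the Gibbs sampler draws.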
