RACOG and wRACOG: Two Probabilistic Oversampling Techniques

Barnan Das et al. IEEE Trans Knowl Data Eng. 2015 Jan 1;27(1):222-234. doi: 10.1109/TKDE.2014.2324567. Epub 2014 May 16.

Abstract

As machine learning techniques mature and are used to tackle complex scientific problems, challenges such as the imbalanced class distribution problem arise, where one of the target class labels is under-represented in comparison with the other classes. Existing oversampling approaches to this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not well represented, which leads to high misclassification error. We introduce two Gibbs sampling-based oversampling approaches, RACOG and wRACOG, that synthetically generate and strategically select new minority class samples. The Gibbs sampler uses the joint probability distribution of the data attributes to generate new minority class samples in the form of a Markov chain. While RACOG selects samples from the Markov chain based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach on five UCI datasets that were carefully modified to exhibit class imbalance and on one new application-domain dataset with inherently extreme class imbalance. In addition, we compare the classification performance of the proposed methods with that of three existing resampling techniques.

Keywords: Gibbs sampling; Imbalanced class distribution; Markov chain Monte Carlo (MCMC); oversampling.
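
For intuition, the following sketch (a minimal Python rendering, not the authors' implementation) outlines the two selection strategies the abstract describes: a Gibbs sampler sweeps over the minority-class attributes, RACOG keeps every lag-th state of the resulting Markov chain after a burn-in, and wRACOG keeps states that the current wrapper classifier is likely to misclassify. The conditionals functions stand in for the paper's Chow-Liu-tree factorization, and the predict_proba interface (as in scikit-learn) stands in for the wrapper model; function names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_chain(x0, conditionals, n_steps):
        # Resample each discrete attribute in turn from its full
        # conditional; each full sweep yields one state of the chain.
        x = np.array(x0)
        for _ in range(n_steps):
            for i in range(len(x)):
                p = conditionals[i](x)          # P(x_i | rest of x)
                x[i] = rng.choice(len(p), p=p)
            yield x.copy()

    def racog_select(chain, burn_in, lag):
        # RACOG-style selection: discard a burn-in, then keep every
        # lag-th state to reduce autocorrelation between samples.
        return list(chain)[burn_in::lag]

    def wracog_select(chain, model, threshold=0.5):
        # wRACOG-style selection: keep states the current model would
        # misclassify as majority (class 0 here); the paper also
        # retrains the wrapper between rounds, omitted in this sketch.
        keep = []
        for x in chain:
            p_majority = model.predict_proba(x.reshape(1, -1))[0, 0]
            if p_majority >= threshold:
                keep.append(x)
        return keep

With a Chow-Liu tree, each full conditional reduces to a function of the attribute's tree neighbors, keeping the cost of a Gibbs sweep linear in the number of attributes.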


Figures

Fig. 1. A Markov chain
Fig. 2. Gibbs sampling
Fig. 3. Chow-Liu dependence tree construction
Fig. 4. Trees for the (left) abalone and (right) car datasets
Fig. 5. The RACOG algorithm
Fig. 6. The wRACOG algorithm
Fig. 7. Sensitivity for the C4.5 decision tree
Fig. 8. G-mean for the C4.5 decision tree
Fig. 9. ROC curves produced by the C4.5 decision tree
Fig. 10. i-stat values for RACOG and wRACOG
Fig. 11. Comparison of the total number of iterations required by different methods to achieve a given i-stat
Fig. 12. Comparison of log(number of instances added) by different methods
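
Fig. 3 refers to Chow-Liu dependence tree construction, the step that approximates the joint distribution of the minority class: compute mutual information between every pair of attributes, then keep a maximum spanning tree over those scores. A minimal sketch under those assumptions (empirical mutual information plus Kruskal's algorithm with a union-find; names are illustrative):

    import numpy as np
    from itertools import combinations

    def mutual_information(a, b):
        # Empirical mutual information between two discrete columns.
        a, b = np.asarray(a), np.asarray(b)
        mi = 0.0
        for x in np.unique(a):
            for y in np.unique(b):
                pxy = np.mean((a == x) & (b == y))
                if pxy > 0:
                    px, py = np.mean(a == x), np.mean(b == y)
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    def chow_liu_edges(data):
        # Maximum spanning tree over pairwise mutual information.
        d = data.shape[1]
        scored = sorted(
            ((mutual_information(data[:, i], data[:, j]), i, j)
             for i, j in combinations(range(d), 2)),
            reverse=True)
        parent = list(range(d))

        def find(u):
            while parent[u] != u:
                parent[u] = parent[parent[u]]
                u = parent[u]
            return u

        edges = []
        for _, i, j in scored:
            ri, rj = find(i), find(j)
            if ri != rj:                 # joining two components
                parent[ri] = rj
                edges.append((i, j))
        return edges

Rooting the resulting tree at an arbitrary attribute and directing edges away from the root yields the product of conditionals P(x_i | x_parent(i)) from which the Gibbs sampler draws.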
