
A two-step anomaly detection based method for PU classification in imbalanced data sets

Published in: Data Mining and Knowledge Discovery (2023)

Abstract

Several machine learning applications, including genetics and fraud detection, suffer from incomplete label information. In such applications, a classifier can only be trained from positive and unlabeled (PU) examples, where the unlabeled data contain both positive and negative examples. Although PU learning has a substantial presence in the literature, few works have considered a class-imbalanced setting. Hence, we propose a novel two-step method that exploits anomaly detection to identify hidden positives within the unlabeled data. Our method lets the end user choose the anomaly detector according to preference or domain knowledge. Moreover, we introduce the Nearest-Neighbor Isolation Forest (NNIF), a novel semi-supervised anomaly detector based on the Isolation Forest. In contrast to unsupervised anomaly detectors, NNIF can exploit all available label information. Empirical analysis shows that, with NNIF as the anomaly detector, our method generally outperforms state-of-the-art PU learning methods for imbalanced data sets under different labeling mechanisms. Further experiments suggest that our two-step method is robust to incorrect class prior estimates.


[Figures 1–7 not shown]


Notes

  1. A software implementation of our two-step method is available at https://github.com/CarlosOrtegaV/PU_AnomalyDetection


Acknowledgements

We gratefully acknowledge the financial support provided by KBC Groep NV towards this research project.

Author information


Corresponding author

Correspondence to Carlos Ortega Vázquez.

Additional information

Responsible editor: Dragi Kocev.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ortega Vázquez, C., vanden Broucke, S. & De Weerdt, J. A two-step anomaly detection based method for PU classification in imbalanced data sets. Data Min Knowl Disc 37, 1301–1325 (2023). https://doi.org/10.1007/s10618-023-00925-9

