Abstract
Several machine learning applications, including genetics and fraud detection, suffer from incomplete label information. In such applications, a classifier can only train on positive and unlabeled (PU) examples, where the unlabeled data consist of both positive and negative examples. Despite a substantial PU learning literature, few works have considered the class imbalance setting. We therefore propose a novel two-step method that exploits anomaly detection to identify hidden positives within the unlabeled data. Our method allows the end-user to choose the anomaly detector according to preference or domain knowledge. Moreover, we introduce the Nearest-Neighbor Isolation Forest (NNIF), a novel semi-supervised anomaly detector based on the Isolation Forest. In contrast to unsupervised anomaly detectors, NNIF can utilize all available label information. Empirical analysis shows that our method, using NNIF as the anomaly detector, generally outperforms state-of-the-art PU learning methods on imbalanced data sets under different labeling mechanisms. Further experiments suggest that our two-step method is strongly robust to incorrect class prior estimates.
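To illustrate the general two-step idea described above (not the authors' NNIF detector), the following is a minimal sketch in which a plain scikit-learn Isolation Forest stands in as the anomaly detector: it is fitted on the labeled positives, scores the unlabeled pool, relabels the most positive-like unlabeled points as hidden positives, and then trains a final classifier. The function name `two_step_pu` and the fraction parameter `relabel_frac` are illustrative assumptions, not the paper's API.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def two_step_pu(X, s, relabel_frac=0.1, random_state=0):
    """Sketch of a two-step PU method with a generic anomaly detector.

    X : feature matrix, shape (n_samples, n_features)
    s : labeling indicator, 1 = labeled positive, 0 = unlabeled
    Step 1: fit the detector on the labeled positives; unlabeled points
    that look "normal" to this model resemble the positive class.
    Step 2: relabel the top-scoring unlabeled points as positives and
    train an ordinary classifier on the completed labels.
    """
    pos = X[s == 1]
    unl_idx = np.where(s == 0)[0]

    # Step 1: score the unlabeled pool against a model of the positives.
    det = IsolationForest(random_state=random_state).fit(pos)
    scores = det.score_samples(X[unl_idx])  # higher = more positive-like

    k = int(relabel_frac * len(unl_idx))
    hidden_pos = unl_idx[np.argsort(scores)[-k:]]

    # Step 2: treat recovered points as positives, the rest as negatives.
    y = np.zeros(len(X), dtype=int)
    y[s == 1] = 1
    y[hidden_pos] = 1
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf, hidden_pos
```

The choice of `relabel_frac` plays the role of the class prior estimate; the paper's experiments suggest the two-step scheme tolerates misspecification of this quantity.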
Notes
A software implementation of our two-step method is available at https://github.com/CarlosOrtegaV/PU_AnomalyDetection
Acknowledgements
We gratefully acknowledge the financial support provided by KBC Groep NV towards this research project.
Additional information
Responsible editor: Dragi Kocev.
About this article
Cite this article
Ortega Vázquez, C., vanden Broucke, S. & De Weerdt, J. A two-step anomaly detection based method for PU classification in imbalanced data sets. Data Min Knowl Disc 37, 1301–1325 (2023). https://doi.org/10.1007/s10618-023-00925-9