Abstract
Several machine learning applications, including genetics and fraud detection, suffer from incomplete label information. In such applications, a classifier can only train on positive and unlabeled (PU) examples, where the unlabeled data consist of both positive and negative examples. Despite a substantial PU learning literature, few works have considered the class imbalance setting. We therefore propose a novel two-step method that exploits anomaly detection to identify hidden positives within the unlabeled data. Our method allows the end-user to choose the anomaly detector according to preference or domain knowledge. Moreover, we introduce the Nearest-Neighbor Isolation Forest (NNIF), a novel semi-supervised anomaly detector based on the Isolation Forest. In contrast to unsupervised anomaly detectors, NNIF can utilize all available label information. Empirical analysis shows that our method, using NNIF as the anomaly detector, generally outperforms state-of-the-art PU learning methods on imbalanced data sets under different labeling mechanisms. Further experiments suggest that our two-step method is strongly robust to incorrect class prior estimates.
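To illustrate the general two-step idea described above (not the authors' NNIF detector), the following is a minimal sketch in which a plain scikit-learn Isolation Forest stands in as the anomaly detector: it is fitted on the labeled positives, scores the unlabeled pool, relabels the most positive-like unlabeled points as hidden positives, and then trains a final classifier. The function name `two_step_pu` and the fraction parameter `relabel_frac` are illustrative assumptions, not the paper's API.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def two_step_pu(X, s, relabel_frac=0.1, random_state=0):
    """Sketch of a two-step PU method with a generic anomaly detector.

    X : feature matrix, shape (n_samples, n_features)
    s : labeling indicator, 1 = labeled positive, 0 = unlabeled
    Step 1: fit the detector on the labeled positives; unlabeled points
    that look "normal" to this model resemble the positive class.
    Step 2: relabel the top-scoring unlabeled points as positives and
    train an ordinary classifier on the completed labels.
    """
    pos = X[s == 1]
    unl_idx = np.where(s == 0)[0]

    # Step 1: score the unlabeled pool against a model of the positives.
    det = IsolationForest(random_state=random_state).fit(pos)
    scores = det.score_samples(X[unl_idx])  # higher = more positive-like

    k = int(relabel_frac * len(unl_idx))
    hidden_pos = unl_idx[np.argsort(scores)[-k:]]

    # Step 2: treat recovered points as positives, the rest as negatives.
    y = np.zeros(len(X), dtype=int)
    y[s == 1] = 1
    y[hidden_pos] = 1
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf, hidden_pos
```

The choice of `relabel_frac` plays the role of the class prior estimate; the paper's experiments suggest the two-step scheme tolerates misspecification of this quantity.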
Notes
A software implementation of our two-step method is available at https://github.com/CarlosOrtegaV/PU_AnomalyDetection
Acknowledgements
We gratefully acknowledge the financial support provided by KBC Groep NV towards this research project.
Additional information
Responsible editor: Dragi Kocev.
About this article
Cite this article
Ortega Vázquez, C., vanden Broucke, S. & De Weerdt, J. A two-step anomaly detection based method for PU classification in imbalanced data sets. Data Min Knowl Disc 37, 1301–1325 (2023). https://doi.org/10.1007/s10618-023-00925-9