Abstract
Cross-validation (CV) is a widely used technique in machine learning pipelines. However, some of its drawbacks have been recognized over the last decades. In particular, CV may generate folds that are unrepresentative of the whole dataset, which has led some works to propose methods that attempt to produce more distribution-balanced folds. In this work, we propose an adaptation of a cluster-based technique for cross-validation, built on mini-batch k-means, that is more computationally efficient. Furthermore, we compare our adaptation with other splitting strategies not previously compared, and we analyze whether class imbalance influences the quality of the estimators. Our results indicate that the more elaborate CV strategies show potential gains when a small number of folds is used, but stratified cross-validation is preferable for 10-fold CV or in imbalanced scenarios. Finally, our adaptation of the cluster-based splitter reduces its computational cost while retaining similar performance.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by grants from the Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul - FAPERGS [21/2551-0002052-0] and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [308075/2021-8].
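The paper's exact splitting algorithm is not reproduced on this page, but the general idea behind cluster-based cross-validation with mini-batch k-means can be sketched as follows: cluster the data cheaply with scikit-learn's `MiniBatchKMeans`, then distribute each cluster's samples round-robin across the folds so that every fold approximates the full data distribution. The helper name `cluster_kfold_indices` and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_kfold_indices(X, n_folds=5, n_clusters=10, seed=0):
    """Assign each sample a fold index by spreading every cluster's
    members round-robin across the folds, so each fold roughly
    mirrors the overall distribution of the data."""
    labels = MiniBatchKMeans(
        n_clusters=n_clusters, random_state=seed, n_init=3
    ).fit_predict(X)
    folds = np.empty(len(X), dtype=int)
    rng = np.random.default_rng(seed)
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)  # avoid ordering artifacts within a cluster
        folds[members] = np.arange(len(members)) % n_folds
    return folds
```

Because mini-batch k-means processes small random batches per iteration rather than the full dataset, this clustering step scales to large datasets far better than the full-batch k-means used by earlier cluster-based splitters.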
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fontanari, T., Fróes, T.C., Recamonde-Mendoza, M. (2022). Cross-validation Strategies for Balanced and Imbalanced Datasets. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science(), vol 13653. Springer, Cham. https://doi.org/10.1007/978-3-031-21686-2_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21685-5
Online ISBN: 978-3-031-21686-2