Abstract
Cross-validation (CV) is a widely used technique in machine learning pipelines. However, some of its drawbacks have been recognized over the last decades. In particular, CV may generate folds that are unrepresentative of the whole dataset, which has led some works to propose methods that attempt to produce more distribution-balanced folds. In this work, we propose an adaptation of a cluster-based technique for cross-validation, built on mini-batch k-means, that is more computationally efficient. Furthermore, we compare our adaptation with other splitting strategies not previously compared, and we analyze whether class imbalance influences the quality of the estimators. Our results indicate that the more elaborate CV strategies show potential gains when a small number of folds is used, but stratified cross-validation is preferable for 10-fold CV or in imbalanced scenarios. Finally, our adaptation of the cluster-based splitter reduces its computational cost while retaining similar performance.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by grants from the Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul - FAPERGS [21/2551-0002052-0] and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [308075/2021-8].
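The paper's exact splitting algorithm is not reproduced on this page, but the general idea behind cluster-based cross-validation with mini-batch k-means can be sketched as follows: cluster the data cheaply with scikit-learn's `MiniBatchKMeans`, then distribute each cluster's samples round-robin across the folds so that every fold approximates the full data distribution. The helper name `cluster_kfold_indices` and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_kfold_indices(X, n_folds=5, n_clusters=10, seed=0):
    """Assign each sample a fold index by spreading every cluster's
    members round-robin across the folds, so each fold roughly
    mirrors the overall distribution of the data."""
    labels = MiniBatchKMeans(
        n_clusters=n_clusters, random_state=seed, n_init=3
    ).fit_predict(X)
    folds = np.empty(len(X), dtype=int)
    rng = np.random.default_rng(seed)
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)  # avoid ordering artifacts within a cluster
        folds[members] = np.arange(len(members)) % n_folds
    return folds
```

Because mini-batch k-means processes small random batches per iteration rather than the full dataset, this clustering step scales to large datasets far better than the full-batch k-means used by earlier cluster-based splitters.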
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fontanari, T., Fróes, T.C., Recamonde-Mendoza, M. (2022). Cross-validation Strategies for Balanced and Imbalanced Datasets. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science(), vol 13653. Springer, Cham. https://doi.org/10.1007/978-3-031-21686-2_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21685-5
Online ISBN: 978-3-031-21686-2