Cross-validation Strategies for Balanced and Imbalanced Datasets

  • Conference paper
Intelligent Systems (BRACIS 2022)

Abstract

Cross-validation (CV) is a widely used technique in machine learning pipelines. However, several of its drawbacks have been recognized over the last decades. In particular, CV may generate folds that are unrepresentative of the whole dataset, which has led to methods that attempt to produce more distribution-balanced folds. In this work, we propose an adaptation of a cluster-based cross-validation technique that uses mini-batch k-means to make it more computationally efficient. Furthermore, we compare our adaptation with other splitting strategies that had not previously been compared, and we analyze whether class imbalance influences the quality of the estimators. Our results indicate that the more elaborate CV strategies show potential gains when a small number of folds is used, but stratified cross-validation is preferable for 10-fold CV or in imbalanced scenarios. Finally, our adaptation of the cluster-based splitter reduces its computational cost while retaining similar performance.
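
As an illustration of the general idea described above (a minimal sketch under assumptions, not the paper's exact algorithm; the helper name minibatch_kmeans_kfold and its default parameters are hypothetical), a cluster-based splitter can be built on scikit-learn's MiniBatchKMeans by clustering the samples and then dealing each cluster's points round-robin across the folds, so that every fold roughly mirrors the overall distribution of the data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def minibatch_kmeans_kfold(X, n_splits=5, n_clusters=10, random_state=0):
    """Yield (train_idx, test_idx) pairs for cluster-based k-fold CV (sketch)."""
    # Cluster the samples; mini-batch k-means keeps this step cheap on large datasets.
    labels = MiniBatchKMeans(n_clusters=n_clusters, n_init=3,
                             random_state=random_state).fit_predict(X)

    folds = [[] for _ in range(n_splits)]
    rng = np.random.default_rng(random_state)
    for c in np.unique(labels):
        # Shuffle each cluster's members and distribute them evenly over the folds,
        # so each fold approximates the cluster structure of the whole dataset.
        members = rng.permutation(np.flatnonzero(labels == c))
        for i, idx in enumerate(members):
            folds[i % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = np.sort(np.asarray(folds[k]))
        train_idx = np.sort(np.concatenate(
            [np.asarray(folds[j]) for j in range(n_splits) if j != k]))
        yield train_idx, test_idx
```

The resulting index pairs can be passed to any scikit-learn utility that accepts an iterable of (train, test) splits, for example cross_val_score(model, X, y, cv=list(minibatch_kmeans_kfold(X))).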

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by grants from the Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul - FAPERGS [21/2551-0002052-0] and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [308075/2021-8].


Notes

  1. https://github.com/froestiago/K-Fold-Partitioning-Methods/tree/bracis22.


Author information

Corresponding author

Correspondence to Mariana Recamonde-Mendoza.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fontanari, T., Fróes, T.C., Recamonde-Mendoza, M. (2022). Cross-validation Strategies for Balanced and Imbalanced Datasets. In: Xavier-Junior, J.C., Rios, R.A. (eds.) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol. 13653. Springer, Cham. https://doi.org/10.1007/978-3-031-21686-2_43

  • DOI: https://doi.org/10.1007/978-3-031-21686-2_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21685-5

  • Online ISBN: 978-3-031-21686-2

  • eBook Packages: Computer Science, Computer Science (R0)
