Abstract
Cross-validation using randomized subsets of data, known as k-fold cross-validation, is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence affecting validation outcomes in discrete Bayesian networks (BNs). We created six variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time with seven levels of folds (k = 2, 5, 10, 20, n − 5, n − 2, and n − 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice with BN models having independent variable structures.
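To make the procedure concrete, the following is a minimal sketch of k-fold cross-validation run at the seven fold levels named in the abstract, for n = 50. It is not the authors' code or model: the synthetic data generator, the count-based naive Bayes classifier (used here as a simple stand-in for a discrete BN), and every function name are assumptions chosen for illustration only.

```python
import random
from collections import Counter, defaultdict


def make_data(n, seed=0):
    """Hypothetical synthetic data: two binary features and a binary class."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = int(rng.random() < 0.5)
        x1 = int(rng.random() < (0.8 if y else 0.3))
        x2 = int(rng.random() < (0.7 if y else 0.4))
        rows.append(((x1, x2), y))
    return rows


def fit_naive_bayes(train):
    """Count-based naive Bayes (assumed stand-in for a discrete BN)."""
    class_counts = Counter(y for _, y in train)
    feat_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for x, y in train:
        for i, v in enumerate(x):
            feat_counts[(i, y)][v] += 1
    return class_counts, feat_counts


def predict(model, x):
    """Return the class with the highest Laplace-smoothed posterior score."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y in (0, 1):
        score = (class_counts[y] + 1) / (total + 2)
        for i, v in enumerate(x):
            score *= (feat_counts[(i, y)][v] + 1) / (class_counts[y] + 2)
        if score > best_score:
            best, best_score = y, score
    return best


def kfold_error(rows, k, seed=0):
    """Pooled classification error over k randomized, disjoint folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # every case lands in exactly one fold
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [rows[i] for i in idx if i not in held_out]
        model = fit_naive_bayes(train)
        errors += sum(predict(model, rows[i][0]) != rows[i][1] for i in fold)
    return errors / len(rows)


n = 50
rows = make_data(n, seed=1)
ks = [2, 5, 10, 20, n - 5, n - 2, n - 1]  # fold levels listed in the abstract
results = {k: kfold_error(rows, k) for k in ks}
for k, err in results.items():
    print(f"k = {k:2d}  error = {err:.3f}")
```

Each case is held out exactly once, so the reported value is the fraction of all n cases misclassified when predicted by a model trained without them. Repeating the run over several seeds and sample sizes would be needed before reading any trend in k from such a toy example.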
Acknowledgements
We thank Clint Epps, Julie Heinrichs, and an anonymous reviewer for helpful comments on the manuscript. Marcot acknowledges support from U.S. Forest Service, Pacific Northwest Research Station, and University of Melbourne, Australia. Mention of commercial or other products does not necessarily imply endorsement by the U.S. Government.
About this article
Cite this article
Marcot, B.G., Hanea, A.M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput Stat 36, 2009–2031 (2021). https://doi.org/10.1007/s00180-020-00999-9
DOI: https://doi.org/10.1007/s00180-020-00999-9