Abstract
This article introduces the supervised deep learning method sparse kernel deep stacking networks (SKDSNs), which extend traditional kernel deep stacking networks (KDSNs) by incorporating a set of data-driven regularization and variable selection steps to improve predictive performance in high-dimensional settings. Before model fitting, variable pre-selection is carried out using genetic algorithms in combination with the randomized dependence coefficient, accounting for non-linear dependencies between the input variables and the outcome variable. During model fitting, internal variable selection is based on a ranked feature ordering that is tuned within the model-based optimization framework. Further regularization steps include \(L_1\)-penalized kernel regression and dropout. Our simulation studies demonstrate an improved prediction accuracy of SKDSNs compared to traditional KDSNs. Runtime analysis of SKDSNs shows that the dimension of the random Fourier transformation greatly affects computational efficiency, and that the speed of SKDSNs can be improved by applying a subsampling-based ensemble strategy. Numerical experiments show that the latter strategy further increases predictive performance. Application of SKDSNs to three biomedical data sets confirms the results of the simulation study. SKDSNs are implemented in a new version of the R package kernDeepStackNet.
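As a minimal illustration of the pre-selection criterion, the following R sketch computes the randomized dependence coefficient for a single input-outcome pair: both variables are mapped to the empirical copula (rank) scale, projected onto random sinusoidal features, and the largest canonical correlation between the two feature sets is returned. The sketch is not the implementation shipped with kernDeepStackNet; the number of random features k = 20, the projection scale s = 1/6 and the quadratic toy example are illustrative assumptions.

# Minimal sketch of the randomized dependence coefficient (RDC);
# illustrative only, not the kernDeepStackNet implementation.
# k (number of random features) and s (projection scale) are assumed defaults.
rdc <- function(x, y, k = 20, s = 1/6) {
  x <- as.matrix(x); y <- as.matrix(y)
  # Empirical copula (rank) transform of each column, values in (0, 1]
  cx <- apply(x, 2, function(v) rank(v) / length(v))
  cy <- apply(y, 2, function(v) rank(v) / length(v))
  # Augment with an intercept and project onto k random sinusoidal features
  cx <- cbind(cx, 1); cy <- cbind(cy, 1)
  wx <- matrix(rnorm(ncol(cx) * k, sd = s), nrow = ncol(cx))
  wy <- matrix(rnorm(ncol(cy) * k, sd = s), nrow = ncol(cy))
  fx <- sin(cx %*% wx); fy <- sin(cy %*% wy)
  # RDC = largest canonical correlation between the two random feature sets
  cancor(fx, fy)$cor[1]
}

# Toy example: a quadratic dependence that the Pearson correlation misses by
# design, while the RDC targets exactly this kind of non-linear association
set.seed(1)
x <- rnorm(500)
y <- x^2 + rnorm(500, sd = 0.1)
c(pearson = cor(x, y), rdc = rdc(x, y))

In the SKDSN pipeline, coefficients of this type are combined with a genetic algorithm to pre-select inputs before model fitting.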
Acknowledgements
We thank Michael Knapp (Department of Medical Biometry, Informatics and Epidemiology, University of Bonn) for proofreading. Financial support from Deutsche Forschungsgemeinschaft (Project SCHM 2966/1-2, Project SCHM 2966/2-1) is gratefully acknowledged.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Welchowski, T., Schmid, M. Sparse kernel deep stacking networks. Comput Stat 34, 993–1014 (2019). https://doi.org/10.1007/s00180-018-0832-9