Sparse kernel deep stacking networks

  • Original Paper
  • Published in: Computational Statistics

Abstract

This article introduces the supervised deep learning method sparse kernel deep stacking networks (SKDSNs), which extend traditional kernel deep stacking networks (KDSNs) by incorporating a set of data-driven regularization and variable selection steps to improve predictive performance in high-dimensional settings. Before model fitting, variable pre-selection is carried out using genetic algorithms in combination with the randomized dependence coefficient, accounting for non-linear dependencies between the inputs and the outcome variable. During model fitting, internal variable selection is based on a ranked feature ordering which is tuned within the model-based optimization framework. Further regularization steps include \(L_1\)-penalized kernel regression and dropout. Our simulation studies demonstrate an improved prediction accuracy of SKDSNs compared to traditional KDSNs. Runtime analysis of SKDSNs shows that the dimension of the random Fourier transformation greatly affects computational efficiency, and that the speed of SKDSNs can be improved by applying a subsampling-based ensemble strategy. Numerical experiments show that the latter strategy further increases predictive performance. Application of SKDSNs to three biomedical data sets confirms the results of the simulation study. SKDSNs are implemented in a new version of the R package kernDeepStackNet.
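To make the pre-selection step more concrete, the following R sketch shows one way to compute the randomized dependence coefficient (RDC) between two variables: each variable is copula-transformed via its empirical CDF, passed through random sine projections, and the largest canonical correlation between the two projected feature sets is returned. This is an illustrative re-implementation based on the published definition of the RDC, not the code of the kernDeepStackNet package; the defaults of k = 20 random features and scale s = 1/6 are assumptions taken from the RDC literature.

    # Minimal sketch of the randomized dependence coefficient (RDC),
    # the non-linear dependence measure used for variable pre-selection.
    # Illustrative only; not the implementation in kernDeepStackNet.
    rdc <- function(x, y, k = 20, s = 1/6) {
      # Copula transform: replace each variable by its empirical CDF values
      # and append a constant column for the intercept
      x <- cbind(apply(as.matrix(x), 2, function(u) rank(u) / length(u)), 1)
      y <- cbind(apply(as.matrix(y), 2, function(u) rank(u) / length(u)), 1)
      # Random non-linear projections: Gaussian weights followed by sine features
      x <- s / ncol(x) * x %*% matrix(rnorm(ncol(x) * k), ncol(x))
      y <- s / ncol(y) * y %*% matrix(rnorm(ncol(y) * k), ncol(y))
      # RDC = largest canonical correlation between the two projected feature sets
      cancor(cbind(sin(x), 1), cbind(sin(y), 1))$cor[1]
    }

    # A purely non-linear (quadratic) dependence is picked up by the RDC,
    # although the Pearson correlation is close to zero.
    set.seed(1)
    x <- rnorm(500)
    y <- x^2 + rnorm(500, sd = 0.1)
    rdc(x, y)  # high value (strong non-linear dependence)
    cor(x, y)  # near zero

In SKDSNs, the RDC serves as the non-linear dependence measure that the genetic algorithm uses when pre-selecting input variables with respect to the outcome before model fitting.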



Acknowledgements

We thank Michael Knapp (Department of Medical Biometry, Informatics and Epidemiology, University of Bonn) for proofreading. Financial support from the Deutsche Forschungsgemeinschaft (Projects SCHM 2966/1-2 and SCHM 2966/2-1) is gratefully acknowledged.

Author information


Corresponding author

Correspondence to Thomas Welchowski.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 282 KB)


About this article


Cite this article

Welchowski, T., Schmid, M. Sparse kernel deep stacking networks. Comput Stat 34, 993–1014 (2019). https://doi.org/10.1007/s00180-018-0832-9

