Abstract
This article introduces the supervised deep learning method sparse kernel deep stacking networks (SKDSNs), which extend traditional kernel deep stacking networks (KDSNs) by incorporating a set of data-driven regularization and variable selection steps to improve predictive performance in high-dimensional settings. Before model fitting, variable pre-selection is carried out using genetic algorithms in combination with the randomized dependence coefficient, accounting for non-linear dependencies between the input variables and the outcome variable. During model fitting, internal variable selection is based on a ranked feature ordering that is tuned within the model-based optimization framework. Further regularization steps include \(L_1\)-penalized kernel regression and dropout. Our simulation studies demonstrate an improved prediction accuracy of SKDSNs compared to traditional KDSNs. Runtime analysis of SKDSNs shows that the dimension of the random Fourier transformation greatly affects computational efficiency, and that the speed of SKDSNs can be improved by applying a subsampling-based ensemble strategy. Numerical experiments show that the latter strategy further increases predictive performance. Application of SKDSNs to three biomedical data sets confirms the results of the simulation study. SKDSNs are implemented in a new version of the R package kernDeepStackNet.
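As a minimal illustration of the pre-selection criterion, the following R sketch computes the randomized dependence coefficient for a single input-outcome pair: both variables are mapped to the empirical copula (rank) scale, projected onto random sinusoidal features, and the largest canonical correlation between the two feature sets is returned. The sketch is not the implementation shipped with kernDeepStackNet; the number of random features k = 20, the projection scale s = 1/6 and the quadratic toy example are illustrative assumptions.

# Minimal sketch of the randomized dependence coefficient (RDC);
# illustrative only, not the kernDeepStackNet implementation.
# k (number of random features) and s (projection scale) are assumed defaults.
rdc <- function(x, y, k = 20, s = 1/6) {
  x <- as.matrix(x); y <- as.matrix(y)
  # Empirical copula (rank) transform of each column, values in (0, 1]
  cx <- apply(x, 2, function(v) rank(v) / length(v))
  cy <- apply(y, 2, function(v) rank(v) / length(v))
  # Augment with an intercept and project onto k random sinusoidal features
  cx <- cbind(cx, 1); cy <- cbind(cy, 1)
  wx <- matrix(rnorm(ncol(cx) * k, sd = s), nrow = ncol(cx))
  wy <- matrix(rnorm(ncol(cy) * k, sd = s), nrow = ncol(cy))
  fx <- sin(cx %*% wx); fy <- sin(cy %*% wy)
  # RDC = largest canonical correlation between the two random feature sets
  cancor(fx, fy)$cor[1]
}

# Toy example: a quadratic dependence that the Pearson correlation misses by
# design, while the RDC targets exactly this kind of non-linear association
set.seed(1)
x <- rnorm(500)
y <- x^2 + rnorm(500, sd = 0.1)
c(pearson = cor(x, y), rdc = rdc(x, y))

In the SKDSN pipeline, coefficients of this type are combined with a genetic algorithm to pre-select inputs before model fitting.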
Acknowledgements
We thank Michael Knapp (Department of Medical Biometry, Informatics and Epidemiology, University of Bonn) for proofreading. Financial support from Deutsche Forschungsgemeinschaft (Project SCHM 2966/1-2, Project SCHM 2966/2-1) is gratefully acknowledged.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Welchowski, T., Schmid, M. Sparse kernel deep stacking networks. Comput Stat 34, 993–1014 (2019). https://doi.org/10.1007/s00180-018-0832-9