Abstract
Deep learning models have been very successful in machine learning applications, often out-performing classical statistical models such as linear regression models or generalized linear models (GLMs). On the other hand, deep learning models are often criticized for being neither explainable nor amenable to variable selection. There are two ways of dealing with this problem: either we use post-hoc model interpretability methods, or we design specific deep learning architectures that allow for easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet, an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture so that we obtain feature sparsity for variable selection. We benchmark our approach against the recently developed LassoNet of Lemhadri et al. (LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29, 2021).
Notes
We call our proposal LASSO regularization of the LocalGLMnet. Whereas the LASSO was originally proposed for the linear regression model, it has been extended to GLMs; see Sect. 3.4 in Hastie et al. (2015).
The dataset is available at http://lib.stat.cmu.edu/datasets/boston, and the code for this example is available on GitHub at https://github.com/RonRichman/Regularized-LocalGLMnet.
The dataset is available at http://www2.math.uconn.edu/~valdez/telematics_syn-032021.csv
Note that, due to privacy concerns, these 100,000 records were generated synthetically based on real data; see So et al. (2021) for a detailed description.
The grouped version of the model was applied in accordance with the instructions at https://github.com/lasso-net/lassonet/issues/7.
References
Agarwal R, Frosst N, Zhang X, Caruana R, Hinton GE (2020) Neural additive models: interpretable machine learning with neural nets. arXiv:2004.13912v1
Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box supervised learning models. J R Stat Soc Ser B 82(4):1059–1086
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Gneiting T (2011) Making and evaluating point forecasts. J Am Stat Assoc 106(494):746–762
Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 102(477):359–378
Harrison D, Rubinfeld DL (1978) Hedonic prices and the demand for clean air. J Environ Econ Manag 5:81–102
Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the Lasso and generalizations. CRC Press
Hoerl A, Kennard R (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
Lee JD, Sun DL, Sun Y, Taylor JE (2016) Exact post-selection inference, with application to the LASSO. Ann Stat 44(3):907–927
Lemhadri I, Ruan F, Abraham L, Tibshirani R (2021) LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29
Lindholm M, Richman R, Tsanakas A, Wüthrich MV (2022) Discrimination-free insurance pricing. ASTIN Bull J IAA 52:55–89
Merity S, McCann B, Socher R (2017) Revisiting activation regularization for language RNNs. arXiv:1708.01009v1
Merz M, Richman R, Tsanakas A, Wüthrich MV (2022) Interpreting deep learning models with marginal attribution by conditioning on quantiles. Data Min Knowl Discov 36:1335–1370
Oelker M-R, Tutz G (2017) A uniform framework for the combination of penalties in generalized structured models. Adv Data Anal Classif 11:97–120
Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231
Richman R (2021) Mind the gap—safely incorporating deep learning models into the actuarial toolkit. SSRN Manuscript ID 3857693
Richman R, Wüthrich MV (2022) LocalGLMnet: interpretable deep learning for tabular data. Scand Actuar J, in press
So B, Boucher JP, Valdez EA (2021) Synthetic dataset generation of driver telematics. Risks 9(4):58
Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B Stat Methodol 58:267–288
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused LASSO. J R Stat Soc Ser B Stat Methodol 67:91–108
Tikhonov AN (1943) On the stability of inverse problems. Dokl Akad Nauk SSSR 39(5):195–198
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. arXiv:1706.03762v5
Vaughan J, Sudjianto A, Brahimi E, Chen J, Nair VN (2018) Explainable neural networks based on additive index models. arXiv:1806.01933v1
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol 68:49–67
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67:301–320
Acknowledgements
The authors wish to thank the editor, assistant editor and reviewers of an earlier version of this manuscript for their comments, which helped to improve the manuscript significantly.
Ethics declarations
Conflict of interest
Both authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: R code
Appendix B: LassoNet training details
The LassoNet models used were based on the Python code provided for the group LASSO version of the LassoNet at https://github.com/lasso-net/lassonet/tree/group-lasso (see footnote 5). The dimensions of the hidden layers of the LassoNet were set to the same dimensions as the corresponding LocalGLMnet so that the model capacity is roughly comparable, i.e., any differences in performance are mainly attributable to the way in which regularization is applied within each of the models. The main hyperparameter tested for the LassoNet was the budget parameter M; a range of LassoNet models is fit automatically for different values of the regularization parameter \(\eta\). Values of \(M \in \{1, 10, 100\}\) were tested for each example.
For the Boston housing dataset, the best LassoNet model, as indicated by the MSE on the learning set, was selected (since no validation or test sets are used in that example). For the telematics data, the LassoNet producing the lowest binary cross-entropy loss on the validation set was selected.
Only a single run of the LassoNet model was used for these results; nonetheless, it was observed that the results varied quite significantly across training runs (see the last line of Table 8, which shows that the LassoNet has the highest standard deviation over training runs among the models considered), indicating that better results could perhaps be achieved by using multiple runs and averaging over these.
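To illustrate the selection procedure described above, the following minimal Python sketch shows how it could be implemented with the publicly available lassonet package (standard interface; the grouped variant lives on the branch linked above). This is not the code used for the paper: the function fit_lassonet, the placeholder hidden-layer dimensions, and the checkpoint attribute and method names (state_dict, lambda_, selected, predict_proba) follow the package documentation, which denotes the regularization parameter by lambda, and may differ between package versions.

```python
# Minimal sketch (not the authors' code): fit LassoNet for several values of the
# budget parameter M and keep the path checkpoint with the lowest validation
# binary cross-entropy, as described above for the telematics example.
from sklearn.metrics import log_loss
from lassonet import LassoNetClassifier  # pip install lassonet


def fit_lassonet(X_train, y_train, X_val, y_val, hidden_dims=(20, 15, 10)):
    """Return the best (M, lambda, selected features) found along the paths."""
    best = None
    for M in (1, 10, 100):  # budget parameter grid used in this appendix
        model = LassoNetClassifier(hidden_dims=hidden_dims, M=M)
        # path() fits a sequence of models for increasing regularization
        # strength and returns one checkpoint per penalty value.
        for checkpoint in model.path(X_train, y_train):
            model.load(checkpoint.state_dict)  # restore this checkpoint
            val_loss = log_loss(y_val, model.predict_proba(X_val))
            if best is None or val_loss < best["val_loss"]:
                best = {
                    "M": M,
                    "lambda": checkpoint.lambda_,
                    "selected": checkpoint.selected,  # feature mask
                    "val_loss": val_loss,
                }
    return best
```

For the Boston housing example, one would instead use LassoNetRegressor and track the MSE on the learning set, since no validation or test sets are used there.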
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Richman, R., Wüthrich, M.V. LASSO regularization within the LocalGLMnet architecture. Adv Data Anal Classif 17, 951–981 (2023). https://doi.org/10.1007/s11634-022-00529-z
Keywords
- Deep learning
- Neural networks
- LocalGLMnet
- Regression model
- Variable selection
- Regularization
- LASSO
- Group LASSO
- Ridge regularization
- Tikhonov regularization