Abstract
Unraveling the reasons behind the remarkable success and generalization capabilities of deep neural networks remains a formidable challenge. Recent insights from random matrix theory, specifically the spectral analysis of weight matrices in deep neural networks, offer valuable clues. A key finding is that the generalization performance of a neural network is associated with the degree of heavy-tailedness in the spectrum of its weight matrices. To capitalize on this discovery, we introduce a novel regularization technique, termed Heavy-Tailed Regularization, which explicitly promotes a heavier-tailed spectrum in the weight matrices via a regularization penalty. First, we employ the Weighted Alpha and the Stable Rank as penalty terms; both are differentiable, so their gradients can be computed directly, and we introduce two variations of the penalty function to avoid over-regularization. Second, adopting a Bayesian perspective and drawing on results from random matrix theory, we develop two further heavy-tailed regularization methods, which place a power-law prior on the global spectrum and a Fréchet prior on the largest eigenvalue, respectively. We show empirically that heavy-tailed regularization outperforms conventional regularization techniques in terms of generalization performance.
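As a concrete illustration of the first family of penalties, the sketch below adds a stable-rank term, ||W||_F^2 / ||W||_2^2, summed over the 2-D weight matrices of a network, to the task loss. This is a minimal PyTorch sketch under stated assumptions, not the paper's exact formulation: the helper names (stable_rank, heavy_tailed_penalty), the choice to sum over all 2-D weight matrices, and the coefficient reg_coef are illustrative.

import torch
import torch.nn as nn

def stable_rank(weight: torch.Tensor) -> torch.Tensor:
    # Stable rank ||W||_F^2 / ||W||_2^2; differentiable, so autograd can
    # compute its gradient directly when it is used as a penalty term.
    fro_sq = weight.pow(2).sum()
    spec = torch.linalg.matrix_norm(weight, ord=2)  # largest singular value
    return fro_sq / (spec.pow(2) + 1e-12)

def heavy_tailed_penalty(model: nn.Module) -> torch.Tensor:
    # Sum the stable ranks of all 2-D weight matrices in the model.
    device = next(model.parameters()).device
    penalty = torch.zeros((), device=device)
    for name, param in model.named_parameters():
        if param.ndim == 2 and name.endswith("weight"):
            penalty = penalty + stable_rank(param)
    return penalty

# Illustrative training step: penalizing a large stable rank concentrates the
# spectrum on its leading singular values, i.e., promotes heavier tails.
# reg_coef is a hypothetical hyperparameter, not a value from the paper.
# loss = criterion(model(x), y) + reg_coef * heavy_tailed_penalty(model)
# loss.backward(); optimizer.step()

A Weighted-Alpha penalty could be treated analogously, since it is likewise a differentiable function of the singular values of the weight matrix.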
Acknowledgements
Zeng Li's research is partially supported by NSFC (National Natural Science Foundation of China) Grants No. 12101292 and No. 12031005, and Shenzhen Fundamental Research Program Grant JCYJ20220818100602005.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xiao, X., Li, Z., Xie, C., Zhou, F. (2023). Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. Lecture Notes in Computer Science, vol. 14263. Springer, Cham. https://doi.org/10.1007/978-3-031-44204-9_20
DOI: https://doi.org/10.1007/978-3-031-44204-9_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44203-2
Online ISBN: 978-3-031-44204-9