Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Abstract

Unraveling the reasons behind the remarkable success and exceptional generalization capabilities of deep neural networks presents a formidable challenge. Recent insights from random matrix theory, specifically those concerning the spectral analysis of weight matrices in deep neural networks, offer valuable clues to address this issue. A key finding indicates that the generalization performance of a neural network is associated with the degree of heavy tails in the spectrum of its weight matrices. To capitalize on this discovery, we introduce a novel regularization technique, termed Heavy-Tailed Regularization, which explicitly promotes a more heavy-tailed spectrum in the weight matrix through regularization. Firstly, we employ the Weighted Alpha and Stable Rank as penalty terms, both of which are differentiable, enabling the direct calculation of their gradients. To circumvent over-regularization, we introduce two variations of the penalty function. Then, adopting a Bayesian statistics perspective and leveraging knowledge from random matrices, we develop two novel heavy-tailed regularization methods, utilizing Power-law distribution and Fréchet distribution as priors for the global spectrum and maximum eigenvalues, respectively. We empirically show that heavy-tailed regularization outperforms conventional regularization techniques in terms of generalization performance.
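As a rough illustration of why a spectral quantity such as Stable Rank can serve as a differentiable penalty, the sketch below computes it and its gradient for a single weight matrix. It assumes the standard definition, stable rank = ||W||_F^2 / ||W||_2^2; the abstract does not spell out the paper's exact formulation, and the function names `stable_rank` and `stable_rank_grad` are hypothetical. The gradient formula additionally assumes that the largest singular value is not repeated.

```python
import numpy as np


def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 (standard definition, assumed here)."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, largest first
    return float(np.sum(s ** 2) / s[0] ** 2)


def stable_rank_grad(W):
    """Analytic gradient of the stable rank with respect to W.

    Assumes the top singular value is simple (not repeated), so that
    d(sigma_1^2)/dW = 2 * sigma_1 * u_1 v_1^T is well defined.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    fro2 = np.sum(s ** 2)           # ||W||_F^2, whose gradient is 2W
    top2 = s[0] ** 2                # ||W||_2^2 = sigma_1^2
    u1v1 = np.outer(U[:, 0], Vt[0])
    # Quotient rule applied to fro2 / top2
    return 2.0 * W / top2 - fro2 * 2.0 * s[0] * u1v1 / top2 ** 2


if __name__ == "__main__":
    W = np.diag([2.0, 1.0])
    print(stable_rank(W))       # (4 + 1) / 4 = 1.25
    print(stable_rank_grad(W))
```

In a training loop, a term of this kind would typically be scaled by a hyperparameter and added to the task loss, with an automatic-differentiation framework supplying the gradient; the closed form here is only meant to show that the gradient is directly computable.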



Acknowledgements

Zeng Li’s research is partially supported by NSFC (National Natural Science Foundation of China) Grant No. 12101292, NSFC Grant No. 12031005, and Shenzhen Fundamental Research Program JCYJ20220818100602005.

Author information

Corresponding author

Correspondence to Zeng Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xiao, X., Li, Z., Xie, C., Zhou, F. (2023). Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14263. Springer, Cham. https://doi.org/10.1007/978-3-031-44204-9_20

  • DOI: https://doi.org/10.1007/978-3-031-44204-9_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44203-2

  • Online ISBN: 978-3-031-44204-9

  • eBook Packages: Computer Science, Computer Science (R0)
