Abstract
Generalization is a fundamental problem in machine learning. For overparameterized deep neural network models, there are many solutions that fit the training data equally well. The key question is which of these solutions has better generalization performance, as measured by the test loss (error). Here we report the discovery of exact duality relations between changes in activities and changes in weights in any fully connected layer of a feed-forward neural network. By using the activity–weight duality relation, we decompose the generalization loss into contributions from different directions in weight space. Our analysis reveals that two key factors, the sharpness of the loss landscape and the size of the solution, act together to determine generalization. In general, flatter and smaller solutions generalize better. Using this decomposition of the generalization loss, we show how existing learning algorithms and regularization schemes affect generalization by controlling one or both factors. Furthermore, by applying our analysis framework to evaluate different algorithms for realistic large neural network models in the multi-learner setting, we find that decentralized algorithms have better generalization performance because they introduce additional landscape-dependent noise that leads to flatter solutions without changing their size.
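As a concrete illustration of the duality stated above, the following minimal sketch (ours, not taken from the paper's released code) checks numerically that, for a fully connected layer z = W a, a change of the input activities a → a + δa affects the layer output in exactly the same way as the element-wise weight change δw_ji = w_ji δa_i / a_i. It assumes the duality takes this per-connection relative form and that the activities are nonzero; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))           # weights of one fully connected layer, z = W @ a
a = rng.uniform(0.1, 1.0, size=6)     # input activities, kept away from zero so da / a is defined
da = 0.01 * rng.normal(size=6)        # a small change in the input activities

z_activity = W @ (a + da)             # output after changing the activities
dW = W * (da / a)                     # dual weight change: dW_ji = W_ji * da_i / a_i
z_weight = (W + dW) @ a               # output after changing the weights instead

print(np.allclose(z_activity, z_weight))  # True: the two changes give identical outputs
```

Because the two outputs coincide, any loss evaluated downstream of the layer is also unchanged, which is what allows perturbations in activity space to be mapped onto directions in weight space for the loss decomposition.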
Code availability
The code used in this work is available in the public repository at https://github.com/YuFengDuke/A-W-Duality-Project (ref. 30).
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Goodfellow, I., Courville, A. & Bengio, Y. Deep Learning Vol. 1 (MIT Press, 2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D. & Bengio, S. Fantastic generalization measures and where to find them. In 8th International Conference on Learning Representations (2020).
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations (2017).
Dinh, L., Pascanu, R., Bengio, S. & Bengio, Y. Sharp minima can generalize for deep nets. In International Conference on Machine Learning 1019–1028 (PMLR, 2017).
Zhu, Z., Wu, J., Yu, B., Wu, L. & Ma, J. The anisotropic noise in stochastic gradient descent: its behavior of escaping from sharp minima and regularization effects. In Proc. International Conference on Machine Learning 7654–7663 (PMLR, 2019).
Martens, J. New insights and perspectives on the natural gradient method. J. Mach. Learn. Res. 21, 1–76 (2020).
Chaudhari, P. & Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA) 1–10 (IEEE, 2018).
Feng, Y. & Tu, Y. The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima. Proc. Natl Acad. Sci. USA 118, e2015617118 (2021).
Yang, N., Tang, C. & Tu, Y. Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Phys. Rev. Lett. 130, 237101 (2023).
Golatkar, A. S., Achille, A. & Soatto, S. Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence. Adv. Neural Inf. Process. Syst. 32, 10677–10687 (2019).
Lian, X. et al. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Adv. Neural Inf. Process. Syst. 30, 5330–5340 (2017).
Shallue, C. J. et al. Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, 112:1–112:49 (2019).
Lian, X., Zhang, W., Zhang, C. & Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning 3043–3052 (PMLR, 2018).
Zhang, W. et al. Loss landscape dependent self-adjusting learning rates in decentralized stochastic gradient descent. Preprint at https://arxiv.org/abs/2112.01433 (2021).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (IEEE, 2017).
Hochreiter, S. & Schmidhuber, J. Flat minima. Neural Comput. 9, 1–42 (1997).
Wei, C. & Ma, T. Improved sample complexities for deep networks and robust classification via an all-layer margin. In 8th International Conference on Learning Representations (2020).
Chaudhari, P. et al. Entropy-SGD: biasing gradient descent into wide valleys. J. Stat. Mech. Theory Exp. 2019, 124018 (2019).
Foret, P., Kleiner, A., Mobahi, H. & Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations (2021).
Baldassi, C., Pittorino, F. & Zecchina, R. Shaping the learning landscape in neural networks around wide flat minima. Proc. Natl Acad. Sci. USA 117, 161–170 (2020).
Yang, R., Mao, J. & Chaudhari, P. Does the data induce capacity control in deep learning? In International Conference on Machine Learning 25166–25197 (PMLR, 2022).
Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).
Krizhevsky, A., Nair, V. & Hinton, G. CIFAR-10. Canadian Institute for Advanced Research http://www.cs.toronto.edu/~kriz/cifar.html (2010).
Feng, Y. A–W duality in neural network: the flatter and smaller solution is better for generalization. Zenodo https://doi.org/10.5281/zenodo.8031053 (2023).
Acknowledgements
We thank K. Clarkson and R. Traub for careful reading of our paper and for their useful comments. Part of the work by Y.F. was done while he was an intern at IBM.
Author information
Contributions
Y.T. initiated the project and discovered the activity–weight duality. Y.F. and W.Z. did the numerical experiments. Y.T. and Y.F. developed the theory and analysed the results. All authors wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Pratik Chaudhari and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Text and Figs. 1–14.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Feng, Y., Zhang, W. & Tu, Y. Activity–weight duality in feed-forward neural networks reveals two co-determinants for generalization. Nat Mach Intell 5, 908–918 (2023). https://doi.org/10.1038/s42256-023-00700-x