Activity–weight duality in feed-forward neural networks reveals two co-determinants for generalization

A preprint version of the article is available at arXiv.

Abstract

Generalization is a fundamental problem in machine learning. For overparameterized deep neural network models, there are many solutions that can fit the training data equally well. The key question is which solution has a better generalization performance measured by test loss (error). Here we report the discovery of exact duality relations between changes in activities and changes in weights in any fully connected layer in feed-forward neural networks. By using the activity–weight duality relation, we decompose the generalization loss into contributions from different directions in weight space. Our analysis reveals that two key factors, sharpness of the loss landscape and size of the solution, act together to determine generalization. In general, flatter and smaller solutions have better generalization. By using the generalization loss decomposition, we show how existing learning algorithms and regularization schemes affect generalization by controlling one or both factors. Furthermore, by applying our analysis framework to evaluate different algorithms for realistic large neural network models in the multi-learner setting, we find that the decentralized algorithms have better generalization performance as they introduce additional landscape-dependent noise that leads to flatter solutions without changing their sizes.
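
The duality in the abstract concerns individual fully connected layers: a small change in the weights of a layer can be traded for a corresponding change in its input activities that produces the same change in the layer output. The exact mapping is derived in the article; the NumPy sketch below is only a hedged illustration of the idea, using a least-squares construction (a stand-in, not the paper's duality relation) to find an activity perturbation that reproduces the output change caused by a weight perturbation. All names (W, a, dW, da) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single fully connected layer z = W a (bias omitted), as in the abstract.
n_in, n_out = 8, 4
W = rng.normal(size=(n_out, n_in))
a = rng.normal(size=n_in)

# A small weight perturbation shifts the layer output by dW @ a.
dW = 1e-2 * rng.normal(size=(n_out, n_in))
dz_from_weights = dW @ a

# Stand-in construction: find an activity perturbation da with W @ da = dW @ a,
# so that keeping the original weights but shifting the input activities
# reproduces the same output change (exactly solvable here since W is wide).
da, *_ = np.linalg.lstsq(W, dz_from_weights, rcond=None)
dz_from_activities = W @ da

print(np.allclose(dz_from_weights, dz_from_activities))  # True
```

Trading weight perturbations for activity perturbations in this way is what lets the generalization loss be decomposed along directions in weight space, which is the basis for the two co-determinants analysed in the article.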

Fig. 1: The A–W duality.
Fig. 2: Characteristics of σg,n and σw,n.
Fig. 3: The effects of the learning rate α or the batch size B in SGD on generalization.
Fig. 4: The effects of the weight decay rate β or the weight initialization s in SGD on generalization.
Fig. 5: The generalization gap and the two generalization determinants in different directions n.
Fig. 6: Analysis of generalization for DPSGD and SSGD in large models (ResNet and DenseNet).
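
Figures 2–5 characterize the two co-determinants along different directions n in weight space. As a rough, hypothetical illustration of how such quantities could be probed numerically (the precise definitions of σg,n and σw,n are given in the article, and the function and argument names below are not from the authors' code), the following PyTorch sketch estimates, for a unit direction n, a finite-difference curvature of the training loss (a flatness proxy) and the projection of the solution weights onto n (a size proxy):

```python
import torch

@torch.no_grad()
def curvature_and_size(model, loss_fn, data, target, direction, eps=1e-2):
    """Hedged sketch: along a unit direction n in weight space, estimate a
    flatness proxy (finite-difference curvature of the training loss) and a
    size proxy (projection of the solution weights onto n). `direction` is a
    list of tensors matching model.parameters() in shape."""
    params = list(model.parameters())
    # Normalize the direction to unit length over the full weight space.
    norm = torch.sqrt(sum((d ** 2).sum() for d in direction))
    direction = [d / norm for d in direction]

    def loss_at(shift):
        # Shift the weights along n, evaluate the loss, then undo the shift
        # (restores the weights up to floating-point round-off).
        for p, d in zip(params, direction):
            p.add_(shift * d)
        value = loss_fn(model(data), target).item()
        for p, d in zip(params, direction):
            p.sub_(shift * d)
        return value

    l0, lp, lm = loss_at(0.0), loss_at(eps), loss_at(-eps)
    curvature = (lp + lm - 2.0 * l0) / eps ** 2            # sharpness along n
    size = sum((p * d).sum() for p, d in zip(params, direction)).item()  # w . n
    return curvature, size
```

In practice, one would average such quantities over many directions (for example, random directions or principal directions of the loss landscape) and compare solutions trained with different hyperparameters, such as the learning rates or batch sizes in Fig. 3, or the algorithms (DPSGD versus SSGD) in Fig. 6.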

Data availability

The MNIST (ref. 28) and CIFAR-10 (ref. 29) datasets are publicly available; we use the versions provided by the PyTorch package.

Code availability

The code used in this work is available at the public repository https://github.com/YuFengDuke/A-W-Duality-Project (ref. 30).

References

  1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  2. Goodfellow, I., Courville, A. & Bengio, Y. Deep Learning Vol. 1 (MIT Press, 2016).

  3. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  4. Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).

  5. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  6. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  7. Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D. & Bengio, S. Fantastic generalization measures and where to find them. In 8th International Conference on Learning Representations (2020).

  8. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations (2017).

  9. Dinh, L., Pascanu, R., Bengio, S. & Bengio, Y. Sharp minima can generalize for deep nets. In International Conference on Machine Learning 1019–1028 (PMLR, 2017).

  10. Zhu, Z., Wu, J., Yu, B., Wu, L. & Ma, J. The anisotropic noise in stochastic gradient descent: its behavior of escaping from sharp minima and regularization effects. In Proc. International Conference on Machine Learning 7654–7663 (PMLR, 2019).

  11. Martens, J. New insights and perspectives on the natural gradient method. J. Mach. Learn. Res. 21, 1–76 (2020).

  12. Chaudhari, P., & Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA) 1–10 (IEEE, 2018).

  13. Feng, Y. & Tu, Y. The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima. Proc. Natl Acad. Sci. USA 118, e2015617118 (2021).

  14. Yang, N., Tang, C. & Tu, Y. Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Phys. Rev. Lett. 130, 237101 (2023).

  15. Golatkar, A. S., Achille, A. & Soatto, S. Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence. Adv. Neural Inf. Process. Syst. 32, 10677–10687 (2019).

  16. Lian, X. et al. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Adv. Neural Inf. Process. Syst. 30, 5330–5340 (2017).

  17. Shallue, C. J. et al. Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, 112:1–112:49 (2019).

  18. Lian, X., Zhang, W., Zhang, C. & Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning 3043–3052 (PMLR, 2018).

  19. Zhang, W. et al. Loss landscape dependent self-adjusting learning rates in decentralized stochastic gradient descent. Preprint at https://arxiv.org/abs/2112.01433 (2021).

  20. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  21. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (IEEE, 2017).

  22. Hochreiter, S. & Schmidhuber, J. Flat minima. Neural Comput. 9, 1–42 (1997).

  23. Wei, C. & Ma, T. Improved sample complexities for deep networks and robust classification via an all-layer margin. In International Conference on Learning Representations (2020).

  24. Chaudhari, P. et al. Entropy-SGD: biasing gradient descent into wide valleys. J. Stat. Mech. Theor. E 2019, 124018 (2019).

  25. Foret, P., Kleiner, A., Mobahi, H. & Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations (2021).

  26. Baldassi, C., Pittorino, F. & Zecchina, R. Shaping the learning landscape in neural networks around wide flat minima. Proc. Natl Acad. Sci. USA 117, 161–170 (2020).

  27. Yang, R., Mao, J. & Chaudhari, P. Does the data induce capacity control in deep learning? In International Conference on Machine Learning 25166–25197 (PMLR, 2022).

  28. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).

  29. Krizhevsky, A., Nair, V. & Hinton, G. CIFAR-10. Canadian Institute for Advanced Research http://www.cs.toronto.edu/~kriz/cifar.html (2010).

  30. Feng, Y. A–W duality in neural network: the flatter and smaller solution is better for generalization. Zenodo https://doi.org/10.5281/zenodo.8031053 (2023).

Acknowledgements

We thank K. Clarkson and R. Traub for carefully reading our paper and for useful comments. The work by Y.F. was partially done while he was employed as an IBM intern.

Author information

Authors and Affiliations

Authors

Contributions

Y.T. initiated the project and discovered the activity–weight duality. Y.F. and W.Z. did the numerical experiments. Y.T. and Y.F. developed the theory and analysed the results. All authors wrote the paper.

Corresponding author

Correspondence to Yuhai Tu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Pratik Chaudhari and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Text and Figs. 1–14.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Feng, Y., Zhang, W. & Tu, Y. Activity–weight duality in feed-forward neural networks reveals two co-determinants for generalization. Nat Mach Intell 5, 908–918 (2023). https://doi.org/10.1038/s42256-023-00700-x
