Abstract
Generalization is a fundamental problem in machine learning. For overparameterized deep neural network models, there are many solutions that fit the training data equally well. The key question is which of these solutions has better generalization performance, as measured by the test loss (error). Here we report the discovery of exact duality relations between changes in activities and changes in weights in any fully connected layer of a feed-forward neural network. By using the activity–weight duality relation, we decompose the generalization loss into contributions from different directions in weight space. Our analysis reveals that two key factors, the sharpness of the loss landscape and the size of the solution, act together to determine generalization. In general, flatter and smaller solutions generalize better. Using this decomposition of the generalization loss, we show how existing learning algorithms and regularization schemes affect generalization by controlling one or both factors. Furthermore, by applying our analysis framework to evaluate different algorithms for realistic large neural network models in the multi-learner setting, we find that decentralized algorithms have better generalization performance because they introduce additional landscape-dependent noise that leads to flatter solutions without changing their size.
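As a concrete illustration of the duality stated above, the following minimal sketch (ours, not taken from the paper's released code) checks numerically that, for a fully connected layer z = W a, a change of the input activities a → a + δa affects the layer output in exactly the same way as the element-wise weight change δw_ji = w_ji δa_i / a_i. It assumes the duality takes this per-connection relative form and that the activities are nonzero; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))           # weights of one fully connected layer, z = W @ a
a = rng.uniform(0.1, 1.0, size=6)     # input activities, kept away from zero so da / a is defined
da = 0.01 * rng.normal(size=6)        # a small change in the input activities

z_activity = W @ (a + da)             # output after changing the activities
dW = W * (da / a)                     # dual weight change: dW_ji = W_ji * da_i / a_i
z_weight = (W + dW) @ a               # output after changing the weights instead

print(np.allclose(z_activity, z_weight))  # True: the two changes give identical outputs
```

Because the two outputs coincide, any loss evaluated downstream of the layer is also unchanged, which is what allows perturbations in activity space to be mapped onto directions in weight space for the loss decomposition.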
Code availability
The code used in this work is available in the public repository at https://github.com/YuFengDuke/A-W-Duality-Project (ref. 30).
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Goodfellow, I., Courville, A. & Bengio, Y. Deep Learning Vol. 1 (MIT Press, 2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D. & Bengio, S. Fantastic generalization measures and where to find them. In 8th International Conference on Learning Representations (2020).
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations (2017).
Dinh, L., Pascanu, R., Bengio, S. & Bengio, Y. Sharp minima can generalize for deep nets. In International Conference on Machine Learning 1019–1028 (PMLR, 2017).
Zhu, Z., Wu, J., Yu, B., Wu, L. & Ma, J. The anisotropic noise in stochastic gradient descent: its behavior of escaping from sharp minima and regularization effects. In Proc. International Conference on Machine Learning 7654–7663 (PMLR, 2019).
Martens, J. New insights and perspectives on the natural gradient method. J. Mach. Learn. Res. 21, 1–76 (2020).
Chaudhari, P. & Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA) 1–10 (IEEE, 2018).
Feng, Y. & Tu, Y. The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima. Proc. Natl Acad. Sci. USA 118, e2015617118 (2021).
Yang, N., Tang, C. & Tu, Y. Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Phys. Rev. Lett. 130, 237101 (2023).
Golatkar, A. S., Achille, A. & Soatto, S. Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence. Adv. Neural Inf. Process. Syst. 32, 10677–10687 (2019).
Lian, X. et al. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Adv. Neural Inf. Process. Syst. 30, 5330–5340 (2017).
Shallue, C. J. et al. Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, 112:1–112:49 (2019).
Lian, X., Zhang, W., Zhang, C. & Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning 3043–3052 (PMLR, 2018).
Zhang, W. et al. Loss landscape dependent self-adjusting learning rates in decentralized stochastic gradient descent. Preprint at https://arxiv.org/abs/2112.01433 (2021).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (IEEE, 2017).
Hochreiter, S. & Schmidhuber, J. Flat minima. Neural Comput. 9, 1–42 (1997).
Wei, C. & Ma, T. Improved sample complexities for deep networks and robust classification via an all-layer margin. In 8th International Conference on Learning Representations (2020).
Chaudhari, P. et al. Entropy-SGD: biasing gradient descent into wide valleys. J. Stat. Mech. Theory Exp. 2019, 124018 (2019).
Foret, P., Kleiner, A., Mobahi, H. & Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations (2021).
Baldassi, C., Pittorino, F. & Zecchina, R. Shaping the learning landscape in neural networks around wide flat minima. Proc. Natl Acad. Sci. USA 117, 161–170 (2020).
Yang, R., Mao, J. & Chaudhari, P. Does the data induce capacity control in deep learning? In International Conference on Machine Learning 25166–25197 (PMLR, 2022).
Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).
Krizhevsky, A., Nair, V. & Hinton, G. CIFAR-10. Canadian Institute for Advanced Research http://www.cs.toronto.edu/~kriz/cifar.html (2010).
Feng, Y. A–W duality in neural network: the flatter and smaller solution is better for generalization. Zenodo https://doi.org/10.5281/zenodo.8031053 (2023).
Acknowledgements
We thank K. Clarkson and R. Traub for careful reading of our paper and for their useful comments. Part of the work by Y.F. was done while he was an intern at IBM.
Author information
Contributions
Y.T. initiated the project and discovered the activity–weight duality. Y.F. and W.Z. did the numerical experiments. Y.T. and Y.F. developed the theory and analysed the results. All authors wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Pratik Chaudhari and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Text and Figs. 1–14.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Feng, Y., Zhang, W. & Tu, Y. Activity–weight duality in feed-forward neural networks reveals two co-determinants for generalization. Nat Mach Intell 5, 908–918 (2023). https://doi.org/10.1038/s42256-023-00700-x