Abstract
Proper optimization of deep neural networks is an open research question since the optimal procedure for changing the learning rate throughout training is still unknown. Manually defining a learning rate schedule involves troublesome, time-consuming trial-and-error procedures to determine hyperparameters such as learning rate decay epochs and learning rate decay rates. Although adaptive learning rate optimizers automate this process, recent studies suggest they may produce overfitting and reduce performance compared to fine-tuned learning rate schedules. Considering that the loss landscapes of deep neural networks present far more saddle points than local minima, we propose the Training Aware Sigmoidal Optimizer (TASO), which consists of a two-phase automated learning rate schedule. The first phase uses a high learning rate to traverse the numerous saddle points quickly, while the second phase uses a low learning rate to slowly approach the center of the previously found local minimum. We compared the proposed approach with commonly used adaptive learning rate optimizers such as Adam, RMSProp, and Adagrad. Our experiments showed that TASO outperformed all competing methods in both the optimal (i.e., performing hyperparameter validation) and suboptimal (i.e., using default hyperparameters) scenarios.
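The exact TASO update equation is not reproduced on this page, so the sketch below only illustrates the two-phase idea described in the abstract: a sigmoid-shaped learning rate curve that stays near a high value during the early epochs (to traverse saddle points quickly) and decays to a low value later (to settle into the local minimum), plugged into a standard PyTorch optimizer. The function `sigmoidal_schedule` and its parameters (`lr_max`, `lr_min`, `midpoint`, `steepness`) are illustrative assumptions, not the authors' definitions.

```python
# Minimal sketch of a two-phase, sigmoid-shaped learning rate schedule.
# This is an illustration of the idea in the abstract, not the paper's
# exact TASO formulation; all parameter names and values are assumed.
import math

import torch
from torch.optim.lr_scheduler import LambdaLR


def sigmoidal_schedule(epoch, total_epochs, lr_max=0.1, lr_min=0.001,
                       midpoint=0.5, steepness=10.0):
    """Return the learning rate for `epoch` on a sigmoid-shaped decay curve."""
    progress = epoch / total_epochs  # training fraction in [0, 1]
    decay = 1.0 / (1.0 + math.exp(steepness * (progress - midpoint)))
    return lr_min + (lr_max - lr_min) * decay


# Usage with any PyTorch optimizer: LambdaLR expects a multiplicative factor,
# so the schedule is divided by the base learning rate set in the optimizer.
total_epochs = 100
model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: sigmoidal_schedule(epoch, total_epochs) / 0.1,
)

for epoch in range(total_epochs):
    # ... run one epoch of training (forward, backward, optimizer.step()) ...
    scheduler.step()  # phase 1: lr stays near lr_max; phase 2: lr approaches lr_min
```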
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Macêdo, D., Dreyer, P., Ludermir, T., Zanchettin, C. (2022). Training Aware Sigmoidal Optimization. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_10
DOI: https://doi.org/10.1007/978-3-031-21689-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3