Abstract
Proper optimization of deep neural networks is an open research question since the optimal procedure for changing the learning rate throughout training is still unknown. Manually defining a learning rate schedule involves troublesome, time-consuming trial-and-error procedures to determine hyperparameters such as learning rate decay epochs and learning rate decay rates. Although adaptive learning rate optimizers automate this process, recent studies suggest they may produce overfitting and reduce performance compared to fine-tuned learning rate schedules. Considering that the loss landscapes of deep neural networks present far more saddle points than local minima, we propose the Training Aware Sigmoidal Optimizer (TASO), which consists of a two-phase automated learning rate schedule. The first phase uses a high learning rate to traverse the numerous saddle points quickly, while the second phase uses a low learning rate to slowly approach the center of the previously found local minimum. We compared the proposed approach with commonly used adaptive learning rate optimizers such as Adam, RMSProp, and Adagrad. Our experiments showed that TASO outperformed all competing methods in both the optimal (i.e., performing hyperparameter validation) and suboptimal (i.e., using default hyperparameters) scenarios.
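The exact TASO update equation is not reproduced on this page, so the sketch below only illustrates the two-phase idea described in the abstract: a sigmoid-shaped learning rate curve that stays near a high value during the early epochs (to traverse saddle points quickly) and decays to a low value later (to settle into the local minimum), plugged into a standard PyTorch optimizer. The function `sigmoidal_schedule` and its parameters (`lr_max`, `lr_min`, `midpoint`, `steepness`) are illustrative assumptions, not the authors' definitions.

```python
# Minimal sketch of a two-phase, sigmoid-shaped learning rate schedule.
# This is an illustration of the idea in the abstract, not the paper's
# exact TASO formulation; all parameter names and values are assumed.
import math

import torch
from torch.optim.lr_scheduler import LambdaLR


def sigmoidal_schedule(epoch, total_epochs, lr_max=0.1, lr_min=0.001,
                       midpoint=0.5, steepness=10.0):
    """Return the learning rate for `epoch` on a sigmoid-shaped decay curve."""
    progress = epoch / total_epochs  # training fraction in [0, 1]
    decay = 1.0 / (1.0 + math.exp(steepness * (progress - midpoint)))
    return lr_min + (lr_max - lr_min) * decay


# Usage with any PyTorch optimizer: LambdaLR expects a multiplicative factor,
# so the schedule is divided by the base learning rate set in the optimizer.
total_epochs = 100
model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: sigmoidal_schedule(epoch, total_epochs) / 0.1,
)

for epoch in range(total_epochs):
    # ... run one epoch of training (forward, backward, optimizer.step()) ...
    scheduler.step()  # phase 1: lr stays near lr_max; phase 2: lr approaches lr_min
```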
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Macêdo, D., Dreyer, P., Ludermir, T., Zanchettin, C. (2022). Training Aware Sigmoidal Optimization. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_10
DOI: https://doi.org/10.1007/978-3-031-21689-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3