Abstract
Is it possible for a first-order method, i.e., only first derivatives allowed, to be quadratically convergent? For univariate loss functions, the answer is yes—the Steffensen method avoids second derivatives and is still quadratically convergent like Newton method. By incorporating a specific step size we can even push its convergence order beyond quadratic to \(1+\sqrt{2} \approx 2.414\). While such high convergence orders are a pointless overkill for a deterministic algorithm, they become rewarding when the algorithm is randomized for problems of massive sizes, as randomization invariably compromises convergence speed. We will introduce two adaptive learning rates inspired by the Steffensen method, intended for use in a stochastic optimization setting and requires no hyperparameter tuning aside from batch size. Extensive experiments show that they compare favorably with several existing first-order methods. When restricted to a quadratic objective, our stochastic Steffensen methods reduce to randomized Kaczmarz method—note that this is not true for SGD or SLBFGS—and thus we may also view our methods as a generalization of randomized Kaczmarz to arbitrary objectives.
Similar content being viewed by others
Data availability
We do not analyze or generate any datasets. The numerical experiments in Sect. 5 rely on standard datasets in LIBSVM that is publicly available from https://github.com/cjlin1/libsvm. The authors declare no conflict of interest.
Notes
References
Steffensen, J.F.: Remarks on iteration. Skand. Aktuarietidskr. 1, 64–72 (1933)
Steffensen, J.F.: Further remarks on iteration. Skand. Aktuarietidskr. 28, 44–55 (1945)
Amat, S., Ezquerro, J.A., Hernández-Verón, M.A.: On a Steffensen-like method for solving nonlinear equations. Calcolo 53(2), 171–188 (2016)
Ezquerro, J.A., Hernández-Verón, M.A., Rubio, M.J., Velasco, A.I.: An hybrid method that improves the accessibility of Steffensen’s method. Numer. Algorithms 66(2), 241–267 (2014)
Henrici, P.: Elements of Numerical Analysis. John Wiley, New York (1964)
Huang, H.Y.: Unified approach to quadratically convergent algorithms for function minimization. J. Optim. Theory Appl. 5, 405–423 (1970)
Johnson, L.W., Scholz, D.R.: On Steffensen’s method. SIAM J. Numer. Anal. 5, 296–302 (1968)
Nedzhibov, G.H.: An approach to accelerate iterative methods for solving nonlinear operator equations. In: Applications of Mathematics in Engineering and Economics (AMEE’11). AIP Conf. Proc., vol. 1410, pp. 76–82. Amer. Inst. Phys., Melville (2011)
Nievergelt, Y.: Aitken’s and Steffensen’s accelerations in several variables. Numer. Math. 59(3), 295–310 (1991)
Nievergelt, Y.: The condition of Steffensen’s acceleration in several variables. J. Comput. Appl. Math. 58(3), 291–305 (1995)
Noda, T.: The Aitken-Steffensen method in the solution of simultaneous nonlinear equations. Sūgaku 33(4), 369–372 (1981)
Noda, T.: The Aitken-Steffensen method in the solution of simultaneous nonlinear equations. II. Sūgaku 38(1), 83–85 (1986)
Noda, T.: The Aitken-Steffensen method in the solution of simultaneous nonlinear equations. III. Proc. Jpn. Acad. Ser. A Math. Sci. 62(5), 174–177 (1986)
Noda, T.: The Aitken-Steffensen formula for systems of nonlinear equations. IV. Proc. Jpn. Acad. Ser. A Math. Sci. 66(8), 260–263 (1990)
Noda, T.: The Aitken-Steffensen formula for systems of nonlinear equations. V. Proc. Jpn. Acad. Ser. A Math. Sci. 68(2), 37–40 (1992)
Gill, P.E., Murray, W., Wright, M.H.: Numerical linear algebra and optimization. In: Classics in Applied Mathematics, vol. 83. Society for Industrial and Applied Mathematics, Philadelphia (2021)
Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. In: Hatami, H., McKenzie, P., King, V. (eds.) STOC’17: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. Annual ACM Symposium on Theory of Computing, pp. 1200–1205 (2017)
Woodworth, B., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NIPS 2016). Advances in Neural Information Processing Systems, vol. 29 (2016)
Brezinski, C., Redivo-Zaglia, M.: Extrapolation and Rational Approximation–the Works of the Main Contributors. Springer, Cham (2020)
Householder, A.S.: The Numerical Treatment of a Single Nonlinear Equation. International Series in Pure and Applied Mathematics, McGraw-Hill, New York (1970)
Kaczmarz, S.: Angenäherte auflösung von systemen linearer gleichungen. Bull. Int. Acad. Polon. Sci. A 57(6), 355–357 (1937)
Kaczmarz, S.: Approximate solution of systems of linear equations. Int. J. Control 57(6), 1269–1271 (1993)
Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems 25, pp. 2672–2680 (2012)
Defazio, A., Bach, F.R., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems 27, pp. 1646–1654 (2014)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems 26, pp. 315–323 (2013)
Moritz, P., Nishihara, R., Jordan, M.I.: A linearly-convergent stochastic L-BFGS algorithm. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS. JMLR Workshop and Conference Proceedings, vol. 51, pp. 249–258 (2016)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
Zhao, R., Haskell, W.B., Tan, V.Y.: Stochastic L-BFGS: improved convergence rates and practical acceleration strategies. IEEE Trans. Signal Process. 66(5), 1155–1169 (2018)
Duchi, J.C., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR (2015)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
Poljak, B.T.: Some methods of speeding up the convergence of iterative methods. Ž. Vyčisl. Mat i Mat. Fiz. 4, 791–803 (1964)
Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145–151 (1999)
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. arXiv:1904.09237 (2019)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Broyden, C.G.: The convergence of a class of double-rank minimization algorithms. II. The new algorithm. J. Inst. Math. Appl. 6, 222–231 (1970)
Fletcher, R.: A new approach to variable metric algorithms. Comput. J. 13(3), 317–322 (1970)
Goldfarb, D.: A family of variable-metric methods derived by variational means. Math. Comput. 24, 23–26 (1970)
Shanno, D.F.: Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24, 647–656 (1970)
Potra, F.A.: On an iterative algorithm of order \(1.839\cdots \) for solving nonlinear operator equations. Numer. Funct. Anal. Optim. 7(1), 75–106 (1984/85)
Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program. 155(1–2, Ser. A), 549–573 (2016)
Babanezhad, R., Ahmed, M.O., Virani, A., Schmidt, M., Konečný, J., Sallinen, S.: Stopwasting my gradients: practical SVRG. In: Advances in Neural Information Processing Systems 28, pp. 2251–2259 (2015)
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
Nitanda, A.: Accelerated stochastic gradient descent for minimizing finite sums. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS. JMLR Workshop and Conference Proceedings, vol. 51, pp. 195–203 (2016)
Tan, C., Ma, S., Dai, Y., Qian, Y.: Barzilai-borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems 29, pp. 685–693 (2016)
Rockafellar, R.T.: Convex Analysis. Princeton Landmarks in Mathematics, Princeton University Press, Princeton (1997)
Acknowledgements
This work is partially supported by DARPA HR00112190040, NSF DMS-1854831, NSF ECCS-2216912, ONR N000142312863, and the Eckhardt Faculty Fund. We thank Nati Srebro for his exceptionally pertinent pointers and the two anonymous referees for their helpful comments. LHL thanks Junjie Yue for helpful discussions.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, M., Lai, Z. & Lim, LH. Stochastic Steffensen method. Comput Optim Appl 89, 1–32 (2024). https://doi.org/10.1007/s10589-024-00583-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10589-024-00583-7