An optimal gradient method for smooth strongly convex minimization

  • Full Length Paper
  • Series A
  • Published:
Mathematical Programming

Abstract

We present an optimal gradient method for smooth strongly convex optimization. The method is optimal in the sense that its worst-case bound on the distance to an optimal point exactly matches the lower bound on the oracle complexity for the class of problems, meaning that no black-box first-order method can have a better worst-case guarantee without further assumptions on the class of problems at hand. In addition, we provide a constructive recipe for obtaining the algorithmic parameters of the method and illustrate that it can be used for deriving methods for other optimality criteria as well.

Notes

  1. The puzzled reader can verify this using basic symbolic computations. We provide a notebook for verifying the equivalence of the expressions in Sect. 4.

References

  1. Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory Comput. 15(1), 1–32 (2019)

  2. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  3. Cyrus, S., Hu, B., Van Scoy, B., Lessard, L.: A robust accelerated optimization algorithm for strongly convex functions. In: 2018 Annual American Control Conference (ACC), pages 1376–1381. IEEE (2018)

  4. d’Aspremont, A., Scieur, D., Taylor, A.: Acceleration methods. Found. and Trends® in Optim. 5(1–2), 1–245 (2021)

  5. De Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)

  6. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)

  7. Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. J. Complex. 68, 101590 (2022). https://doi.org/10.1016/j.jco.2021.101590

  8. Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184(1), 183–220 (2020)

  9. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145, 451–482 (2014)

  10. Gasnikov, A.V., Nesterov, Y.E.: Universal method for stochastic composite optimization problems. Comput. Math. Math. Phys. 58(1), 48–64 (2018)

  11. Gramlich, D., Ebenbauer, C., Scherer, C.W.: Convex synthesis of accelerated gradient algorithms for optimization and saddle point problems using Lyapunov functions. preprint arXiv:2006.09946 (2020)

  12. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)

  13. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)

  14. Kim, D., Fessler, J.A.: Optimizing the Efficiency of first-order methods for decreasing the gradient of smooth convex functions. J. Optim. Theory Appl. 188, 192–219 (2021). https://doi.org/10.1007/s10957-020-01770-2

  15. Lessard, L., Seiler, P.: Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate. In: 2020 American Control Conference (ACC), pages 119–125. IEEE (2020)

  16. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  17. Löfberg, J.: YALMIP: A toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference (2004)

  18. MOSEK ApS: The MOSEK optimization software. Online at http://www.mosek.com (2010)

  19. Nemirovski, A.: Optimization II: Numerical methods for nonlinear continuous optimization. Lecture notes, http://www2.isye.gatech.edu/~nemirovs/Lect_OptII.pdf (1999)

  20. Nemirovskii, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)

  21. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O\((1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)

  22. Nesterov, Y.: Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publishers (2004). ISBN 9781402075537

  23. Park, C., Park, J., Ryu, E.K.: Factor-\(\sqrt{2}\) acceleration of accelerated gradient methods. preprint arXiv:2102.07366 (2021)

  24. Taylor, A.: Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization. PhD thesis, Université catholique de Louvain (2017)

  25. Taylor, A., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In: Proceedings of the 2019 Conference on Learning Theory (COLT), volume 99, pages 2934–2992 (2019)

  26. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1278–1283. IEEE (2017a)

  27. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)

  28. Toker, O., Ozbay, H.: On the NP-hardness of solving bilinear matrix inequalities and simultaneous stabilization with static output feedback. In: 1995 American Control Conference (ACC), pages 2525–2526. IEEE (1995)

  29. Van Loan, C.F., Golub, G.H.: Matrix computations. Johns Hopkins University Press, Baltimore (1983)

  30. Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)

  31. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res. (JMLR) 22(113), 1–34 (2021)

  32. Zhou, K., So, A.M.-C., Cheng, J.: Boosting first-order methods by shifting objective: New schemes with faster worst case rates. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

Acknowledgements

The authors would like to thank Shuvomoy Das Gupta (https://shuvomoy.github.io/site/) for pointing out a few typos and for his helpful suggestions for improvements. We are also very grateful to two anonymous referees and an associate editor of Mathematical Programming for providing very constructive feedback on an early version of the manuscript.

Author information

Corresponding author

Correspondence to Adrien Taylor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A. Taylor acknowledges support from the European Research Council (grant SEQUOIA 724063). This work was funded in part by the French government under the management of the Agence Nationale de la Recherche, as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

Appendices

Alternate parametrization for first-order methods

In this section, we show that any method (7) can be reparametrized as (8) (and vice versa) using the identity (9). First note that, for \(k=1\),

$$\begin{aligned} \begin{aligned} w_1-x_\star&=w_0-x_\star -\frac{h_{1,0}}{L}(\nabla {\tilde{f}}(w_0)+\mu (w_0-x_\star ))\\&=(w_0-x_\star )(1-\frac{\mu }{L}h_{1,0})-\frac{h_{1,0}}{L}\nabla {\tilde{f}}(w_0), \end{aligned} \end{aligned}$$

and hence the equivalence holds for \(k=1\). Now, assuming the equivalence holds at iteration \(k\), let us check that it holds at iteration \(k+1\); that is, assume \(w_i-x_\star =(w_0-x_\star )\left( 1-\frac{\mu }{L}\sum _{j=0}^{i-1}\alpha _{i,j}\right) -\frac{1}{L}\sum _{j=0}^{i-1}\alpha _{i,j}\nabla {\tilde{f}}(w_j)\) for \(0\le i\le k\) and compute

$$\begin{aligned} \begin{aligned} w_{k+1}-x_\star&=w_k-x_\star -\frac{1}{L}\sum _{i=0}^k h_{k+1,i}(\nabla {\tilde{f}}(w_i)+\mu (w_i-x_\star ))\\&=(w_0-x_\star )\left( 1-\frac{\mu }{L}\sum _{i=0}^{k-1}\alpha _{k,i}-\frac{\mu }{L}\sum _{i=0}^k h_{k+1,i}+\frac{\mu ^2}{L^2}\sum _{i=0}^k \sum _{j=0}^{k-1} h_{k+1,i}\alpha _{i,j}\right) \\&\quad -\frac{1}{L}\sum _{i=0}^{k-1}\alpha _{k,i}\nabla {\tilde{f}}(w_i)-\frac{1}{L}\sum _{i=0}^k h_{k+1,i}\nabla {\tilde{f}}(w_i)\\ {}&\quad +\frac{\mu }{L^2}\sum _{i=0}^k \sum _{j=0}^{k-1}h_{k+1,i}\alpha _{i,j}\nabla {\tilde{f}}(w_j) \end{aligned} \end{aligned}$$

Reversing the order of summation in the double sums, renaming the indices, and reordering, we get

$$\begin{aligned}&w_{k+1}-x_\star \\&\quad =(w_0-x_\star )\left( 1-\frac{\mu }{L}\sum _{i=0}^{k-1}\alpha _{k,i}-\frac{\mu }{L}\sum _{i=0}^k h_{k+1,i}+\frac{\mu ^2}{L^2}\sum _{j=0}^{k-1} \sum _{i=j+1}^{k} h_{k+1,i}\alpha _{i,j}\right) \\&\qquad -\frac{1}{L}\sum _{i=0}^{k-1}\alpha _{k,i}\nabla {\tilde{f}}(w_i)-\frac{1}{L}\sum _{i=0}^k h_{k+1,i}\nabla {\tilde{f}}(w_i)+\frac{\mu }{L^2}\sum _{j=0}^{k-1} \sum _{i=j+1}^{k}h_{k+1,i}\alpha _{i,j}\nabla {\tilde{f}}(w_j)\\&\quad =(w_0-x_\star )\left( 1-\frac{\mu }{L}\sum _{i=0}^{k-1}\alpha _{k,i}-\frac{\mu }{L}\sum _{i=0}^k h_{k+1,i}+\frac{\mu ^2}{L^2}\sum _{i=0}^{k-1} \sum _{j=i+1}^{k} h_{k+1,j}\alpha _{j,i}\right) \\&\qquad -\frac{1}{L}\sum _{i=0}^{k-1}\alpha _{k,i}\nabla {\tilde{f}}(w_i)-\frac{1}{L}\sum _{i=0}^k h_{k+1,i}\nabla {\tilde{f}}(w_i)+\frac{\mu }{L^2}\sum _{i=0}^{k-1} \sum _{j=i+1}^{k}h_{k+1,j}\alpha _{j,i}\nabla {\tilde{f}}(w_i)\\&\quad =(w_0-x_\star )\left( 1-\frac{\mu }{L}h_{k+1,k}-\frac{\mu }{L}\sum _{i=0}^{k-1}\left( \alpha _{k,i}+ h_{k+1,i}-\frac{\mu }{L}\sum _{j=i+1}^{k} h_{k+1,j}\alpha _{j,i}\right) \right) \\&\qquad -\frac{1}{L} h_{k+1,k}\nabla {\tilde{f}}(w_k) -\frac{1}{L}\sum _{i=0}^{k-1}\left( \alpha _{k,i}+h_{k+1,i}-\frac{\mu }{L}\sum _{j=i+1}^{k}h_{k+1,j}\alpha _{j,i}\right) \nabla {\tilde{f}}(w_i). \end{aligned}$$

From this last reformulation, the choice (9), that is

$$\begin{aligned} \begin{aligned} \alpha _{k+1,i}=\left\{ \begin{array}{ll} h_{k+1,k} &{} \text {if } i=k \\ h_{k+1,i}+\alpha _{k,i}-\frac{\mu }{L}\sum \limits _{j=i+1}^k h_{k+1,j}\alpha _{j,i}\quad &{} \text {if } 0\le i <k, \end{array}\right. \end{aligned} \end{aligned}$$

ensures that the coefficients of all independent terms \((w_0-x_\star )\), \(\nabla {\tilde{f}}(w_0),\ldots ,\nabla {\tilde{f}}(w_k)\) are equal in both (7) and (8), which is the desired statement. In addition, note that this change of variables is reversible.
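For concreteness, the recurrence (9) can be implemented directly; the following minimal Python sketch is our own illustration (not code provided with the paper), and the dictionary-based indexing of the coefficients is a convention of the sketch.

```python
def h_to_alpha(h, mu, L, N):
    """Convert step sizes {h[(k, i)]} of form (7) into {alpha[(k, i)]} of form (8)
    via the recurrence (9); keys are (k, i) with k = 1, ..., N and 0 <= i < k."""
    alpha = {(1, 0): h[(1, 0)]}
    for k in range(1, N):
        alpha[(k + 1, k)] = h[(k + 1, k)]
        for i in range(k):
            alpha[(k + 1, i)] = (h[(k + 1, i)] + alpha[(k, i)]
                                 - mu / L * sum(h[(k + 1, j)] * alpha[(j, i)]
                                                for j in range(i + 1, k + 1)))
    return alpha

# example: the N = 2 step sizes reported in Sect. E.1 below (L = 1, mu = 0.1)
print(h_to_alpha({(1, 0): 1.5567, (2, 0): 0.1016, (2, 1): 1.7016}, mu=0.1, L=1.0, N=2))
```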

Algebraic manipulations of (dual-SDP-R)

In this section, we reformulate (dual-SDP-R) so that it can be optimized over the \(\alpha _{i,j}\)’s and the \(\lambda _{i,j}\)’s simultaneously. To do so, let us start by noting that

$$\begin{aligned} S(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\})\succeq 0 \Leftrightarrow \begin{pmatrix}S'(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\}) &{} {\mathbf {w}}_N \\ {\mathbf {w}}_N^\top &{} 1\end{pmatrix}\succeq 0, \end{aligned}$$

with \(S'(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\})=S(\tau ,\{\lambda _{i,j}\}, \{\alpha _{i,j}\})+{\mathbf {w}}_N{\mathbf {w}}_N^\top \), using a standard Schur complement (see, e.g., [29]). The motivation underlying this reformulation is that the lifted linear matrix inequality depends linearly on the \(\{\alpha _{N,i}\}_i\). Indeed, the coefficients of the last iteration only appear through the term \({\mathbf {w}}_N\), which is not present in \(S'\) (details below).
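As a small numerical sanity check of this Schur-complement equivalence (our own snippet; a generic symmetric matrix stands in for \(S\) and a random vector for \({\mathbf {w}}_N\), not the actual SDP data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
w = rng.standard_normal((n, 1))
psd = lambda M: np.linalg.eigvalsh(M).min() >= -1e-9

for shift in (0.0, 5.0):                      # the second shift typically makes S indefinite
    A = rng.standard_normal((n, n))
    S = A @ A.T - shift * np.eye(n)           # stands in for S(tau, {lambda_ij}, {alpha_ij})
    lifted = np.block([[S + w @ w.T, w],      # S' = S + w w^T in the top-left block
                       [w.T, np.ones((1, 1))]])
    print(psd(S), psd(lifted))                # the two flags agree in both cases
```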

We only consider the case \(N>1\) below. In the case \(N=1\), (6) can be solved without the following simplifications. Let us develop the expression of \(S'({\cdot })\) as follows

$$\begin{aligned}&S'(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\})\\&\quad =\,\tau {\mathbf {w}}_0{\mathbf {w}}_0^\top +\frac{1}{2(L-\mu )}\bigg (\lambda _{N-1,\star }\,{\mathbf {g}}_{N-1} {\mathbf {g}}_{N-1}^\top \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad +\sum _{i=0}^{N-1}\lambda _{\star ,i}{\mathbf {g}}_i{\mathbf {g}}_i^\top +\sum _{i=0}^{N-2}\lambda _{i,i+1}({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \bigg )\\&\qquad -\frac{1}{2}\lambda _{\star ,0}({\mathbf {g}}_0{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_0^\top )- \frac{1}{2}\sum _{i=1}^{N-1}(\lambda _{i-1,i}+\lambda _{\star ,i})({\mathbf {g}}_i{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_i^\top )\\&\qquad +\frac{1}{2}\sum _{i=0}^{N-2}\lambda _{i,i+1}({\mathbf {g}}_{i+1}{\mathbf {w}}_i^\top + {\mathbf {w}}_i{\mathbf {g}}_{i+1}^\top )\\&\quad =\,\tau {\mathbf {w}}_0{\mathbf {w}}_0^\top +\frac{1}{2(L-\mu )}\bigg (\lambda _{N-1,\star }\,{\mathbf {g}}_{N-1} {\mathbf {g}}_{N-1}^\top \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad +\sum _{i=0}^{N-1}\lambda _{\star ,i}{\mathbf {g}}_i{\mathbf {g}}_i^\top +\sum _{i=0}^{N-2}\lambda _{i,i+1}({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \bigg )\\&\qquad -\frac{1}{2}\lambda _{\star ,0}({\mathbf {g}}_0{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_0^\top )- \frac{1}{2}\sum _{i=1}^{N-2}\lambda _{i,i+1}({\mathbf {g}}_i{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_i^\top )\\&\qquad -\frac{1}{2}\lambda _{N-1,\star }({\mathbf {g}}_{N-1}{\mathbf {w}}_{N-1}^\top +{\mathbf {w}}_{N-1} {\mathbf {g}}_{N-1}^\top )+\frac{1}{2}\sum _{i=0}^{N-2}\lambda _{i,i+1}({\mathbf {g}}_{i+1}{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_{i+1}^\top ), \end{aligned}$$

where we used \(\lambda _{i-1,i}+\lambda _{\star ,i}=\lambda _{i,i+1}\) (for \(i=1,\ldots ,N-2\)), and \(\lambda _{N-2,N-1}+\lambda _{\star ,N-1}=\lambda _{N-1,\star }\) to obtain the second equality. Substituting the expressions for the \({\mathbf {w}}_i\)’s, we arrive at

$$\begin{aligned} \begin{aligned}&S'(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\})\\&\quad =\,\tau {\mathbf {w}}_0{\mathbf {w}}_0^\top +\frac{1}{2(L-\mu )}\left( \lambda _{N-1,\star }\,{\mathbf {g}}_{N-1} {\mathbf {g}}_{N-1}^\top +\sum _{i=0}^{N-1}\lambda _{\star ,i}{\mathbf {g}}_i{\mathbf {g}}_i^\top \right. \\&\left. \qquad \qquad \qquad \qquad \qquad \qquad \qquad +\sum _{i=0}^{N-2} \lambda _{i,i+1}({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \right) \\&\qquad -\frac{1}{2}\lambda _{\star ,0}({\mathbf {g}}_0{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_0^\top ) -\sum _{i=1}^{N-2}\lambda _{i,i+1}\left( 1-\frac{\mu }{L}\sum _{j=0}^{i-1}\alpha _{i,j} \right) \,\frac{1}{2} ({\mathbf {g}}_i{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_i^\top )\\&\qquad +\sum _{i=1}^{N-2}\lambda _{i,i+1}\sum _{j=0}^{i-1}\frac{\alpha _{i,j}}{L}\, \frac{1}{2}({\mathbf {g}}_i{\mathbf {g}}_j^\top +{\mathbf {g}}_j{\mathbf {g}}_i^\top )\\ {}&\qquad -\lambda _{N-1,\star } \left( 1-\frac{\mu }{L}\sum _{j=0}^{N-2}\alpha _{N-1,j}\right) \, \frac{1}{2}({\mathbf {g}}_{N-1}{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_{N-1}^\top )\\&\qquad +\lambda _{N-1,\star }\sum _{j=0}^{N-2}\frac{\alpha _{N-1,j}}{L}\, \frac{1}{2}({\mathbf {g}}_{N-1}{\mathbf {g}}_j^\top +{\mathbf {g}}_j{\mathbf {g}}_{N-1}^\top )\\&\qquad +\sum _{i=0}^{N-2}\lambda _{i,i+1}\left( 1-\frac{\mu }{L} \sum _{j=0}^{i-1}\alpha _{i,j}\right) \,\frac{1}{2}({\mathbf {g}}_{i+1}{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_{i+1}^\top )\\&\qquad -\sum _{i=0}^{N-2}\lambda _{i,i+1}\sum _{j=0}^{i-1}\frac{\alpha _{i,j}}{L}\, \frac{1}{2}({\mathbf {g}}_{i+1}{\mathbf {g}}_j^\top +{\mathbf {g}}_j{\mathbf {g}}_{i+1}^\top ), \end{aligned} \end{aligned}$$

where we simply split each \({\mathbf {w}}_k\) into two terms: one carrying the contribution of \({\mathbf {w}}_0\), the other carrying the contributions of the \({\mathbf {g}}_i\)’s. Although not pretty, this expression shows that \(S'\) is still bilinear in \(\{\alpha _{i,j}\}\) and \(\{\lambda _{i,j}\}\). It could be simplified further, but this form suffices for our purposes.

The information-theoretic exact method is a solution to (Minimax-R)

Proof of Theorem 4

For readability purposes, we establish the claim without explicitly computing the optimal values of the variables \(\{\alpha _{i,j}\}\) and \(\{\beta _{i,j}\}\) for ITEM. To avoid this step, let us note that ITEM is clearly a fixed-step first-order method following Definition 2. Therefore, following Lemma 3, the method can also be written in the alternate parametrization (8), using the association \(w_i\leftarrow y_i\) (\(i=0,1,\ldots ,N-1\)) and \(w_N\leftarrow z_N\), where \(y_k\) and \(z_k\) are the sequences defined by (3). Let \(\{\alpha _{i,j}^\star \}\) denote the step sizes corresponding to ITEM written in the form (8). We proceed to show that by choosing

$$\begin{aligned} \begin{aligned} {\lambda _{\star ,i}^\star }&=\frac{1-q}{L}\frac{A_{i+1}-A_i}{1+qA_N}\\ {\lambda _{i,i+1}^\star }&=\frac{1-q}{L}\frac{A_{i+1}}{1+qA_N}\\ {\lambda _{N-1,\star }^\star }&=\frac{1-q}{L}\frac{A_N}{1+qA_N}\\ {\tau ^\star }&= \frac{1}{1+qA_N} \end{aligned} \end{aligned}$$
(16)

and setting \(\{\beta _{i,j}^{\star }\}\) in accordance with (14), we reach a feasible solution to (Minimax-R). Note that optimality of the solution follows from the value of \(\tau ^\star \), which matches the lower complexity bound discussed in Sect. 2.3.

To establish dual feasibility, we relate (Minimax-R) to the Lagrangian of (R). That is, denoting

$$\begin{aligned} K=S''(\tau ^\star ,\{\lambda _{i,j}^\star \},\{\beta _{i,j}^{\star }\})-{\mathbf {w}}_N{\mathbf {w}}_N^\top =S(\tau ^\star ,\{\lambda _{i,j}^\star \},\{\alpha _{i,j}^\star \}),\end{aligned}$$

we have, for all \((F,G)\) as in (11), by construction,

$$\begin{aligned}&F\left( \sum _{i=0}^{N-2} \lambda _{i,i+1}^\star ({\mathbf {f}}_{i+1}-{\mathbf {f}}_i)+\sum _{i=0}^{N-1}\lambda _{\star ,i}^\star {\mathbf {f}}_i-\lambda _{N-1,\star }^\star {\mathbf {f}}_{N-1}\right) +\mathrm {Tr}(KG)\\&\quad \,=\tau ^\star \Vert w_0-w_\star \Vert ^2-\Vert w_N-w_\star \Vert ^2\\&\qquad +\sum _{i=0}^{N-2}\lambda _{i,i+1}^\star \bigg [ {\tilde{f}}(w_{i+1})- {\tilde{f}}(w_i)+\langle \nabla {\tilde{f}}(w_{i+1});w_i-w_{i+1}\rangle \\&\qquad \qquad \qquad +\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}} (w_i)-\nabla {\tilde{f}}(w_{i+1})\Vert ^2\bigg ]\\&\qquad +\sum _{i=0}^{N-1}{\lambda _{\star ,i}^\star }\left[ {\tilde{f}}(w_i) -{\tilde{f}}_\star +\langle \nabla {\tilde{f}}(w_i),w_\star -w_{i}\rangle +\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}}(w_{i})\Vert ^2\right] \\&\qquad +{\lambda _{N-1,\star }^\star }\left[ {\tilde{f}}_\star -{\tilde{f}}(w_{N-1}) +\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}}(w_{N-1})\Vert ^2\right] . \end{aligned}$$

Using the association \(w_i\leftarrow y_i\) (\(i=0,1,\ldots ,N-1\)) and \(w_N\leftarrow z_N\), as well as \(f(y_i)=\tilde{f}(y_i)+\frac{\mu }{2}\Vert y_i-x_\star \Vert ^2\), it follows from Lemma 1 (the weighted sums below are the same as those of Lemma 1, written in terms of \(\tilde{f}(\cdot )\) instead of \(f(\cdot )\) and scaled by a factor \(\frac{1}{L+\mu A_N}\); their reformulations are therefore also the same up to the rescaling) that for \(i=1,\ldots ,N-1\)

$$\begin{aligned} \begin{aligned}&{\lambda _{\star ,i}^\star }\left[ {\tilde{f}}(w_i)-{\tilde{f}}_\star +\langle \nabla {\tilde{f}}(w_i),w_\star -w_{i}\rangle +\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}}(w_{i})\Vert ^2\right] \\&\quad +{\lambda _{i-1,i}^\star }\bigg [ {\tilde{f}}(w_{i})- {\tilde{f}}(w_{i-1})+\langle \nabla {\tilde{f}}(w_{i});w_{i-1}-w_{i}\rangle \\&\qquad \qquad \quad \, +\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}} (w_i)-\nabla {\tilde{f}}(w_{i-1})\Vert ^2\bigg ]=\frac{1}{L+\mu A_N}(\phi _{i+1}-\phi _{i}), \end{aligned} \end{aligned}$$

as well as

$$\begin{aligned} \begin{aligned}&{\lambda _{\star ,0}^\star }\left[ {\tilde{f}}(w_0)-{\tilde{f}}_\star +\langle \nabla {\tilde{f}}(w_0),w_\star -w_{0}\rangle +\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}} (w_{0})\Vert ^2\right] \\&\quad =\frac{1}{L+\mu A_N}(\phi _1-\phi _0),\\&{\lambda _{N-1,\star }^\star }\left[ {\tilde{f}}_\star -{\tilde{f}}(w_{N-1})+\frac{1}{2(L-\mu )}\Vert \nabla {\tilde{f}}(w_{N-1})\Vert ^2\right] =-\frac{(1-q)A_N}{L+\mu A_N}\psi _{N-1}. \end{aligned} \end{aligned}$$

In addition, using \(\phi _0=L\Vert w_0-w_\star \Vert ^2\), we reach the following reformulation

$$\begin{aligned} \begin{aligned}&F\left( \sum _{i=0}^{N-2} {\lambda _{i,i+1}^\star }({\mathbf {f}}_{i+1}-{\mathbf {f}}_i)+\sum _{i=0}^{N-1}{\lambda _{\star ,i}^\star }{\mathbf {f}}_i-{\lambda _{N-1,\star }^\star }{\mathbf {f}}_{N-1}\right) +\mathrm {Tr}(KG)\\ {}&=\frac{1}{L+\mu A_N}\left( \phi _0-(1-q)A_N\psi _{N-1}+\sum _{i=0}^{N-1}(\phi _{i+1}-\phi _i)\right) -\Vert z_N-w_\star \Vert ^2\\ {}&=0, \end{aligned} \end{aligned}$$

where the last equality follows from \(\phi _N=(1-q)A_N\psi _{N-1}+(L+\mu A_N)\Vert z_N-w_\star \Vert ^2\). Therefore, ITEM is a solution to (Minimax-R). Expressed more directly in terms of the SDP (Minimax-R), one can verify that

$$\begin{aligned} \begin{aligned}&{\lambda _{\star ,0}^\star }-{\lambda _{0,1}^\star }=\frac{1-q}{L}\left( \frac{A_{1}}{1+qA_N}-\frac{A_{1}}{1+qA_N}\right) =0&\\&{\lambda _{i-1,i}^\star }+{\lambda _{\star ,i}^\star }-{\lambda _{i,i+1}^\star }=\frac{1-q}{L}\left( \frac{A_{i}}{1+qA_N}+\frac{A_{i+1}-A_i}{1+qA_N}-\frac{A_{i+1}}{1+qA_N}\right) =0 \quad (i=1,\hdots ,N-2)\\&{\lambda _{N-2,N-1}^\star }+{\lambda _{\star ,N-1}^\star }-{\lambda _{N-1,\star }^\star }=\frac{1-q}{L}\left( \frac{A_{N-1}}{1+qA_N}+\frac{A_{N}-A_{N-1}}{1+qA_N}-\frac{A_N}{1+qA_N}\right) =0,&\end{aligned} \end{aligned}$$

and the previous computations therefore imply that \(K=0\) for ITEM, and hence \(S''(\tau ,\{\lambda _{i,j}\},{\{\beta _{i,j}\}})={\mathbf {w}}_N{\mathbf {w}}_N^\top \succeq 0\). \(\square \)

An SDP formulation for optimizing function values

In this section, we show how to adapt the methodology developed in Sect. 3 to a family of alternative design criteria, which includes \((f(w_N)-f_\star )/\Vert w_0-w_\star \Vert ^2\) and \((f(w_N)-f_\star )/(f(w_0)-f_\star )\) (for which numerical examples are provided in Sect. E.1 and Sect. E.2, respectively). The developments differ slightly from those required for optimizing \({\Vert w_N-w_\star \Vert ^2}/{\Vert w_0-w_\star \Vert ^2}\); for readability, we do not present a unified version in the main text. In particular, the set of selected inequalities is slightly different, which alters the linearization procedure.

The criteria we deal with in this section are of the form

$$\begin{aligned}&\frac{f(w_N)-f_\star }{c_w\Vert w_0-w_\star \Vert ^2+c_f(f(w_0)-f_\star )}\\&\quad =\frac{{\tilde{f}}(w_N)- f_\star +\frac{\mu }{2}\Vert w_N-w_\star \Vert ^2}{c_w\Vert w_0-w_\star \Vert ^2+c_f({\tilde{f}}(w_0)-f_\star +\frac{\mu }{2}\Vert w_0-w_\star \Vert ^2)}. \end{aligned}$$

As the steps are essentially the same as those detailed in Sect. 3, we proceed without providing much detail. We start with the discrete version, using the set \(I=\{\star ,0,\ldots ,N\}\):

$$\begin{aligned} \begin{aligned} \max _{\begin{array}{c} \{(w_i,g_i,f_i)\}_{i\in I}\\ d\in {\mathbb {N}} \end{array}}&f_N-f_\star +\frac{\mu }{2}\Vert w_N-w_\star \Vert ^2&\\ \text {s.t. }&c_w\Vert w_0-w_\star \Vert ^2+c_f(f_0-f_\star +\frac{\mu }{2}\Vert w_0-w_\star \Vert ^2)=1,\, g_\star =0&\\&w_{k} \text { generated by}~(8)&\text {for }k=1,\ldots ,N\\&f_i\ge f_j+\langle g_j;w_i-w_j\rangle +\frac{1}{2(L-\mu )}\Vert g_i-g_j\Vert ^2\quad&\text {for all }i,j\in I. \end{aligned} \end{aligned}$$

The upper bound we use is now very slightly different (the selected subset of constraints is not the same as that of (R))

$$\begin{aligned} \begin{aligned} \mathrm {UB}_{\mu ,L}(\{\alpha _{i,j}\})=\max _{\begin{array}{c} \{(w_i,g_i,f_i)\}_{i\in I}\\ d\in {\mathbb {N}} \end{array}}&f_N-f_\star +\frac{\mu }{2}\Vert w_N-w_\star \Vert ^2&\\ \text {s.t. }&c_w\Vert w_0-w_\star \Vert ^2+c_f(f_0-f_\star +\frac{\mu }{2}\Vert w_0-w_\star \Vert ^2)=1,\, g_\star =0&\\&w_{k} \text { generated by}~(8)&\text {for }k=1,\ldots ,N\\&f_i\ge f_{i+1}+\langle g_{i+1};w_i-w_{i+1}\rangle +\frac{1}{2(L-\mu )}\Vert g_i-g_{i+1}\Vert ^2&\text {for }i=0,\ldots ,N-1\\&f_\star \ge f_i+\langle g_i;w_\star -w_{i}\rangle +\frac{1}{2(L-\mu )}\Vert g_{i}\Vert ^2&\text {for }i=0,\ldots ,N. \end{aligned} \end{aligned}$$

The corresponding SDP can be written using a similar couple \((G,F)\)

$$\begin{aligned} \begin{aligned} G&=\begin{pmatrix} \Vert w_0-w_\star \Vert ^2 &{} \langle g_0;w_0-w_\star \rangle &{} \langle g_1;w_0-w_\star \rangle &{}\ldots &{} \langle g_{N};w_0-w_\star \rangle \\ \langle g_0;w_0-w_\star \rangle &{} {\Vert g_0\Vert ^2} &{} \langle g_1;g_0\rangle &{} \ldots &{}\langle g_{N};g_0\rangle \\ \langle g_1;w_0-w_\star \rangle &{} \langle g_1;g_0\rangle &{} {\Vert g_1\Vert ^2} &{}\ldots &{}\langle g_{N};g_1\rangle \\ \vdots &{} \vdots &{} \vdots &{}\ddots &{} \vdots \\ \langle g_{N};w_0-w_\star \rangle &{} \langle g_{N};g_0\rangle &{} \langle g_{N};g_1\rangle &{}\ldots &{}\Vert g_{N}\Vert ^2\\ \end{pmatrix}\\ F&=\begin{pmatrix}f_0-f_\star \\ f_1-f_\star \\ \vdots \\ f_{N}-f_\star \end{pmatrix}, \end{aligned} \end{aligned}$$

and analogous notations

$$\begin{aligned} {\mathbf {w}}_0=e_1\in {\mathbb {R}}^{N+2},\quad {\mathbf {g}}_i=e_{i+2}\in {\mathbb {R}}^{N+2},\quad {\mathbf {f}}_i=e_{i+1}\in {\mathbb {R}}^{N+1}, \end{aligned}$$

with \(i=0,\ldots ,N\) and \(e_i\) being the unit vector whose ith component is equal to 1. In addition, we write

$$\begin{aligned} {\mathbf {w}}_k={\mathbf {w}}_0\left( 1-\frac{\mu }{L}\sum _{i=0}^{k-1}\alpha _{k,i}\right) -\sum _{i=0}^{k-1}\frac{\alpha _{k,i}}{L}{\mathbf {g}}_i. \end{aligned}$$

A dual formulation of \(\mathrm {UB}_{\mu ,L}\) is given by (we directly included the Schur complement)

$$\begin{aligned} \begin{aligned} \mathrm {UB}_{\mu ,L}(\{\alpha _{i,j}\})=\min _{\tau ,\lambda _{i,j}\ge 0}&\tau ,&\\ \text {s.t.}&\,\begin{pmatrix}{{\bar{S}}}'(\tau ,\{\lambda _{i,j}\},\{{\alpha _{i,j}}\}) &{}\qquad \sqrt{\mu }{\mathbf {w}}_N \\ \sqrt{\mu }{\mathbf {w}}_N^\top &{} 2\end{pmatrix}\succeq 0,&\\&\tau \, c_f\, {\mathbf {f}}_{0}+\sum _{i=0}^{N-1} \lambda _{i,i+1}({\mathbf {f}}_{i+1}-{\mathbf {f}}_i)+\sum _{i=0}^{N}\lambda _{\star ,i}{\mathbf {f}}_i={\mathbf {f}}_{N} \end{aligned} \end{aligned}$$

with

$$\begin{aligned} \begin{aligned} {{\bar{S}}}'(\tau ,\{\lambda _{i,j}\},\{{\alpha _{i,j}}\})=&\,\tau \, (c_w+c_f\frac{\mu }{2}) {\mathbf {w}}_0{\mathbf {w}}_0^\top +\sum _{i=0}^{N} \frac{\lambda _{\star ,i}}{2}\left( -{\mathbf {g}}_i {\mathbf {w}}_i^\top -{\mathbf {w}}_i{\mathbf {g}}_i^\top +\frac{1}{L-\mu }{\mathbf {g}}_i{\mathbf {g}}_i^\top \right) \\&+\sum _{i=0}^{N-1}\frac{\lambda _{i,i+1}}{2}\bigg ({\mathbf {g}}_{i+1}({\mathbf {w}}_i-{\mathbf {w}}_{i+1})^\top +({\mathbf {w}}_i-{\mathbf {w}}_{i+1}){\mathbf {g}}_{i+1}^\top \\&\qquad \qquad \qquad +\frac{1}{L-\mu }({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \bigg ). \end{aligned} \end{aligned}$$

Note that the equality constraint corresponds to

$$\begin{aligned} \begin{aligned}&\tau \, c_f+\lambda _{\star ,0}-\lambda _{0,1}=0&\\&\lambda _{i-1,i}+\lambda _{\star ,i}-\lambda _{i,i+1}=0 \quad&\text {for }i=1,\ldots ,N-1\\&\lambda _{N-1,N}+\lambda _{\star ,N}=1.&\end{aligned} \end{aligned}$$

We perform some additional work on \({{\bar{S}}}'\) (whose dependence on \(\{\alpha _{i,j}\}\) is implicit through the dependence on \(\{{\mathbf {w}}_i\}\)), as before:

$$\begin{aligned}&{{\bar{S}}}'(\tau ,\{\lambda _{i,j}\},\{{\alpha _{i,j}}\})\\&\quad =\,\tau \, (c_w+c_f\frac{\mu }{2}) {\mathbf {w}}_0{\mathbf {w}}_0^\top \\&\qquad +\frac{1}{2(L-\mu )}\left( \sum _{i=0}^{N} {\lambda _{\star ,i}}{\mathbf {g}}_i{\mathbf {g}}_i^\top +\sum _{i=0}^{N-1}{\lambda _{i,i+1}} ({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \right) \\&\qquad -\sum _{i=0}^{N} {\lambda _{\star ,i}}\frac{1}{2}\left( {\mathbf {g}}_i {\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_i^\top \right) \\&\qquad +\sum _{i=0}^{N-1}{\lambda _{i,i+1}}\frac{1}{2}\left( {\mathbf {g}}_{i+1} ({\mathbf {w}}_i-{\mathbf {w}}_{i+1})^\top +({\mathbf {w}}_i-{\mathbf {w}}_{i+1}){\mathbf {g}}_{i+1}^\top \right) \\&\quad =\,\tau \, (c_w+c_f\frac{\mu }{2}) {\mathbf {w}}_0{\mathbf {w}}_0^\top \\&\qquad +\frac{1}{2(L-\mu )} \left( \sum _{i=0}^{N} {\lambda _{\star ,i}}{\mathbf {g}}_i{\mathbf {g}}_i^\top +\sum _{i=0}^{N-1} {\lambda _{i,i+1}}({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \right) \\&\qquad -\lambda _{\star ,0}\frac{1}{2}({\mathbf {g}}_0{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_0^\top )- \sum _{i=1}^N(\lambda _{\star ,i}+\lambda _{i-1,i})\frac{1}{2}({\mathbf {g}}_i{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_i^\top )\\&\qquad +\sum _{i=0}^{N-1}\lambda _{i,i+1}\frac{1}{2}({\mathbf {g}}_{i+1}{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_{i+1}^\top )\\&\quad =\,\tau \, (c_w+c_f\frac{\mu }{2}) {\mathbf {w}}_0{\mathbf {w}}_0^\top \\&\qquad +\frac{1}{2(L-\mu )}\left( \sum _{i=0}^{N} {\lambda _{\star ,i}}{\mathbf {g}}_i{\mathbf {g}}_i^\top +\sum _{i=0}^{N-1}{\lambda _{i,i+1}} ({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \right) \\&\qquad -\lambda _{\star ,0}\frac{1}{2}({\mathbf {g}}_0{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_0^\top )- \sum _{i=1}^{N-1}\lambda _{i,i+1}\frac{1}{2}({\mathbf {g}}_i{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_i^\top )- \frac{1}{2}({\mathbf {g}}_N{\mathbf {w}}_N^\top +{\mathbf {w}}_N{\mathbf {g}}_N^\top )\\&\qquad +\sum _{i=0}^{N-1}\lambda _{i,i+1}\frac{1}{2}({\mathbf {g}}_{i+1}{\mathbf {w}}_i^\top +{\mathbf {w}}_i{\mathbf {g}}_{i+1}^\top ), \end{aligned}$$

where we used \(\lambda _{\star ,i}+\lambda _{i-1,i}=\lambda _{i,i+1}\) (for \(i=1,\ldots ,N-1\)) and \(\lambda _{\star ,N}+\lambda _{N-1,N}=1\). Now, making the dependence on the \(\alpha _{i,j}\)’s explicit again, we arrive at

$$\begin{aligned}&{{\bar{S}}}'(\tau ,\{\lambda _{i,j}\},\{{\alpha _{i,j}}\})\\&\quad =\,\tau \, (c_w+c_f\frac{\mu }{2}) {\mathbf {w}}_0{\mathbf {w}}_0^\top \\&\qquad +\frac{1}{2(L-\mu )}\left( \sum _{i=0}^{N} {\lambda _{\star ,i}}{\mathbf {g}}_i{\mathbf {g}}_i^\top +\sum _{i=0}^{N-1}{\lambda _{i,i+1}} ({\mathbf {g}}_i-{\mathbf {g}}_{i+1})({\mathbf {g}}_i-{\mathbf {g}}_{i+1})^\top \right) \\&\qquad -\lambda _{\star ,0}\frac{1}{2}({\mathbf {g}}_0{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_0^\top )- \sum _{i=1}^{N-1}\lambda _{i,i+1}\left( 1-\frac{\mu }{L}\sum _{j=0}^{i-1} \alpha _{i,j}\right) \frac{1}{2}({\mathbf {g}}_i{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_i^\top )\\&\qquad +\sum _{i=1}^{N-1}\lambda _{i,i+1}\sum _{j=0}^{i-1}\frac{\alpha _{i,j}}{L}\, \frac{1}{2}({\mathbf {g}}_i{\mathbf {g}}_j^\top +{\mathbf {g}}_j{\mathbf {g}}_i^\top )- \left( 1-\frac{\mu }{L}\sum _{j=0}^{N-1}\alpha _{N,j}\right) \frac{1}{2}({\mathbf {g}}_N{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_N^\top )\\&\qquad +\sum _{j=0}^{N-1}\frac{\alpha _{N,j}}{L}\, \frac{1}{2}({\mathbf {g}}_N{\mathbf {g}}_j^\top +{\mathbf {g}}_j{\mathbf {g}}_N^\top ) +\sum _{i=0}^{N-1}\lambda _{i,i+1}\left( 1-\frac{\mu }{L} \sum _{j=0}^{i-1}\alpha _{i,j}\right) \frac{1}{2}({\mathbf {g}}_{i+1}{\mathbf {w}}_0^\top +{\mathbf {w}}_0{\mathbf {g}}_{i+1}^\top )\\&\qquad -\sum _{i=0}^{N-1}\lambda _{i,i+1}\sum _{j=0}^{i-1}\frac{\alpha _{i,j}}{L}\, \frac{1}{2}({\mathbf {g}}_{i+1}{\mathbf {g}}_j^\top +{\mathbf {g}}_j{\mathbf {g}}_{i+1}^\top ) \end{aligned}$$

and it remains to remark that the change of variables

$$\begin{aligned} \begin{aligned} {\beta _{i,j}}=\left\{ \begin{array}{llll} \lambda _{i,i+1}\alpha _{i,j} &{}\quad \text {if } 0\le i \le N-1 \\ \alpha _{N,j} &{}\text {if } i=N. \end{array}\right. \end{aligned} \end{aligned}$$
(17)

again linearizes the bilinear matrix inequality, so that the SDP (13) can be solved using standard packages. Numerical results for the pairs \((c_w,c_f)=(1,0)\) and \((c_w,c_f)=(0,1)\) are provided in Sect. E.1 and Sect. E.2, respectively. Source code for implementing those SDPs is provided in Sect. 4.
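Once the linearized SDP is solved, the step sizes can presumably be recovered by inverting (17) whenever the corresponding multipliers are strictly positive; a minimal sketch (our own helper, with our own naming), kept next to the change of variables for reference:

```python
def beta_to_alpha(beta, lam, N):
    """Undo the change of variables (17): beta[(i, j)] = lam[i] * alpha[(i, j)]
    for i <= N - 1 and beta[(N, j)] = alpha[(N, j)], assuming lam[i] = lambda_{i,i+1} > 0."""
    return {(i, j): (b if i == N else b / lam[i]) for (i, j), b in beta.items()}
```

(When a multiplier \(\lambda _{i,i+1}\) vanishes, this direct recovery is of course not possible.)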

Numerical examples

As shown in Appendix D, slight modifications of the relaxations used for obtaining (Minimax-R) allow forming tractable problems for optimizing the parameters of fixed-step methods under different optimality criteria. Although we were unable to obtain closed-form solutions to the problems arising for these alternative criteria, the resulting problems can still be approximated numerically for specific values of \(\mu \), L and N.

In the following, we provide a couple of examples that were obtained by numerically solving the first-order method design problem (Minimax-R), formulated as a linear semidefinite program using standard solvers [17, 18].

1.1 Optimized methods for \((f(w_N)-f_\star )/{\Vert w_0-w_\star \Vert ^2}\)

As a first example, we consider the criterion \(({f(w_N)-f_\star })/{\Vert w_0-w_\star \Vert ^2}\). The following list provides solutions obtained by solving the corresponding design problem for \(N=1,\ldots ,5\) with \(L=1\) and \(\mu =0.1\). The solutions are presented using the notations from (7), together with the corresponding worst-case guarantees.

  • For a single iteration, \(N=1\), we obtain a method with guarantee \(\frac{f(w_1)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.1061\) and step size

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.4606 \end{bmatrix}. \end{aligned}$$

    Both the bound and the step size match the known optimal one-iteration values; in particular, the step size equals \(h_{1,0}=\frac{q+1-\sqrt{q^2-q+1}}{q}\), see [24, Theorem 4.14].

  • For \(N=2\) iterations, we obtain \(\frac{f(w_2)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0418\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5567 &{} \\ 0.1016 &{} \,\,\,1.7016 \end{bmatrix}. \end{aligned}$$
  • For \(N=3\), we obtain \(\frac{f(w_3)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0189\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5512 &{} &{} \\ 0.1220 &{} \,\,\,1.8708 &{} \\ 0.0316 &{} \,\,\,0.2257 &{} \,\,\,1.8019 \end{bmatrix}. \end{aligned}$$
  • For \(N=4\), we obtain \(\frac{f(w_4)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0089\), with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5487 &{} &{} &{} \\ 0.1178 &{} \,\,\,1.8535 &{} &{} \\ 0.0371 &{} \,\,\,0.2685 &{} \,\,\,2.0018 &{} \\ 0.0110 &{} \,\,\,0.0794 &{} \,\,\,0.2963 &{} \,\,\,1.8497 \end{bmatrix}.\end{aligned}$$
  • Finally, for \(N=5\), we obtain \(\frac{f(w_5)-f_\star }{\Vert w_0-w_\star \Vert ^2}\le 0.0042\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5476 &{} &{} &{} &{} \\ 0.1159 &{} \,\,\,1.8454 &{} &{} &{} \\ 0.0350 &{} \,\,\,0.2551 &{} \,\,\,1.9748 &{} &{} \\ 0.0125 &{} \,\,\,0.0913 &{} \,\,\,0.3489 &{} \,\,\,2.0625 &{} \\ 0.0039 &{} \,\,\,0.0287 &{} \,\,\,0.1095 &{} \,\,\,0.3334 &{} \,\,\,1.8732 \end{bmatrix}. \end{aligned}$$

Note that when \(\mu =0\), we recover the step size policy of the OGM by Kim and Fessler [12]. When \(\mu >0\), the resulting optimized method is apparently less practical, as the step sizes critically depend on the horizon N; in particular, one can observe that \(h_{1,0}^\star \) varies with N.
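As a quick arithmetic check of the closed-form step size quoted for the \(N=1\) case above (our own verification snippet):

```python
import math

L, mu = 1.0, 0.1
q = mu / L
h10 = (q + 1 - math.sqrt(q**2 - q + 1)) / q
print(round(h10, 4))  # 1.4606, matching the N = 1 step size reported above
```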

Figure 1 illustrates the behavior of the worst-case guarantee for larger values of N and compares it to the currently best known corresponding lower bound, as well as to the worst-case guarantees of the triple momentum method (TMM), of Nesterov’s fast gradient method (FGM) for strongly convex functions, and of the methods generated with the SSEP procedure from [8]. All the worst-case guarantees are computed numerically using the corresponding performance estimation problems (see, e.g., the toolbox [26]); as a result, they are tight, in the sense that inputs on which the algorithms attain the bounds can be constructed numerically.
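For completeness, the following is a hedged sketch of such a performance estimation computation for the criterion \(({f(w_N)-f_\star })/{\Vert w_0-w_\star \Vert ^2}\): our own illustration (not the code accompanying the paper), written with cvxpy and assuming a default SDP-capable solver is installed. It solves the Gram-matrix lifting of the discrete maximization problem stated at the beginning of Appendix D, with \((c_w,c_f)=(1,0)\), for a method given in form (8).

```python
import numpy as np
import cvxpy as cp

def worst_case(alpha, N, L, mu):
    """Worst case of (f(w_N) - f_*) / ||w_0 - x_*||^2 for the method (8) with
    coefficients alpha[(k, j)], via the interpolation conditions of Appendix D."""
    dim = N + 2                                    # basis: w_0 - x_*, g_0, ..., g_N
    e = np.eye(dim)
    w0, g = e[0], [e[1 + i] for i in range(N + 1)]
    w = [w0]                                       # iterate coefficients, following (8)
    for k in range(1, N + 1):
        w.append(w0 * (1 - mu / L * sum(alpha[(k, j)] for j in range(k)))
                 - sum(alpha[(k, j)] / L * g[j] for j in range(k)))
    G = cp.Variable((dim, dim), PSD=True)          # Gram matrix of the basis vectors
    F = cp.Variable(N + 1)                         # F[i] = f~(w_i) - f~_*
    pts = [(w[i], g[i], F[i]) for i in range(N + 1)]
    pts.append((np.zeros(dim), np.zeros(dim), 0))  # the point x_* (w_* = 0, g_* = 0)
    cons = [w0 @ G @ w0 == 1]                      # normalization ||w_0 - x_*||^2 = 1
    for (wi, gi, fi) in pts:                       # interpolation inequalities for f~
        for (wj, gj, fj) in pts:
            if wi is not wj:
                cons.append(fi >= fj + gj @ G @ (wi - wj)
                            + (gi - gj) @ G @ (gi - gj) / (2 * (L - mu)))
    # f(w_N) - f_* = f~(w_N) - f~_* + (mu / 2) ||w_N - x_*||^2
    return cp.Problem(cp.Maximize(F[N] + mu / 2 * (w[N] @ G @ w[N])), cons).solve()

# for N = 1, the parametrizations (7) and (8) coincide, so we can reuse the step size above
print(worst_case({(1, 0): 1.4606}, N=1, L=1.0, mu=0.1))  # approximately 0.1061
```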

Fig. 1

Numerical comparison (for \(L=1\), \(\mu =0.01\)) between (i) the worst-case guarantee of the optimized method for \(\frac{f(w_k)-f_\star }{\Vert w_0-w_\star \Vert ^2}\) (in red; obtained from developments in Appendix D, and numerical examples in Appendix E.1); (ii) a lower bound on the oracle complexity for this setup (in blue; presented in [7, Corollary 3]), which corresponds to \(\frac{f(w_k)-f_\star }{\Vert w_0-w_\star \Vert ^2}\ge \mu \frac{2-\sqrt{q}}{1+\sqrt{q}}\left( 1-\sqrt{q}\right) ^{2k}\); (iii) the triple momentum method [30] (cyan); (iv) Nesterov’s fast gradient method (defined in [22, Section 2.2, “Constant Step Scheme, II”]; FGM, green), and (v) the method generated by the subspace-search elimination procedure (SSEP) from [8] (dashed, black)

1.2 Optimized methods for \(({f(w_N)-f_\star })/({f(w_0)-f_\star })\)

As in the previous section, the technique can be adapted to the criterion \((f(w_N)-f_\star )/(f(w_0)-f_\star )\); see Appendix D for details. The following step sizes were obtained by setting \(L=1\) and \(\mu =0.1\) and solving the resulting optimization problem for different values of N.

  • For a single iteration, \(N=1\), we obtain a guarantee \(\frac{f(w_1)-f_\star }{f(w_0)-f_\star } \le 0.6694\) with the corresponding step size

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.8182 \end{bmatrix},\end{aligned}$$

    which matches the known optimal step size \(2/(L+\mu )\) for this setup [5, Theorem 4.2].

  • For \(N=2\), we obtain \(\frac{f(w_2)-f_\star }{f(w_0)-f_\star } \le 0.3554\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 2.0095 &{} \\ 0.4229 &{} \,\,\,2.0095 \end{bmatrix}.\end{aligned}$$
  • For \(N=3\), we obtain \(\frac{f(w_3)-f_\star }{f(w_0)-f_\star } \le 0.1698\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9470 &{} &{} \\ 0.4599 &{} \,\,\,2.2406 &{} \\ 0.1705 &{} \,\,\,0.4599 &{} \,\,\,1.9470 \end{bmatrix}.\end{aligned}$$
  • For \(N=4\), we obtain \(\frac{f(w_4)-f_\star }{f(w_0)-f_\star } \le 0.0789\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9187 &{} &{} &{} \\ 0.4098 &{} \,\,\,2.1746 &{} &{} \\ 0.1796 &{} \,\,\,0.5147 &{} \,\,\,2.1746 &{} \\ 0.0627 &{} \,\,\,0.1796 &{} \,\,\,0.4098 &{} \,\,\,1.9187 \end{bmatrix}.\end{aligned}$$
  • Finally, for \(N=5\), we reach \(\frac{f(w_5)-f_\star }{f(w_0)-f_\star } \le 0.0365\) with

    $$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9060 &{} &{} &{} &{} \\ 0.3879 &{} \,\,\,2.1439 &{} &{} &{} \\ 0.1585 &{} \,\,\,0.4673 &{} \,\,\,2.1227 &{} &{} \\ 0.0660 &{} \,\,\,0.1945 &{} \,\,\,0.4673 &{} \,\,\,2.1439 &{} \\ 0.0224 &{} \,\,\,0.0660 &{} \,\,\,0.1585 &{} \,\,\,0.3879 &{} \,\,\,1.9060 \end{bmatrix}.\end{aligned}$$
Fig. 2

Numerical comparison (for \(L=1\), \(\mu =0.01\)) between (i) the worst-case guarantee of the optimized method for \(\frac{f(w_k)-f_\star }{f(w_0)-f_\star }\) (in red; obtained from developments in Appendix D, and numerical examples in Appendix E.2); (ii) a lower bound on the oracle complexity for this setup (in blue; computed numerically using the procedure from [7]); and (iii) a method generated by the subspace-search elimination procedure (SSEP) from [8] (dashed, black)

Note that the resulting method is again apparently less practical than ITEM, as the step sizes also critically depend on the horizon N; for example, observe again that the value of \(h_{1,0}^\star \) depends on N. Interestingly, one can observe that the corresponding step sizes are symmetric, and that the worst-case guarantees seem to behave slightly better than in the distance problem \({\Vert w_N-w_\star \Vert ^2}/{\Vert w_0-w_\star \Vert ^2}\), although their asymptotic rates have to be the same, due to the properties of strongly convex functions. Figure 2 illustrates the worst-case guarantees of the corresponding method for larger numbers of iterations and compares them to the lower bound.
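One way to make the symmetry observation precise (our reading of it) is that the reported coefficients satisfy \(h^\star _{i,j}=h^\star _{N-j,N-i}\); a quick check on the \(N=5\) coefficients listed above:

```python
N = 5
h = {(1, 0): 1.9060,
     (2, 0): 0.3879, (2, 1): 2.1439,
     (3, 0): 0.1585, (3, 1): 0.4673, (3, 2): 2.1227,
     (4, 0): 0.0660, (4, 1): 0.1945, (4, 2): 0.4673, (4, 3): 2.1439,
     (5, 0): 0.0224, (5, 1): 0.0660, (5, 2): 0.1585, (5, 3): 0.3879, (5, 4): 1.9060}
# verify h[(i, j)] == h[(N - j, N - i)] for every reported entry
print(all(abs(v - h[(N - j, N - i)]) < 1e-12 for (i, j), v in h.items()))  # True
```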

Cite this article

Taylor, A., Drori, Y. An optimal gradient method for smooth strongly convex minimization. Math. Program. 199, 557–594 (2023). https://doi.org/10.1007/s10107-022-01839-y
