Abstract
We present an optimal gradient method for smooth strongly convex optimization. The method is optimal in the sense that its worst-case bound on the distance to an optimal point exactly matches the lower bound on the oracle complexity for the class of problems, meaning that no black-box first-order method can have a better worst-case guarantee without further assumptions on the class of problems at hand. In addition, we provide a constructive recipe for obtaining the algorithmic parameters of the method and illustrate that it can be used for deriving methods for other optimality criteria as well.
Notes
The puzzled reader can verify this using basic symbolic computations. We provide a notebook for verifying the equivalence of the expressions in Sect. 4.
References
Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory Comput. 15(1), 1–32 (2019)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Cyrus, S., Hu, B., Van Scoy, B., Lessard, L.: A robust accelerated optimization algorithm for strongly convex functions. In: 2018 Annual American Control Conference (ACC), pages 1376–1381. IEEE (2018)
d’Aspremont, A., Scieur, D., Taylor, A.: Acceleration methods. Found. Trends Optim. 5(1–2), 1–245 (2021)
De Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. J. Complex. 68, 101590 (2022). https://doi.org/10.1016/j.jco.2021.101590
Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184(1), 183–220 (2020)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145, 451–482 (2014)
Gasnikov, A.V., Nesterov, Y.E.: Universal method for stochastic composite optimization problems. Comput. Math. Math. Phys. 58(1), 48–64 (2018)
Gramlich, D., Ebenbauer, C., Scherer, C.W.: Convex synthesis of accelerated gradient algorithms for optimization and saddle point problems using Lyapunov functions. Preprint arXiv:2006.09946 (2020)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)
Kim, D., Fessler, J.A.: Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. J. Optim. Theory Appl. 188, 192–219 (2021). https://doi.org/10.1007/s10957-020-01770-2
Lessard, L., Seiler, P.: Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate. In: 2020 American Control Conference (ACC), pages 119–125. IEEE (2020)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Löfberg, J.: YALMIP : A toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference (2004)
MOSEK ApS: The MOSEK optimization software. Online at http://www.mosek.com, 54 (2010)
Nemirovski, A.: Optimization II: Numerical methods for nonlinear continuous optimization. Lecture notes, http://www2.isye.gatech.edu/~nemirovs/Lect_OptII.pdf (1999)
Nemirovskii, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate O\((1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)
Nesterov, Y.: Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publishers (2004). ISBN 9781402075537
Park, C., Park, J., Ryu, E.K.: Factor-\(\sqrt{2}\) acceleration of accelerated gradient methods. Preprint arXiv:2102.07366 (2021)
Taylor, A.: Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization. PhD thesis, Université catholique de Louvain (2017)
Taylor, A., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In: Proceedings of the 2019 Conference on Learning Theory (COLT), volume 99, pages 2934–2992 (2019)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1278–1283. IEEE (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Toker, O., Ozbay, H.: On the NP-hardness of solving bilinear matrix inequalities and simultaneous stabilization with static output feedback. In: 1995 American Control Conference (ACC), pages 2525–2526. IEEE (1995)
Van Loan, C.F., Golub, G.H.: Matrix computations. Johns Hopkins University Press, Baltimore (1983)
Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res. (JMLR) 22(113), 1–34 (2021)
Zhou, K., So, A.M.-C., Cheng, J.: Boosting first-order methods by shifting objective: New schemes with faster worst case rates. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
Acknowledgements
The authors would like to thank Shuvomoy Das Gupta (https://shuvomoy.github.io/site/) for pointing out a few typos and for his helpful suggestions for improvement. We are also very grateful to two anonymous referees and an associate editor of Mathematical Programming for providing very constructive feedback on an early version of the manuscript.
A. Taylor acknowledges support from the European Research Council (grant SEQUOIA 724063). This work was funded in part by the French government under the management of the Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
Appendices
A Alternate parametrization for first-order methods
In this section, we show that any method (7) can be reparametrized as (8) (and vice versa), using the identity (9). First note that the equivalence is clear for \(k=1\), as
and hence the equivalence holds for \(k=1\). Now, assuming the equivalence holds at iteration \(k\), let us check that it holds at iteration \(k+1\); that is, assume \(w_i-x_\star =(w_0-x_\star )\left( 1-\frac{\mu }{L}\sum _{j=0}^{i-1}\alpha _{i,j}\right) -\frac{1}{L}\sum _{j=0}^{i-1}\alpha _{i,j}\nabla {\tilde{f}}(w_j)\) for \(0\le i\le k\), and compute
by reversing the order of the double sums, renaming the indices, and reordering, we get
From this last reformulation, the choice (9), that is
ensures that the coefficients of all independent terms \((w_0-x_\star )\), \(\nabla {\tilde{f}}(w_0),\ldots ,\nabla {\tilde{f}}(w_k)\) are equal in both (7) and (8), which is the desired statement. In addition, note that this change of variables is reversible.
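For small values of N, this change of variables can also be checked symbolically. The short sketch below (distinct from the notebook mentioned in the notes) assumes that (7) is the standard fixed-step form \(w_k=w_{k-1}-\tfrac{1}{L}\sum _{j=0}^{k-1}h_{k,j}\nabla f(w_j)\), which is not restated in this appendix, and recovers the \(h_{i,j}\)’s from arbitrary symbolic \(\alpha _{i,j}\)’s by matching the coefficients of the independent terms, thereby confirming that the resulting linear system is consistent and the map reversible.

```python
# A minimal symbolic sketch (not the notebook mentioned in the notes), assuming (7) is the
# standard fixed-step form  w_k = w_{k-1} - (1/L) sum_{j<k} h_{k,j} grad f(w_j),
# with grad f(w) = grad f~(w) + mu (w - x_*). The h_{i,j}'s of (7) are recovered from the
# alpha_{i,j}'s of (8) by matching the coefficients of (w_0 - x_*) and the grad f~(w_j)'s.
import sympy as sp

N = 3
L, mu = sp.symbols('L mu', positive=True)
d0 = sp.Symbol('d0')                                   # stands for w_0 - x_*
g = sp.symbols(f'g0:{N}')                              # stands for grad f~(w_0), ..., grad f~(w_{N-1})
alpha = {(i, j): sp.Symbol(f'alpha_{i}_{j}') for i in range(1, N + 1) for j in range(i)}

# Iterates in the form (8): w_i - x_* as a linear combination of d0 and the g_j's.
w = [d0]
for i in range(1, N + 1):
    w.append(d0 * (1 - mu / L * sum(alpha[i, j] for j in range(i)))
             - sum(alpha[i, j] * g[j] for j in range(i)) / L)

# Gradients of f (not f~) at the iterates.
grad_f = [g[j] + mu * w[j] for j in range(N)]

# For each i, solve the (consistent, overdetermined) linear system expressing that
# w_i = w_{i-1} - (1/L) sum_{j<i} h_{i,j} grad f(w_j), i.e., the assumed form (7).
for i in range(1, N + 1):
    h = sp.symbols(f'h_{i}_0:{i}')
    residual = sp.expand(w[i] - w[i - 1] + sum(h[j] * grad_f[j] for j in range(i)) / L)
    equations = [residual.coeff(s) for s in (d0,) + g[:i]]
    solution = sp.solve(equations, h, dict=True)[0]
    print(f'row {i}:', [sp.simplify(solution[hj]) for hj in h])
```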
B Algebraic manipulations of (dual-SDP-R)
In this section, we reformulate (dual-SDP-R) so as to allow optimizing over both the \(\alpha _{i,j}\)’s and the \(\lambda _{i,j}\)’s simultaneously. To do so, let us start by noting that
with \(S'(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\})=S(\tau ,\{\lambda _{i,j}\}, \{\alpha _{i,j}\})+{\mathbf {w}}_N{\mathbf {w}}_N^\top \), using a standard Schur complement (see, e.g., [29]). The motivation underlying this reformulation is that the lifted linear matrix inequality depends linearly on the \(\{\alpha _{N,i}\}_i\). Indeed, the coefficients of the last iteration only appear through the term \({\mathbf {w}}_N\), which is not present in \(S'\) (details below).
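The generic Schur-complement identity underlying this step can be illustrated numerically: for a symmetric matrix \(S'\) and a vector w, one has \(S'-ww^\top \succeq 0\) if and only if the matrix obtained by appending w as a last column and \((w^\top ,1)\) as a last row is positive semidefinite, and the latter is linear in w. The small check below uses random placeholder data and is not taken from the paper’s code.

```python
# A small numerical illustration (not from the paper's code) of the generic Schur-complement
# identity behind this step: for symmetric S' and a vector w,
#   S' - w w^T >= 0   <=>   [[S', w], [w^T, 1]] >= 0,
# and the lifted matrix is linear in w (hence linear in the alpha_{N,i}'s entering w_N).
import numpy as np

rng = np.random.default_rng(0)
n = 6
w = rng.standard_normal((n, 1))
B = rng.standard_normal((n, n))
S = B @ B.T                              # plays the role of the PSD matrix S(tau, {lambda}, {alpha})
S_prime = S + w @ w.T                    # S' = S + w w^T, as in the reformulation above

lifted = np.block([[S_prime, w], [w.T, np.ones((1, 1))]])

print("min eig of S = S' - w w^T :", np.linalg.eigvalsh(S).min())
print("min eig of the lifted LMI :", np.linalg.eigvalsh(lifted).min())   # also nonnegative
```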
We only consider the case \(N>1\) below. In the case \(N=1\), (6) can be solved without the following simplifications. Let us develop the expression of \(S'({\cdot })\) as follows
where we used \(\lambda _{i-1,i}+\lambda _{\star ,i}=\lambda _{i,i+1}\) (for \(i=1,\ldots ,N-2\)), and \(\lambda _{N-2,N-1}+\lambda _{\star ,N-1}=\lambda _{N-1,\star }\) to obtain the second equality. Substituting the expressions for the \({\mathbf {w}}_i\)’s, we arrive at
where we simply split each \({\mathbf {w}}_k\) into two terms: one with the contribution of \({\mathbf {w}}_0\), and the other with the contributions of the \({\mathbf {g}}_i\)’s. Although not pretty, one can observe that \(S'\) is still bilinear in \(\{\alpha _{i,j}\}\) and \(\{\lambda _{i,j}\}\). This expression could be simplified further, but this form suffices for the purposes of this work.
C The information-theoretic exact method is a solution to (Minimax-R)
Proof of Theorem 4
For readability purposes, we establish the claim without explicitly computing the optimal values of the variables \(\{\alpha _{i,j}\}\) and \(\{\beta _{i,j}\}\) for ITEM. To avoid this step, let us note that ITEM is clearly a fixed-step first-order method in the sense of Definition 2. Therefore, by Lemma 3, the method can also be written in the alternate parametrization (8), using the association \(w_i\leftarrow y_i\) (\(i=0,1,\ldots ,N-1\)) and \(w_N\leftarrow z_N\), where \(y_k\) and \(z_k\) are the sequences defined by (3). Let \(\{\alpha _{i,j}^\star \}\) denote the step sizes corresponding to ITEM written in the form (8). We proceed to show that by choosing
and setting \(\{\beta _{i,j}^{\star }\}\) in accordance with (14), we reach a feasible solution to (Minimax-R). Note that the optimality of this solution follows from the value of \(\tau ^\star \), which matches the lower complexity bound discussed in Sect. 2.3.
To establish dual feasibility, we relate (Minimax-R) to the Lagrangian of (R). That is, denoting
we have for all (F, G) as in (11), by construction,
Using the association \(w_i\leftarrow y_i\) (\(i=0,1,\ldots ,N-1\)) and \(w_N\leftarrow z_N\), as well as \(f(y_i)=\tilde{f}(y_i)+\frac{\mu }{2}\Vert y_i-x_\star \Vert ^2\), it follows from Lemma 1 (the weighted sums below are the same as those of Lemma 1, written in terms of \(\tilde{f}(\cdot )\) instead of \(f(\cdot )\) and scaled by a factor \(\frac{1}{L+\mu A_N}\); their reformulations are therefore also the same, up to this rescaling) that for \(i=1,\ldots ,N-1\)
as well as
In addition, noting that \(\phi _0=L\Vert w_0-w_\star \Vert ^2\) allows us to reach the following reformulation
where the last equality follows from \(\phi _N=(1-q)A_N\psi _{N-1}+(L+\mu A_N)\Vert z_N-w_\star \Vert ^2\). Therefore, ITEM is a solution to (Minimax-R). In more direct terms of the SDP (Minimax-R), one can verify that
the previous computations therefore imply that \(K=0\) for ITEM, and hence \(S''(\tau ,\{\lambda _{i,j}\},{\{\beta _{i,j}\}})={\mathbf {w}}_N{\mathbf {w}}_N^\top \succeq 0\). \(\square \)
D An SDP formulation for optimizing function values
In this section, we show how to adapt the methodology developed in Sect. 3 to a family of alternate design criteria, which includes \((f(w_N)-f_\star )/\Vert w_0-w_\star \Vert ^2\) and \((f(w_N)-f_\star )/(f(w_0)-f_\star )\) (for which numerical examples are provided in Sect. E.1 and Sect. E.2, respectively). The developments slightly differ from those required for optimizing \({\Vert w_N-w_\star \Vert ^2}/{\Vert w_0-w_\star \Vert ^2}\), and we decided not to present a unified version in the main text for readability purposes. In particular, the set of selected inequalities is slightly different, which alters the linearization procedure.
The criteria we deal with in this section are of the form
As the steps are essentially the same as those detailed in Sect. 3, we proceed without providing much detail. We start with the discrete version, using the set \(I=\{\star ,0,\ldots ,N\}\)
The upper bound we use is now slightly different (the selected subset of constraints is not the same as that of (R))
The corresponding SDP can be written using a similar pair (G, F)
and similar notations
with \(i=0,\ldots ,N\) and \(e_i\) being the unit vector whose ith component is equal to 1. In addition, we denote by
A dual formulation of \(\mathrm {UB}_{\mu ,L}\) is given by (we directly included the Schur complement)
with
Note that the equality constraint corresponds to
We perform some additional work on \({{\bar{S}}}'\) (whose dependence on \(\{\alpha _{i,j}\}\) is implicit through the dependence on \(\{{\mathbf {w}}_i\}\)), as before
where we used \(\lambda _{\star ,i}+\lambda _{i-1,i}=\lambda _{i,i+1}\) (for \(i=1,\ldots ,N-1\)) and \(\lambda _{\star ,N}+\lambda _{N-1,N}=1\). Now, making the dependence on the \(\alpha _{i,j}\)’s explicit again, we arrive at
and it remains to remark that the change of variables
linearizes the bilinear matrix inequality again; the resulting SDP (13) can then be solved using standard packages. Numerical results for the pairs \((c_w,c_f)=(1,0)\) and \((c_w,c_f)=(0,1)\) are provided in Sect. E.1 and Sect. E.2, respectively. A source code implementing those SDPs is provided in Sect. 4.
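Schematically, the problems obtained after this change of variables are linear SDPs that can be passed to standard conic solvers in a few lines; the numerical results of Appendix E were obtained with YALMIP and MOSEK [17, 18]. The sketch below only illustrates this structure: it uses generic placeholder data, not the actual constraint matrices of (13).

```python
# A generic, schematic cvxpy sketch (placeholder data, not the actual matrices of (13)):
# minimize a scalar tau subject to a linear matrix inequality that is affine in the
# decision variables, which is the structure obtained after the change of variables above.
import cvxpy as cp
import numpy as np

n, m = 4, 3                                        # placeholder sizes
rng = np.random.default_rng(0)
A = [rng.standard_normal((n, n)) for _ in range(m)]
A = [(a + a.T) / 2 for a in A]                     # symmetric coefficient matrices of the LMI

tau = cp.Variable()
x = cp.Variable(m, nonneg=True)                    # stands for the (lifted) multipliers
S = cp.Variable((n, n), PSD=True)                  # slack matrix enforcing the LMI

constraints = [S == tau * np.eye(n) + sum(x[i] * A[i] for i in range(m)),
               cp.sum(x) == 1]
problem = cp.Problem(cp.Minimize(tau), constraints)
problem.solve()                                    # any installed SDP-capable solver (e.g. SCS, MOSEK)
print(problem.status, tau.value)
```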
E Numerical examples
As shown in Appendix D, slight modifications of the relaxations used for obtaining (Minimax-R) allow forming tractable problems for optimizing the parameters of fixed-step methods under different optimality criteria. Although we were unable to obtain closed-form solutions to the problems arising for these alternative criteria, the resulting problems can still be approximated numerically for specific values of \(\mu \), L and N.
In the following, we provide a couple of examples that were obtained by numerically solving the first-order method design problem (Minimax-R), formulated as a linear semidefinite program using standard solvers [17, 18].
E.1 Optimized methods for \((f(w_N)-f_\star )/{\Vert w_0-w_\star \Vert ^2}\)
As a first example, we consider the criterion \(({f(w_N)-f_\star })/{\Vert w_0-w_\star \Vert ^2}\). The following list provides solutions obtained by solving the corresponding design problem for \(N=1,\ldots ,5\) with \(L=1\) and \(\mu =0.1\). The solutions are presented using the notation from (7), together with the corresponding worst-case guarantees.
-
For a single iteration, by solving the corresponding optimization problem, we obtain a method with guarantee \(\frac{f(w_1)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.1061\) and step size
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.4606 \end{bmatrix}. \end{aligned}$$This bound and the corresponding step size match the optimal step size \(h_{1,0}=\frac{q+1-\sqrt{q^2-q+1}}{q}\), see [24, Theorem 4.14].
-
For \(N=2\) iterations, we obtain \(\frac{f(w_2)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0418\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5567 &{} \\ 0.1016 &{} \,\,\,1.7016 \end{bmatrix}. \end{aligned}$$ -
For \(N=3\), we obtain \(\frac{f(w_3)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0189\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5512 &{} &{} \\ 0.1220 &{} \,\,\,1.8708 &{} \\ 0.0316 &{} \,\,\,0.2257 &{} \,\,\,1.8019 \end{bmatrix}. \end{aligned}$$ -
For \(N=4\), we obtain \(\frac{f(w_4)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0089\), with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5487 &{} &{} &{} \\ 0.1178 &{} \,\,\,1.8535 &{} &{} \\ 0.0371 &{} \,\,\,0.2685 &{} \,\,\,2.0018 &{} \\ 0.0110 &{} \,\,\,0.0794 &{} \,\,\,0.2963 &{} \,\,\,1.8497 \end{bmatrix}.\end{aligned}$$ -
Finally, for \(N=5\), we obtain \(\frac{f(w_5)-f_\star }{\Vert w_0-w_\star \Vert ^2}\le 0.0042\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5476 &{} &{} &{} &{} \\ 0.1159 &{} \,\,\,1.8454 &{} &{} &{} \\ 0.0350 &{} \,\,\,0.2551 &{} \,\,\,1.9748 &{} &{} \\ 0.0125 &{} \,\,\,0.0913 &{} \,\,\,0.3489 &{} \,\,\,2.0625 &{} \\ 0.0039 &{} \,\,\,0.0287 &{} \,\,\,0.1095 &{} \,\,\,0.3334 &{} \,\,\,1.8732 \end{bmatrix}. \end{aligned}$$
Note that when \(\mu =0\), we recover the step size policy of the OGM by Kim and Fessler [12]. When setting \(\mu >0\), we observe that the resulting optimized method is apparently less practical, as the step sizes critically depend on the horizon N; in particular, one can observe that \(h_{1,0}^\star \) varies with N.
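As a quick sanity check of the table above for \(N=5\) (and assuming, as it is not restated here, that (7) is the standard fixed-step form \(w_k=w_{k-1}-\tfrac{1}{L}\sum _{j=0}^{k-1}h_{k,j}\nabla f(w_j)\)), one can run the method on a randomly generated quadratic with \(L=1\) and \(\mu =0.1\) and verify that the reported worst-case guarantee is indeed satisfied:

```python
# A minimal sanity check (not the paper's source code) of the N = 5 step sizes above, assuming
# (7) is the standard fixed-step form  w_k = w_{k-1} - (1/L) sum_{j<k} h_{k,j} grad f(w_j).
# We run the method on a random quadratic with L = 1, mu = 0.1 (minimizer w_* = 0, f_* = 0)
# and check that f(w_5) - f_* <= 0.0042 ||w_0 - w_*||^2, as guaranteed in the worst case.
import numpy as np

L_smooth, mu = 1.0, 0.1
H = np.array([[1.5476, 0.0,    0.0,    0.0,    0.0   ],
              [0.1159, 1.8454, 0.0,    0.0,    0.0   ],
              [0.0350, 0.2551, 1.9748, 0.0,    0.0   ],
              [0.0125, 0.0913, 0.3489, 2.0625, 0.0   ],
              [0.0039, 0.0287, 0.1095, 0.3334, 1.8732]])

rng = np.random.default_rng(1)
d = 20
eigenvalues = np.concatenate(([mu, L_smooth], rng.uniform(mu, L_smooth, d - 2)))
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q = U @ np.diag(eigenvalues) @ U.T        # f(w) = 0.5 w^T Q w is L-smooth and mu-strongly convex

f = lambda w: 0.5 * w @ Q @ w
grad = lambda w: Q @ w

w0 = rng.standard_normal(d)
iterates, gradients = [w0], [grad(w0)]
for k in range(5):                        # w_{k+1} = w_k - (1/L) sum_{j<=k} h_{k+1,j} grad f(w_j)
    w_next = iterates[-1] - sum(H[k, j] * gradients[j] for j in range(k + 1)) / L_smooth
    iterates.append(w_next)
    gradients.append(grad(w_next))

ratio = f(iterates[5]) / np.linalg.norm(w0) ** 2
print(f"(f(w_5) - f_*)/||w_0 - w_*||^2 = {ratio:.2e}   (worst-case bound: 4.2e-03)")
```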
Figure 1 illustrates the behavior of the worst-case guarantee for larger values of N and compares it to the currently best known corresponding lower bound, as well as to the worst-case guarantees of TMM, Nesterov’s Fast Gradient Method (FGM) for strongly convex functions, and the methods generated with the SSEP procedure from [8]. All the worst-case guarantees are computed numerically using the corresponding performance estimation problems (see, e.g., the toolbox [26]); as a result, they are tight, in the sense that matching inputs to the algorithms attaining the bounds can be numerically constructed.
E.2 Optimized methods for \(({f(w_N)-f_\star })/({f(w_0)-f_\star })\)
As in the previous section, the technique can be adapted to the criterion \((f(w_N)-f_\star )/(f(w_0)-f_\star )\); see Appendix D for details. The following step sizes were obtained by setting \(L=1\) and \(\mu =0.1\) and solving the resulting optimization problem for different values of N.
-
For a single iteration, \(N=1\), we obtain a guarantee \(\frac{f(w_1)-f_\star }{f(w_0)-f_\star } \le 0.6694\) with the corresponding step size
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.8182 \end{bmatrix},\end{aligned}$$which matches the known optimal step size \(2/(L+\mu )\) for this setup [5, Theorem 4.2].
-
For \(N=2\), we obtain \(\frac{f(w_2)-f_\star }{f(w_0)-f_\star } \le 0.3554\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 2.0095 &{} \\ 0.4229 &{} \,\,\,2.0095 \end{bmatrix}.\end{aligned}$$ -
For \(N=3\), we obtain \(\frac{f(w_3)-f_\star }{f(w_0)-f_\star } \le 0.1698\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9470 &{} &{} \\ 0.4599 &{} \,\,\,2.2406 &{} \\ 0.1705 &{} \,\,\,0.4599 &{} \,\,\,1.9470 \end{bmatrix}.\end{aligned}$$ -
For \(N=4\), we obtain \(\frac{f(w_4)-f_\star }{f(w_0)-f_\star } \le 0.0789\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9187 &{} &{} &{} \\ 0.4098 &{} \,\,\,2.1746 &{} &{} \\ 0.1796 &{} \,\,\,0.5147 &{} \,\,\,2.1746 &{} \\ 0.0627 &{} \,\,\,0.1796 &{} \,\,\,0.4098 &{} \,\,\,1.9187 \end{bmatrix}.\end{aligned}$$ -
Finally, for \(N=5\), we reach \(\frac{f(w_5)-f_\star }{f(w_0)-f_\star } \le 0.0365\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9060 &{} &{} &{} &{} \\ 0.3879 &{} \,\,\,2.1439 &{} &{} &{} \\ 0.1585 &{} \,\,\,0.4673 &{} \,\,\,2.1227 &{} &{} \\ 0.0660 &{} \,\,\,0.1945 &{} \,\,\,0.4673 &{} \,\,\,2.1439 &{} \\ 0.0224 &{} \,\,\,0.0660 &{} \,\,\,0.1585 &{} \,\,\,0.3879 &{} \,\,\,1.9060 \end{bmatrix}.\end{aligned}$$
Note that the resulting method is again apparently less practical than ITEM, as its step sizes also critically depend on the horizon N; for example, observe again that the value of \(h_{1,0}^\star \) depends on N. Interestingly, one can observe that the corresponding step sizes are symmetric, and that the worst-case guarantees seem to behave slightly better than in the distance problem \({\Vert w_N-w_\star \Vert ^2}/{\Vert w_0-w_\star \Vert ^2}\), although their asymptotic rates have to be the same, due to the properties of strongly convex functions. Figure 2 illustrates the worst-case guarantees of the corresponding method for larger numbers of iterations, and compares them to the lower bound.
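The symmetry observation can be made explicit on the tables above: written as a lower-triangular matrix, each table is invariant under a flip across its anti-diagonal, that is, \(h_{i,j}^\star =h_{N-j,N-i}^\star \) (with rows \(i=1,\ldots ,N\) and zero-based columns \(j=0,\ldots ,i-1\)). A short check for the \(N=5\) table:

```python
# A small numpy check (not from the paper's code) of the symmetry observation above, on the
# N = 5 table: stored as a lower-triangular matrix, it is invariant under a flip across its
# anti-diagonal, i.e., h_{i,j} = h_{N-j,N-i} (rows i = 1..N, zero-based columns j = 0..i-1).
import numpy as np

H = np.array([[1.9060, 0.0,    0.0,    0.0,    0.0   ],
              [0.3879, 2.1439, 0.0,    0.0,    0.0   ],
              [0.1585, 0.4673, 2.1227, 0.0,    0.0   ],
              [0.0660, 0.1945, 0.4673, 2.1439, 0.0   ],
              [0.0224, 0.0660, 0.1585, 0.3879, 1.9060]])

anti_transpose = np.flipud(np.fliplr(H.T))     # flip of the transpose across both axes
print(np.allclose(H, anti_transpose))          # True: the step-size table is persymmetric
```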