Abstract
We present an optimal gradient method for smooth strongly convex optimization. The method is optimal in the sense that its worst-case bound on the distance to an optimal point exactly matches the lower bound on the oracle complexity for the class of problems, meaning that no black-box first-order method can have a better worst-case guarantee without further assumptions on the class of problems at hand. In addition, we provide a constructive recipe for obtaining the algorithmic parameters of the method and illustrate that it can be used for deriving methods for other optimality criteria as well.
Notes
The puzzled reader can verify this using basic symbolic computations. We provide a notebook for verifying the equivalence of the expressions in Sect. 4.
References
Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory Comput. 15(1), 1–32 (2019)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Cyrus, S., Hu, B., Van Scoy, B., Lessard, L.: A robust accelerated optimization algorithm for strongly convex functions. In: 2018 Annual American Control Conference (ACC), pages 1376–1381. IEEE (2018)
d’Aspremont, A., Scieur, D., Taylor, A.: Acceleration methods. Found. Trends Optim. 5(1–2), 1–245 (2021)
De Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. J. Complex. 68, 101590 (2022). https://doi.org/10.1016/j.jco.2021.101590
Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184(1), 183–220 (2020)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145, 451–482 (2014)
Gasnikov, A.V., Nesterov, Y.E.: Universal method for stochastic composite optimization problems. Comput. Math. Math. Phys. 58(1), 48–64 (2018)
Gramlich, D., Ebenbauer, C., Scherer, C.W.: Convex synthesis of accelerated gradient algorithms for optimization and saddle point problems using Lyapunov functions. Preprint arXiv:2006.09946 (2020)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)
Kim, D., Fessler, J.A.: Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. J. Optim. Theory Appl. 188, 192–219 (2021). https://doi.org/10.1007/s10957-020-01770-2
Lessard, L., Seiler, P.: Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate. In: 2020 American Control Conference (ACC), pages 119–125. IEEE (2020)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Löfberg, J.: YALMIP : A toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference (2004)
MOSEK ApS: The MOSEK optimization software. Online at http://www.mosek.com, 54 (2010)
Nemirovski, A.: Optimization II: Numerical methods for nonlinear continuous optimization. Lecture notes, http://www2.isye.gatech.edu/~nemirovs/Lect_OptII.pdf (1999)
Nemirovskii, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate O\((1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)
Nesterov, Y.: Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publishers (2004). ISBN 9781402075537
Park, C., Park, J., Ryu, E.K.: Factor-\(\sqrt{2}\) acceleration of accelerated gradient methods. Preprint arXiv:2102.07366 (2021)
Taylor, A.: Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization. PhD thesis, Université catholique de Louvain (2017)
Taylor, A., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In: Proceedings of the 2019 Conference on Learning Theory (COLT), volume 99, pages 2934–2992 (2019)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1278–1283. IEEE (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Toker, O., Ozbay, H.: On the NP-hardness of solving bilinear matrix inequalities and simultaneous stabilization with static output feedback. In: 1995 American Control Conference (ACC), pages 2525–2526. IEEE (1995)
Van Loan, C.F., Golub, G.H.: Matrix computations. Johns Hopkins University Press, Baltimore (1983)
Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res. (JMLR) 22(113), 1–34 (2021)
Zhou, K., So, A.M.-C., Cheng, J.: Boosting first-order methods by shifting objective: New schemes with faster worst case rates. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
Acknowledgements
The authors would like to thank Shuvomoy Das Gupta (https://shuvomoy.github.io/site/) for pointing out a few typos and for his helpful suggestions for improvement. We are also very grateful to two anonymous referees and an associate editor of Mathematical Programming for providing very constructive feedback on an early version of the manuscript.
A. Taylor acknowledges support from the European Research Council (grant SEQUOIA 724063). This work was funded in part by the French government under the management of the Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
Appendices
A Alternate parametrization for first-order methods
In this section, we show that any method (7) can be reparametrized as (8) (and vice versa), using the identity (9). First note that the equivalence is clear for \(k=1\), as
and hence the equivalence holds for \(k=1\). Now, assuming the equivalence holds at iteration \(k\), let us check that it holds at iteration \(k+1\); that is, assume \(w_i-x_\star =(w_0-x_\star )\left( 1-\frac{\mu }{L}\sum _{j=0}^{i-1}\alpha _{i,j}\right) -\frac{1}{L}\sum _{j=0}^{i-1}\alpha _{i,j}\nabla {\tilde{f}}(w_j)\) for \(0\le i\le k\), and compute
by reversing the order of the double sums, renaming the indices, and reordering, we get
From this last reformulation, the choice (9), that is
ensures that the coefficients of all independent terms \((w_0-x_\star )\), \(\nabla {\tilde{f}}(w_0),\ldots ,\nabla {\tilde{f}}(w_k)\) are equal in both (7) and (8), which is the desired statement. In addition, note that this change of variables is reversible.
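For small values of N, this change of variables can also be checked symbolically. The short sketch below (distinct from the notebook mentioned in the notes) assumes that (7) is the standard fixed-step form \(w_k=w_{k-1}-\tfrac{1}{L}\sum _{j=0}^{k-1}h_{k,j}\nabla f(w_j)\), which is not restated in this appendix, and recovers the \(h_{i,j}\)’s from arbitrary symbolic \(\alpha _{i,j}\)’s by matching the coefficients of the independent terms, thereby confirming that the resulting linear system is consistent and the map reversible.

```python
# A minimal symbolic sketch (not the notebook mentioned in the notes), assuming (7) is the
# standard fixed-step form  w_k = w_{k-1} - (1/L) sum_{j<k} h_{k,j} grad f(w_j),
# with grad f(w) = grad f~(w) + mu (w - x_*). The h_{i,j}'s of (7) are recovered from the
# alpha_{i,j}'s of (8) by matching the coefficients of (w_0 - x_*) and the grad f~(w_j)'s.
import sympy as sp

N = 3
L, mu = sp.symbols('L mu', positive=True)
d0 = sp.Symbol('d0')                                   # stands for w_0 - x_*
g = sp.symbols(f'g0:{N}')                              # stands for grad f~(w_0), ..., grad f~(w_{N-1})
alpha = {(i, j): sp.Symbol(f'alpha_{i}_{j}') for i in range(1, N + 1) for j in range(i)}

# Iterates in the form (8): w_i - x_* as a linear combination of d0 and the g_j's.
w = [d0]
for i in range(1, N + 1):
    w.append(d0 * (1 - mu / L * sum(alpha[i, j] for j in range(i)))
             - sum(alpha[i, j] * g[j] for j in range(i)) / L)

# Gradients of f (not f~) at the iterates.
grad_f = [g[j] + mu * w[j] for j in range(N)]

# For each i, solve the (consistent, overdetermined) linear system expressing that
# w_i = w_{i-1} - (1/L) sum_{j<i} h_{i,j} grad f(w_j), i.e., the assumed form (7).
for i in range(1, N + 1):
    h = sp.symbols(f'h_{i}_0:{i}')
    residual = sp.expand(w[i] - w[i - 1] + sum(h[j] * grad_f[j] for j in range(i)) / L)
    equations = [residual.coeff(s) for s in (d0,) + g[:i]]
    solution = sp.solve(equations, h, dict=True)[0]
    print(f'row {i}:', [sp.simplify(solution[hj]) for hj in h])
```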
B Algebraic manipulations of (dual-SDP-R)
In this section, we reformulate (dual-SDP-R) so as to allow optimizing over both the \(\alpha _{i,j}\)’s and the \(\lambda _{i,j}\)’s simultaneously. To do so, let us start by noting that
with \(S'(\tau ,\{\lambda _{i,j}\},\{\alpha _{i,j}\})=S(\tau ,\{\lambda _{i,j}\}, \{\alpha _{i,j}\})+{\mathbf {w}}_N{\mathbf {w}}_N^\top \), using a standard Schur complement (see, e.g., [29]). The motivation underlying this reformulation is that the lifted linear matrix inequality depends linearly on the \(\{\alpha _{N,i}\}_i\). Indeed, the coefficients of the last iteration only appear through the term \({\mathbf {w}}_N\), which is not present in \(S'\) (details below).
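The generic Schur-complement identity underlying this step can be illustrated numerically: for a symmetric matrix \(S'\) and a vector w, one has \(S'-ww^\top \succeq 0\) if and only if the matrix obtained by appending w as a last column and \((w^\top ,1)\) as a last row is positive semidefinite, and the latter is linear in w. The small check below uses random placeholder data and is not taken from the paper’s code.

```python
# A small numerical illustration (not from the paper's code) of the generic Schur-complement
# identity behind this step: for symmetric S' and a vector w,
#   S' - w w^T >= 0   <=>   [[S', w], [w^T, 1]] >= 0,
# and the lifted matrix is linear in w (hence linear in the alpha_{N,i}'s entering w_N).
import numpy as np

rng = np.random.default_rng(0)
n = 6
w = rng.standard_normal((n, 1))
B = rng.standard_normal((n, n))
S = B @ B.T                              # plays the role of the PSD matrix S(tau, {lambda}, {alpha})
S_prime = S + w @ w.T                    # S' = S + w w^T, as in the reformulation above

lifted = np.block([[S_prime, w], [w.T, np.ones((1, 1))]])

print("min eig of S = S' - w w^T :", np.linalg.eigvalsh(S).min())
print("min eig of the lifted LMI :", np.linalg.eigvalsh(lifted).min())   # also nonnegative
```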
We only consider the case \(N>1\) below. In the case \(N=1\), (6) can be solved without the following simplifications. Let us develop the expression of \(S'({\cdot })\) as follows
where we used \(\lambda _{i-1,i}+\lambda _{\star ,i}=\lambda _{i,i+1}\) (for \(i=1,\ldots ,N-2\)), and \(\lambda _{N-2,N-1}+\lambda _{\star ,N-1}=\lambda _{N-1,\star }\) to obtain the second equality. Substituting the expressions for the \({\mathbf {w}}_i\)’s, we arrive at
where we simply split each \({\mathbf {w}}_k\) into two terms: one with the contribution of \({\mathbf {w}}_0\), and the other with the contributions of the \({\mathbf {g}}_i\)’s. Although not pretty, one can observe that \(S'\) is still bilinear in \(\{\alpha _{i,j}\}\) and \(\{\lambda _{i,j}\}\). This expression could be simplified further, but this form suffices for the purposes of this work.
C The information-theoretic exact method is a solution to (Minimax-R)
Proof of Theorem 4
For readability purposes, we establish the claim without explicitly computing the optimal values of the variables \(\{\alpha _{i,j}\}\) and \(\{\beta _{i,j}\}\) for ITEM. To avoid this step, let us note that ITEM is clearly a fixed-step first-order method in the sense of Definition 2. Therefore, by Lemma 3, the method can also be written in the alternate parametrization (8), using the association \(w_i\leftarrow y_i\) (\(i=0,1,\ldots ,N-1\)) and \(w_N\leftarrow z_N\), where \(y_k\) and \(z_k\) are the sequences defined by (3). Let \(\{\alpha _{i,j}^\star \}\) denote the step sizes corresponding to ITEM written in the form (8). We proceed to show that by choosing
and setting \(\{\beta _{i,j}^{\star }\}\) in accordance with (14), we reach a feasible solution to (Minimax-R). Note that the optimality of this solution follows from the value of \(\tau ^\star \), which matches the lower complexity bound discussed in Sect. 2.3.
To establish dual feasibility, we relate (Minimax-R) to the Lagrangian of (R). That is, denoting
we have for all (F, G) as in (11), by construction,
Using the association \(w_i\leftarrow y_i\) (\(i=0,1,\ldots ,N-1\)) and \(w_N\leftarrow z_N\), as well as \(f(y_i)=\tilde{f}(y_i)+\frac{\mu }{2}\Vert y_i-x_\star \Vert ^2\), it follows from Lemma 1 (the weighted sums below are the same as those of Lemma 1, written in terms of \(\tilde{f}(\cdot )\) instead of \(f(\cdot )\) and scaled by a factor \(\frac{1}{L+\mu A_N}\); their reformulations are therefore also the same, up to this rescaling) that for \(i=1,\ldots ,N-1\)
as well as
In addition, noting that \(\phi _0=L\Vert w_0-w_\star \Vert ^2\) allows us to reach the following reformulation
where the last equality follows from \(\phi _N=(1-q)A_N\psi _{N-1}+(L+\mu A_N)\Vert z_N-w_\star \Vert ^2\). Therefore, ITEM is a solution to (Minimax-R). In more direct terms of the SDP (Minimax-R), one can verify that
the previous computations therefore imply that \(K=0\) for ITEM, and hence \(S''(\tau ,\{\lambda _{i,j}\},{\{\beta _{i,j}\}})={\mathbf {w}}_N{\mathbf {w}}_N^\top \succeq 0\). \(\square \)
D An SDP formulation for optimizing function values
In this section, we show how to adapt the methodology developed in Sect. 3 to a family of alternate design criteria, which includes \((f(w_N)-f_\star )/\Vert w_0-w_\star \Vert ^2\) and \((f(w_N)-f_\star )/(f(w_0)-f_\star )\) (for which numerical examples are provided in Sect. E.1 and Sect. E.2, respectively). The developments slightly differ from those required for optimizing \({\Vert w_N-w_\star \Vert ^2}/{\Vert w_0-w_\star \Vert ^2}\), and we decided not to present a unified version in the main text for readability purposes. In particular, the set of selected inequalities is slightly different, which alters the linearization procedure.
The criteria we deal with in this section are of the form
As the steps are essentially the same as those detailed in Sect. 3, we proceed without providing much detail. We start with the discrete version, using the set \(I=\{\star ,0,\ldots ,N\}\)
The upper bound we use is now slightly different (the selected subset of constraints is not the same as that of (R))
The corresponding SDP can be written using a similar pair (G, F)
and similar notations
with \(i=0,\ldots ,N\) and \(e_i\) being the unit vector whose ith component is equal to 1. In addition, we denote by
A dual formulation of \(\mathrm {UB}_{\mu ,L}\) is given by (we directly included the Schur complement)
with
Note that the equality constraint corresponds to
We perform some additional work on \({{\bar{S}}}'\) (whose dependence on \(\{\alpha _{i,j}\}\) is implicit through the dependence on \(\{{\mathbf {w}}_i\}\)), as before
where we used \(\lambda _{\star ,i}+\lambda _{i-1,i}=\lambda _{i,i+1}\) (for \(i=1,\ldots ,N-1\)) and \(\lambda _{\star ,N}+\lambda _{N-1,N}=1\). Now, making the dependence on the \(\alpha _{i,j}\)’s explicit again, we arrive at
and it remains to remark that the change of variables
linearizes the bilinear matrix inequality again; the resulting SDP (13) can then be solved using standard packages. Numerical results for the pairs \((c_w,c_f)=(1,0)\) and \((c_w,c_f)=(0,1)\) are provided in Sect. E.1 and Sect. E.2, respectively. A source code implementing those SDPs is provided in Sect. 4.
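Schematically, the problems obtained after this change of variables are linear SDPs that can be passed to standard conic solvers in a few lines; the numerical results of Appendix E were obtained with YALMIP and MOSEK [17, 18]. The sketch below only illustrates this structure: it uses generic placeholder data, not the actual constraint matrices of (13).

```python
# A generic, schematic cvxpy sketch (placeholder data, not the actual matrices of (13)):
# minimize a scalar tau subject to a linear matrix inequality that is affine in the
# decision variables, which is the structure obtained after the change of variables above.
import cvxpy as cp
import numpy as np

n, m = 4, 3                                        # placeholder sizes
rng = np.random.default_rng(0)
A = [rng.standard_normal((n, n)) for _ in range(m)]
A = [(a + a.T) / 2 for a in A]                     # symmetric coefficient matrices of the LMI

tau = cp.Variable()
x = cp.Variable(m, nonneg=True)                    # stands for the (lifted) multipliers
S = cp.Variable((n, n), PSD=True)                  # slack matrix enforcing the LMI

constraints = [S == tau * np.eye(n) + sum(x[i] * A[i] for i in range(m)),
               cp.sum(x) == 1]
problem = cp.Problem(cp.Minimize(tau), constraints)
problem.solve()                                    # any installed SDP-capable solver (e.g. SCS, MOSEK)
print(problem.status, tau.value)
```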
E Numerical examples
As shown in Appendix D, slight modifications of the relaxations used for obtaining (Minimax-R) allow forming tractable problems for optimizing the parameters of fixed-step methods under different optimality criteria. Although we were unable to obtain closed-form solutions to the problems arising for these alternative criteria, the resulting problems can still be approximated numerically for specific values of \(\mu \), L and N.
In the following, we provide a couple of examples that were obtained by numerically solving the first-order method design problem (Minimax-R), formulated as a linear semidefinite program using standard solvers [17, 18].
E.1 Optimized methods for \((f(w_N)-f_\star )/{\Vert w_0-w_\star \Vert ^2}\)
As a first example, we consider the criterion \(({f(w_N)-f_\star })/{\Vert w_0-w_\star \Vert ^2}\). The following list provides solutions obtained by solving the corresponding design problem for \(N=1,\ldots ,5\) with \(L=1\) and \(\mu =0.1\). The solutions are presented using the notation from (7), together with the corresponding worst-case guarantees.
-
For a single iteration, by solving the corresponding optimization problem, we obtain a method with guarantee \(\frac{f(w_1)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.1061\) and step size
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.4606 \end{bmatrix}. \end{aligned}$$This bound and the corresponding step size match the optimal step size \(h_{1,0}=\frac{q+1-\sqrt{q^2-q+1}}{q}\), see [24, Theorem 4.14].
-
For \(N=2\) iterations, we obtain \(\frac{f(w_2)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0418\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5567 &{} \\ 0.1016 &{} \,\,\,1.7016 \end{bmatrix}. \end{aligned}$$ -
For \(N=3\), we obtain \(\frac{f(w_3)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0189\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5512 &{} &{} \\ 0.1220 &{} \,\,\,1.8708 &{} \\ 0.0316 &{} \,\,\,0.2257 &{} \,\,\,1.8019 \end{bmatrix}. \end{aligned}$$ -
For \(N=4\), we obtain \(\frac{f(w_4)-f_\star }{\Vert w_0-w_\star \Vert ^2} \le 0.0089\), with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5487 &{} &{} &{} \\ 0.1178 &{} \,\,\,1.8535 &{} &{} \\ 0.0371 &{} \,\,\,0.2685 &{} \,\,\,2.0018 &{} \\ 0.0110 &{} \,\,\,0.0794 &{} \,\,\,0.2963 &{} \,\,\,1.8497 \end{bmatrix}.\end{aligned}$$ -
Finally, for \(N=5\), we obtain \(\frac{f(w_5)-f_\star }{\Vert w_0-w_\star \Vert ^2}\le 0.0042\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.5476 &{} &{} &{} &{} \\ 0.1159 &{} \,\,\,1.8454 &{} &{} &{} \\ 0.0350 &{} \,\,\,0.2551 &{} \,\,\,1.9748 &{} &{} \\ 0.0125 &{} \,\,\,0.0913 &{} \,\,\,0.3489 &{} \,\,\,2.0625 &{} \\ 0.0039 &{} \,\,\,0.0287 &{} \,\,\,0.1095 &{} \,\,\,0.3334 &{} \,\,\,1.8732 \end{bmatrix}. \end{aligned}$$
Note that when \(\mu =0\), we recover the step size policy of the OGM by Kim and Fessler [12]. When setting \(\mu >0\), we observe that the resulting optimized method is apparently less practical, as the step sizes critically depend on the horizon N; in particular, one can observe that \(h_{1,0}^\star \) varies with N.
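As a quick sanity check of the table above for \(N=5\) (and assuming, as it is not restated here, that (7) is the standard fixed-step form \(w_k=w_{k-1}-\tfrac{1}{L}\sum _{j=0}^{k-1}h_{k,j}\nabla f(w_j)\)), one can run the method on a randomly generated quadratic with \(L=1\) and \(\mu =0.1\) and verify that the reported worst-case guarantee is indeed satisfied:

```python
# A minimal sanity check (not the paper's source code) of the N = 5 step sizes above, assuming
# (7) is the standard fixed-step form  w_k = w_{k-1} - (1/L) sum_{j<k} h_{k,j} grad f(w_j).
# We run the method on a random quadratic with L = 1, mu = 0.1 (minimizer w_* = 0, f_* = 0)
# and check that f(w_5) - f_* <= 0.0042 ||w_0 - w_*||^2, as guaranteed in the worst case.
import numpy as np

L_smooth, mu = 1.0, 0.1
H = np.array([[1.5476, 0.0,    0.0,    0.0,    0.0   ],
              [0.1159, 1.8454, 0.0,    0.0,    0.0   ],
              [0.0350, 0.2551, 1.9748, 0.0,    0.0   ],
              [0.0125, 0.0913, 0.3489, 2.0625, 0.0   ],
              [0.0039, 0.0287, 0.1095, 0.3334, 1.8732]])

rng = np.random.default_rng(1)
d = 20
eigenvalues = np.concatenate(([mu, L_smooth], rng.uniform(mu, L_smooth, d - 2)))
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q = U @ np.diag(eigenvalues) @ U.T        # f(w) = 0.5 w^T Q w is L-smooth and mu-strongly convex

f = lambda w: 0.5 * w @ Q @ w
grad = lambda w: Q @ w

w0 = rng.standard_normal(d)
iterates, gradients = [w0], [grad(w0)]
for k in range(5):                        # w_{k+1} = w_k - (1/L) sum_{j<=k} h_{k+1,j} grad f(w_j)
    w_next = iterates[-1] - sum(H[k, j] * gradients[j] for j in range(k + 1)) / L_smooth
    iterates.append(w_next)
    gradients.append(grad(w_next))

ratio = f(iterates[5]) / np.linalg.norm(w0) ** 2
print(f"(f(w_5) - f_*)/||w_0 - w_*||^2 = {ratio:.2e}   (worst-case bound: 4.2e-03)")
```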
Figure 1 illustrates the behavior of the worst-case guarantee for larger values of N and compares it to the currently best known corresponding lower bound, as well as to the worst-case guarantees of TMM, Nesterov’s Fast Gradient Method (FGM) for strongly convex functions, and the methods generated with the SSEP procedure from [8]. All the worst-case guarantees are computed numerically using the corresponding performance estimation problems (see, e.g., the toolbox [26]); as a result, they are tight, in the sense that matching inputs to the algorithms attaining the bounds can be numerically constructed.
E.2 Optimized methods for \(({f(w_N)-f_\star })/({f(w_0)-f_\star })\)
As in the previous section, the technique can be adapted to the criterion \((f(w_N)-f_\star )/(f(w_0)-f_\star )\); see Appendix D for details. The following step sizes were obtained by setting \(L=1\) and \(\mu =0.1\) and solving the resulting optimization problem for different values of N.
-
For a single iteration, \(N=1\), we obtain a guarantee \(\frac{f(w_1)-f_\star }{f(w_0)-f_\star } \le 0.6694\) with the corresponding step size
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.8182 \end{bmatrix},\end{aligned}$$which matches the known optimal step size \(2/(L+\mu )\) for this setup [5, Theorem 4.2].
-
For \(N=2\), we obtain \(\frac{f(w_2)-f_\star }{f(w_0)-f_\star } \le 0.3554\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 2.0095 &{} \\ 0.4229 &{} \,\,\,2.0095 \end{bmatrix}.\end{aligned}$$ -
For \(N=3\), we obtain \(\frac{f(w_3)-f_\star }{f(w_0)-f_\star } \le 0.1698\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9470 &{} &{} \\ 0.4599 &{} \,\,\,2.2406 &{} \\ 0.1705 &{} \,\,\,0.4599 &{} \,\,\,1.9470 \end{bmatrix}.\end{aligned}$$ -
For \(N=4\), we obtain \(\frac{f(w_4)-f_\star }{f(w_0)-f_\star } \le 0.0789\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9187 &{} &{} &{} \\ 0.4098 &{} \,\,\,2.1746 &{} &{} \\ 0.1796 &{} \,\,\,0.5147 &{} \,\,\,2.1746 &{} \\ 0.0627 &{} \,\,\,0.1796 &{} \,\,\,0.4098 &{} \,\,\,1.9187 \end{bmatrix}.\end{aligned}$$ -
Finally, for \(N=5\), we reach \(\frac{f(w_5)-f_\star }{f(w_0)-f_\star } \le 0.0365\) with
$$\begin{aligned}{}[{h_{i,j}^\star }]=\begin{bmatrix} 1.9060 &{} &{} &{} &{} \\ 0.3879 &{} \,\,\,2.1439 &{} &{} &{} \\ 0.1585 &{} \,\,\,0.4673 &{} \,\,\,2.1227 &{} &{} \\ 0.0660 &{} \,\,\,0.1945 &{} \,\,\,0.4673 &{} \,\,\,2.1439 &{} \\ 0.0224 &{} \,\,\,0.0660 &{} \,\,\,0.1585 &{} \,\,\,0.3879 &{} \,\,\,1.9060 \end{bmatrix}.\end{aligned}$$
Note that the resulting method is again apparently less practical than ITEM, as its step sizes also critically depend on the horizon N; for example, observe again that the value of \(h_{1,0}^\star \) depends on N. Interestingly, one can observe that the corresponding step sizes are symmetric, and that the worst-case guarantees seem to behave slightly better than in the distance problem \({\Vert w_N-w_\star \Vert ^2}/{\Vert w_0-w_\star \Vert ^2}\), although their asymptotic rates have to be the same, due to the properties of strongly convex functions. Figure 2 illustrates the worst-case guarantees of the corresponding method for larger numbers of iterations, and compares them to the lower bound.
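The symmetry observation can be made explicit on the tables above: written as a lower-triangular matrix, each table is invariant under a flip across its anti-diagonal, that is, \(h_{i,j}^\star =h_{N-j,N-i}^\star \) (with rows \(i=1,\ldots ,N\) and zero-based columns \(j=0,\ldots ,i-1\)). A short check for the \(N=5\) table:

```python
# A small numpy check (not from the paper's code) of the symmetry observation above, on the
# N = 5 table: stored as a lower-triangular matrix, it is invariant under a flip across its
# anti-diagonal, i.e., h_{i,j} = h_{N-j,N-i} (rows i = 1..N, zero-based columns j = 0..i-1).
import numpy as np

H = np.array([[1.9060, 0.0,    0.0,    0.0,    0.0   ],
              [0.3879, 2.1439, 0.0,    0.0,    0.0   ],
              [0.1585, 0.4673, 2.1227, 0.0,    0.0   ],
              [0.0660, 0.1945, 0.4673, 2.1439, 0.0   ],
              [0.0224, 0.0660, 0.1585, 0.3879, 1.9060]])

anti_transpose = np.flipud(np.fliplr(H.T))     # flip of the transpose across both axes
print(np.allclose(H, anti_transpose))          # True: the step-size table is persymmetric
```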