Abstract
We consider solving nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We assume that the objective value, gradient, and Hessian are inaccessible, while stochastic estimates of them can be computed, for example, by subsampling. We propose a stochastic algorithm based on sequential quadratic programming (SQP) that uses a differentiable exact augmented Lagrangian as the merit function. To motivate our algorithm design, we first revisit and simplify an old SQP method of Lucidi (J. Optim. Theory Appl. 67(2): 227–245, 1990) developed for solving deterministic problems, which serves as the skeleton of our stochastic algorithm. Based on the simplified deterministic algorithm, we then propose a non-adaptive SQP method for dealing with a stochastic objective, where the gradient and Hessian are replaced by stochastic estimates but the stepsizes are deterministic and prespecified. Finally, we incorporate a recent stochastic line search procedure of Paquette and Scheinberg (SIAM J. Optim. 30(1): 349–376, 2020) into the non-adaptive stochastic SQP to adaptively select the random stepsizes, which leads to an adaptive stochastic SQP method. Global “almost sure” convergence is established for both the non-adaptive and adaptive SQP methods. Numerical experiments on nonlinear problems in the CUTEst test set demonstrate the superiority of the adaptive algorithm.
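To make the algorithmic setting concrete, the following is a minimal sketch (in Python/NumPy) of one non-adaptive stochastic SQP iteration of the kind described above: the objective gradient and Hessian are replaced by subsampled estimates, the Newton-KKT system built from the deterministic constraints is solved for a primal-dual direction, and a prespecified stepsize is applied. The interface (`sample_grad`, `sample_hess`, the constraint callbacks, and the data-set size) is a hypothetical placeholder, not the paper's implementation.

```python
import numpy as np

def stochastic_sqp_step(x, lam, c, G, sample_grad, sample_hess, alpha, batch, n_data=1000):
    """One schematic non-adaptive stochastic SQP step for min f(x) s.t. c(x) = 0.

    Hypothetical interface: sample_grad(x, idx) / sample_hess(x, idx) return
    subsampled estimates of the objective gradient / Lagrangian Hessian; c(x) and
    G(x) are the deterministic constraint value and Jacobian.
    """
    idx_g = np.random.choice(n_data, size=batch, replace=False)  # subsample for the gradient
    idx_H = np.random.choice(n_data, size=batch, replace=False)  # independent subsample for the Hessian
    g_bar = sample_grad(x, idx_g)          # stochastic objective gradient estimate
    B_bar = sample_hess(x, idx_H)          # stochastic Hessian estimate (in practice modified
                                           # so that its reduced Hessian is positive definite)
    Gx, cx = G(x), c(x)
    n, m = x.size, cx.size
    # Newton-KKT system for the primal-dual direction (Delta x, Delta lambda)
    KKT = np.block([[B_bar, Gx.T], [Gx, np.zeros((m, m))]])
    rhs = -np.concatenate([g_bar + Gx.T @ lam, cx])
    step = np.linalg.solve(KKT, rhs)
    dx, dlam = step[:n], step[n:]
    # Non-adaptive variant: apply a prespecified stepsize alpha_k
    return x + alpha * dx, lam + alpha * dlam
```

The adaptive variant replaces the fixed `alpha` with a stochastic line search on the estimated augmented Lagrangian merit function.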
Notes
The implicit cost of deriving the sequence of Lipschitz constants in [3] is not counted here; doing so, however, requires non-negligible evaluations of the objective and constraints.
References
Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24(3), 1238–1264 (2014). https://doi.org/10.1137/130915984
Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software 35(4), 661–680 (2020). https://doi.org/10.1080/10556788.2020.1725751
Berahas, A.S., Curtis, F.E., Robinson, D., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31(2), 1352–1379 (2021). https://doi.org/10.1137/20m1354556
Bertsekas, D.: Constrained Optimization and Lagrange Multiplier Methods. Elsevier, Belmont, Mass, (1982). https://doi.org/10.1016/c2013-0-10366-2
Bertsekas, D.P.: Projected Newton methods for optimization problems with simple constraints. SIAM J. Control. Optim. 20(2), 221–246 (1982). https://doi.org/10.1137/0320018
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization 1(2), 92–119 (2019). https://doi.org/10.1287/ijoo.2019.0016
Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312–3343 (2018). https://doi.org/10.1137/17m1154679
Bollapragada, R., Byrd, R.H., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39(2), 545–578 (2018). https://doi.org/10.1093/imanum/dry009
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018). https://doi.org/10.1137/16m1080173
Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2017). https://doi.org/10.1007/s10107-017-1137-4
Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Math. Program. 169(2), 447–487 (2017). https://doi.org/10.1007/s10107-017-1141-8
Curtis, F.E., Shi, R.: A fully stochastic second-order trust region method. Optimization Methods and Software, 1–34 (2020). https://doi.org/10.1080/10556788.2020.1852403
Curtis, F.E., Scheinberg, K., Shi, R.: A stochastic trust region algorithm based on careful step normalization. INFORMS Journal on Optimization 1(3), 200–220 (2019). https://doi.org/10.1287/ijoo.2018.0010
De, S., Yadav, A., Jacobs, D., Goldstein, T.: Automated Inference with Adaptive Batches. PMLR, Fort Lauderdale, FL, USA, Proceedings of Machine Learning Research, 54, 1504–1513 (2017). URL http://proceedings.mlr.press/v54/de17a.html
Dupačová, J., Wets, R.: Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. Ann. Stat. 16(4), 1517–1549 (1988). https://doi.org/10.1214/aos/1176351052
Durrett, R.: Probability: Theory and Examples, 5th edn., vol 49. Cambridge University Press (2019). https://doi.org/10.1017/9781108591034
Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1380–A1405 (2012). https://doi.org/10.1137/110830629
Gallager, R.G.: Stochastic Processes. Cambridge University Press (2013). https://doi.org/10.1017/cbo9781139626514
Gill, P.E., Wong, E.: Sequential quadratic programming methods. In: Mixed Integer Nonlinear Programming, IMA Vol. Math. Appl., vol 154, Springer New York, 147–224 (2011). https://doi.org/10.1007/978-1-4614-1927-3_6
Gill, P.E., Murray, W., Saunders, M.A.: SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Rev. 47(1), 99–131 (2005). https://doi.org/10.1137/s0036144504446096
Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2014). https://doi.org/10.1007/s10589-014-9687-3
Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2017). https://doi.org/10.1093/imanum/drx043
Krejić, N., Krklec, N.: Line search methods with variable sample size for unconstrained optimization. J. Comput. Appl. Math. 245, 213–231 (2013). https://doi.org/10.1016/j.cam.2012.12.020
Lucidi, S.: Recursive quadratic programming algorithm that uses an exact augmented lagrangian function. J. Optim. Theory Appl. 67(2), 227–245 (1990). https://doi.org/10.1007/bf00940474
Maratos, N.: Exact penalty function algorithms for finite dimensional and control optimization problems. Doctoral dissertation (1978). URL http://hdl.handle.net/10044/1/7283
Nagaraj, N.K., Fuller, W.A.: Estimation of the parameters of linear time series models subject to nonlinear restrictions. Ann. Stat. 19(3), 1143–1154 (1991). https://doi.org/10.1214/aos/1176348242
Nandwani, Y., Pathak, A., Mausam, Singla, P.: A primal dual formulation for deep learning with constraints. In: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 12157–12168 (2019). URL http://papers.nips.cc/paper/9385-a-primal-dual-formulation-for-deep-learning-with-constraints.pdf
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009). https://doi.org/10.1137/070704277
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering, Springer New York (2006). https://doi.org/10.1007/978-0-387-40065-5
Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30(1), 349–376 (2020). https://doi.org/10.1137/18m1216250
Pillo, G.D.: Exact penalty methods. In: Algorithms for Continuous Optimization, NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., vol 434, Springer Netherlands, 209–253 (1994). https://doi.org/10.1007/978-94-009-0369-2_8
Pillo, G.D., Grippo, L.: A new class of augmented Lagrangians in nonlinear programming. SIAM J. Control. Optim. 17(5), 618–628 (1979). https://doi.org/10.1137/0317044
Pillo, G.D., Grippo, L., Lampariello, F.: A method for solving equality constrained optimization problems by unconstrained minimization. In: Optimization Techniques, Springer-Verlag, Lecture Notes in Control and Information Sci. 23, 96–105 (1980). https://doi.org/10.1007/bfb0006592
Prékopa, A.: Contributions to the theory of stochastic programming. Math. Program. 4(1), 202–221 (1973). https://doi.org/10.1007/bf01584661
Ravi, S.N., Dinh, T., Lokhande, V.S., Singh, V.: Explicitly imposing constraints in deep networks via conditional gradients gives improved generalization and faster convergence. In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), vol 33, 4772–4779 (2019), https://doi.org/10.1609/aaai.v33i01.33014772
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 174(1–2), 293–326 (2018). https://doi.org/10.1007/s10107-018-1346-5
di Serafino, D., Krejić, N., Jerinkić, N.K., Viola, M.: LSOS: Line-search second-order stochastic optimization methods (2020). arXiv preprint arXiv:2007.15966
Shapiro, A.: On the asymptotics of constrained local $m$-estimators. Ann. Stat. 28(3), 948–960 (2000). https://doi.org/10.1214/aos/1015952006
Siqueira, A.S., Orban, D.: CUTEst.jl (2020). https://github.com/JuliaSmoothOptimizers/CUTEst.jl, https://doi.org/10.5281/ZENODO.1188851
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, 2899–2908 (2018). https://doi.org/10.5555/3327144.3327213
Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2011). https://doi.org/10.1007/s10208-011-9099-z
Tropp, J.A.: An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8(1-2), 1–230 (2015). https://doi.org/10.1561/2200000048
Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017). https://doi.org/10.1137/15m1053141
Wets, R.: Stochastic programming: Solution techniques and approximation schemes. In: Mathematical Programming The State of the Art, Springer Berlin Heidelberg, 566–603 (1983). https://doi.org/10.1007/978-3-642-68874-4_22
Acknowledgements
We would like to thank the Associate Editor and two anonymous reviewers for their helpful and instructive comments. Sen Na would like to thank Nick Gould for helping with the implementation of CUTEst in Julia. This material was completed in part with resources provided by the University of Chicago Research Computing Center. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347 and by NSF through award CNS-1545046.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Government License: The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan.
Appendices
Proofs of Section 3
1.1 Proof of Lemma 1
The expectation is taken over the randomness of \(\xi _g^k\) and \(\xi _H^k\); that is, the k-th iterate \(({{\varvec{x}}}_k, {\varvec{\lambda }}_k)\) is regarded as fixed. By the unbiasedness condition in Assumption 2, we have
and
where the fourth equality is due to the independence between \(\xi _g^k\) and \(\xi _H^k\). Moreover, we have
We now bound \(\Vert {\mathcal {K}}\Vert \). Suppose \(G_k^T = Y_kE_k\) where \(Y_k\) has orthonormal columns spanning \(\text {Image}(G_k^T)\) and \(E_k\) is a square matrix, and suppose \(Z_k\) has orthonormal columns spanning \(\text {Ker}(G_k)\). By Assumption 1, we know \(E_k\) is nonsingular and \(Z_k^TB_kZ_k \succeq \gamma _{RH} I\). By directly verifying \(\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) {\mathcal {K}}= I\), one can show
Noting that \(\Vert E_k^{-1}\Vert ^2 = \Vert E_k^{-1}(E_k^{-1})^T\Vert = \Vert (G_kG_k^T)^{-1}\Vert {\mathop {\le }\limits ^{(12)}} 1/\kappa _{1,G}\) and \(\Vert (Z_k^TB_kZ_k)^{-1}\Vert \le 1/\gamma _{RH}\), we can bound \(\Vert {\mathcal {K}}\Vert \) by summing the spectral norms of the blocks, and get
Since \(\kappa _B \ge 1\ge \gamma _{RH}\vee \kappa _{1,G}\), we can simplify the bound to
Plugging the above inequality into (A.1), we have
Similarly, we can bound \({\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[\Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2]\) as follows.
For the last term, we have
where the second equality is due to the independence between \(\xi _g^k\) and \(\xi _H^k\). We note that
Taking expectation over \(\xi _H^k\) on both sides and using (11), we have
Combining (A.5) with (A.4) and (A.1), we obtain
Plugging the above inequality into (A.3),
Combining (A.6) with (A.2), we can define
and have
This completes the proof.
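As a small sanity check on the linear-algebra facts used in the proof above (the factorization \(G_k^T = Y_kE_k\), the null-space basis \(Z_k\), and the identity \(\Vert E_k^{-1}\Vert ^2 = \Vert (G_kG_k^T)^{-1}\Vert \)), the following NumPy snippet verifies them on a random instance. The matrices here are illustrative placeholders, and for simplicity \(B\) is taken fully positive definite rather than only reduced-Hessian positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
G = rng.standard_normal((m, n))                            # constraint Jacobian with full row rank
B = rng.standard_normal((n, n)); B = B @ B.T + np.eye(n)   # illustrative B (here fully positive definite)

# Decomposition used above: G^T = Y E with Y having orthonormal columns spanning Image(G^T),
# and Z having orthonormal columns spanning Ker(G).
Y, E = np.linalg.qr(G.T)                                   # thin QR factorization
_, _, Vt = np.linalg.svd(G)
Z = Vt[m:].T                                               # orthonormal basis of the null space of G

# Identity used in the proof: ||E^{-1}||^2 = ||(G G^T)^{-1}||  (spectral norms)
lhs = np.linalg.norm(np.linalg.inv(E), 2) ** 2
rhs = np.linalg.norm(np.linalg.inv(G @ G.T), 2)
assert np.isclose(lhs, rhs)

# With G of full row rank and Z^T B Z positive definite, the KKT matrix is invertible.
KKT = np.block([[B, G.T], [G, np.zeros((m, m))]])
K = np.linalg.inv(KKT)
print("||K|| =", np.linalg.norm(K, 2),
      " lambda_min(Z^T B Z) =", np.linalg.eigvalsh(Z.T @ B @ Z).min())
```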
1.2 Proof of Theorem 1
We require the following lemmas.
Lemma 10
Let \((\varDelta {{\varvec{x}}}_k, \varDelta {\varvec{\lambda }}_k)\) be the solution of (5). Then, for any \(\mu , \nu \),
Proof
We have
This completes the proof. \(\square \)
Lemma 11
Under Assumption 1, we have
Proof
By (5), we know
This completes the proof. \(\square \)
We are now ready to prove Theorem 1. We note that
Since \(\kappa _{2,G}\wedge \kappa _B \ge 1\ge \kappa _{1,G}\),
Plugging the above inequality into (15), we have
Case 1: constant stepsize. If \(\alpha _k = \alpha \), we take the full expectation of (A.8), sum over \(k = 0,\ldots ,K\), and obtain
Rearranging the above inequality leads to
Case 2: decaying stepsize. Let us define two sequences
From (A.8), we know
Since \(e_k \ge 0\), \(\{s_k -\min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu }\}_k\) is a nonnegative supermartingale. By [17, Theorem 4.2.12], we have \(s_k\rightarrow s\) almost surely for some random variable s with \({\mathbb {E}}[s]\le s_0<\infty \). Therefore, \({\mathbb {E}}[\sum _{k=0}^{\infty }e_k] = \sum _{k=0}^{\infty }{\mathbb {E}}\left[ e_k\right] \le \sum _{k=0}^{\infty }\left( {\mathbb {E}}[s_k] - {\mathbb {E}}[s_{k+1}]\right) <\infty \), where the first equality is due to the Fubini–Tonelli theorem [17, Theorem 1.7.2]. The above display implies that \(\sum _{k=0}^{\infty }e_k<\infty \) almost surely. Moreover, since \(\sum _{k=0}^{\infty }\alpha _k = \infty \), we obtain \(\sum _{k=0}^{\infty }e_k/\sum _{k=0}^{\infty }\alpha _k = 0\) and hence \(\liminf _{k\rightarrow \infty }\Vert \nabla {\mathcal {L}}_k\Vert = 0\) almost surely. This completes the proof.
Proofs of Section 4
1.1 Proof of Lemma 3
Let \(P_{\xi _f^k}(\cdot ) = P(\cdot \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k,{{\bar{\varDelta }}}{{\varvec{x}}}_k, {{\bar{\varDelta }}}{\varvec{\lambda }}_k)\) be the conditional probability over randomness in \(\xi _f^k\). We have that
Let us denote
then the above display shows that \(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k |\le m_k\) holds provided
The above condition is further implied by requiring
Thus, we apply the Bernstein inequality [43, Theorem 6.1.1] and obtain that if
then
Under (B.2), we also have \(P_{\xi _f^k}(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^{s_k}|> m_k )\le p_f/2\), which implies that condition (23) holds. As for condition (24), we apply (B.1) to obtain
where the last inequality applies the variance bound for the sample average, that is, \({\mathbb {E}}[|{{\bar{f}}}_k - f_k|^2] = {\mathbb {E}}[|f({{\varvec{x}}}_k; \xi ) - f_k|^2]/|\xi _f^k| \le \varOmega _0^2/|\xi _f^k|\) (and similarly for \({\mathbb {E}}[\Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert ^2]\)). The same calculation yields the same bound for \(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^{s_k} |^2\). Thus, condition (24) holds if
Combining (B.2) and (B.3) and taking the maximum on the right-hand side, conditions (23) and (24) are satisfied simultaneously when
with \(C_f = 16\max \{\varOmega _0,\varOmega _1\}^2\{1\vee 4\nu ^2\kappa _{2,G}^2(\varOmega _1\vee \kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}})^2\}\). Finally, we note that \(m_k\ne 0\). Otherwise, by (19) we know \({{\bar{\varDelta }}}{{\varvec{x}}}_k = {{\varvec{0}}}\) and \(G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k = {{\varvec{0}}}\). Then, by (9) we have \({{\bar{\varDelta }}}{\varvec{\lambda }}_k = {{\varvec{0}}}\), which contradicts \(\Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k, {{\bar{\varDelta }}}{\varvec{\lambda }}_k)\Vert >0\) in Assumption 3. This completes the proof.
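To illustrate how a Bernstein-type bound of this kind turns an accuracy target into a batch size \(|\xi _f^k|\), here is a back-of-the-envelope sketch. It assumes only the textbook scalar Bernstein tail bound \(P(|{{\bar{X}}}_n - \mu |\ge t)\le 2\exp \{-nt^2/(2\sigma ^2+2Mt/3)\}\) for i.i.d. samples bounded by \(M\) with variance at most \(\sigma ^2\), and inverts it to a sufficient sample size for accuracy \(m_k\) and failure probability \(p_f\); the constants are illustrative and differ from the \(C_f\) derived above.

```python
import math

def bernstein_batch_size(m_k, p_f, sigma2, M):
    """Sufficient n so that P(|mean of n i.i.d. samples - true mean| > m_k) <= p_f,
    assuming each sample deviates from its mean by at most M and has variance <= sigma2.
    Obtained by inverting the scalar Bernstein tail 2*exp(-n*t^2 / (2*sigma2 + 2*M*t/3))."""
    if m_k <= 0:
        raise ValueError("accuracy target must be positive")
    n = (2.0 * sigma2 + 2.0 * M * m_k / 3.0) / m_k**2 * math.log(2.0 / p_f)
    return math.ceil(n)

# Example: tightening the accuracy target m_k inflates the batch roughly like 1/m_k^2.
print(bernstein_batch_size(m_k=0.1, p_f=0.05, sigma2=1.0, M=5.0))
print(bernstein_batch_size(m_k=0.01, p_f=0.05, sigma2=1.0, M=5.0))
```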
1.2 Proof of Proposition 2
Note that
Thus, we know
Using (12) and (36), we upper bound the spectral norm of each block of the matrix on the right-hand side, and define
with \(\kappa _{{{\bar{M}}}}\) given in (36), then the first inequality in (39) is satisfied. Furthermore, we know
Thus, we let
and the second inequality in (39) is satisfied. Moreover,
Thus, we let
and the third inequality in (39) is satisfied. Finally,
By Lemma 4, we know \({{\bar{\mu }}}_{{{\bar{K}}}} \le {{\tilde{\mu }}}\) for a deterministic \({{\tilde{\mu }}}\). Thus, we let
and the fourth inequality in (39) is satisfied. This completes the proof.
1.3 Proof of Lemma 8
We consider the following three cases.
Case 1: reliable step. For the reliable step, we apply Lemma 6 and (43) is changed to
By Line 13 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = (\rho -1){{\bar{\epsilon }}}_k\), while (44) is the same. Thus, by the condition on \(\omega \) in (41) we obtain
Case 2: unreliable step. For the unreliable step, (B.6) is changed to
By Line 15 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = -(1-1/\rho ){{\bar{\epsilon }}}_k\) and (44) is the same. Thus, under the condition on \(\omega \) in (41),
Case 3: unsuccessful step. The result in (45) holds.
Combining (B.7), (B.8), and (45), we have
which completes the proof.
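For orientation, the \({{\bar{\epsilon }}}_k\) updates quoted in the case analysis above (Lines 13 and 15 of Algorithm 3) amount to the rule sketched below. This is only a reconstruction of those two branches from the relations stated in the proof; the unsuccessful branch is governed by Algorithm 3 itself and is not reproduced here.

```python
def update_eps_bar(eps_bar, step_type, rho):
    """eps_bar update implied by the relations used above (rho > 1):
    reliable:   eps_bar_{k+1} - eps_bar_k = (rho - 1) * eps_bar_k    =>  eps_bar * rho
    unreliable: eps_bar_{k+1} - eps_bar_k = -(1 - 1/rho) * eps_bar_k =>  eps_bar / rho
    """
    if step_type == "reliable":
        return rho * eps_bar
    if step_type == "unreliable":
        return eps_bar / rho
    raise NotImplementedError("unsuccessful steps are handled by Algorithm 3 itself")
```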
1.4 Proof of Lemma 9
We consider the following three cases.
Case 1: reliable step. We have
By Line 13 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = (\rho -1){{\bar{\epsilon }}}_k\), and (44) holds as well. By (41), we have
Case 2: unreliable step. The inequality in (B.9) is changed to
We further have \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = -(1-1/\rho ){{\bar{\epsilon }}}_k\) and (44) holds as well. Using (41), we get
Case 3: unsuccessful step. The result in (45) holds.
Combining (B.10), (B.11), and (45), we obtain
which completes the proof.
Additional simulation results
We list the detailed results for each selected CUTEst problem. For each problem and each method, we consider five levels of noise. For all four methods, we report the best result among the different setups. In particular, for NonAdapSQP and \(\ell _1\) SQP, we take the minimum over all six setups of stepsize sequences (four constant sequences and two decaying sequences). For AdapSQP and \(\ell _1\) AdapSQP, we take the minimum over the four setups of constants \(C_{grad} = C_f \in \{1,5,10,50\}\). The reported result is the KKT residual of the last iterate, averaged over the convergent runs among five independent runs. The convergence results are summarized in Tables 1, 2, and 3.
In the tables, the logR column shows \(\log (\Vert \nabla {\mathcal {L}}_k\Vert )\) and the logStd column shows the logarithm of the standard deviation. For each problem, the first line reports AdapSQP, the second line \(\ell _1\) AdapSQP, the third line \(\ell _1\) SQP, and the fourth line NonAdapSQP. The entry ‘\(\diagup \)’ means that the algorithm did not converge within the given budget.
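For concreteness, the following sketch mirrors the aggregation just described for a single table cell: keep the last-iterate KKT residuals of the convergent runs among the five independent runs, then report the log of their average and the log of their standard deviation. Taking the average before the log is our assumption here, the per-method minimum over setups is taken outside this function, and all numbers are placeholders.

```python
import numpy as np

def table_entry(final_kkt_residuals):
    """final_kkt_residuals: last-iterate ||grad L|| from each of 5 independent runs,
    with np.nan marking runs that did not converge within the budget."""
    r = np.asarray(final_kkt_residuals, dtype=float)
    r = r[~np.isnan(r)]                  # keep convergent runs only
    if r.size == 0:
        return "/", "/"                  # corresponds to the '/' entry in the tables
    logR = np.log(r.mean())              # logR column: log of the averaged KKT residual
    logStd = np.log(r.std())             # logStd column: log of the standard deviation
    return logR, logStd

# Placeholder numbers, e.g. one run out of five failed to converge:
print(table_entry([1e-4, 3e-4, 2e-4, np.nan, 1.5e-4]))
```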
Cite this article
Na, S., Anitescu, M. & Kolar, M. An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians. Math. Program. 199, 721–791 (2023). https://doi.org/10.1007/s10107-022-01846-z