
An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians

  • Full Length Paper
  • Series A
  • Published:
Mathematical Programming

Abstract

We consider solving nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We assume that the objective value, gradient, and Hessian are inaccessible, while their stochastic estimates can be computed, for example, by subsampling. We propose a stochastic algorithm based on sequential quadratic programming (SQP) that uses a differentiable exact augmented Lagrangian as the merit function. To motivate our algorithm design, we first revisit and simplify an old SQP method of Lucidi (J. Optim. Theory Appl. 67(2): 227–245, 1990) developed for solving deterministic problems, which serves as the skeleton of our stochastic algorithm. Based on the simplified deterministic algorithm, we then propose a non-adaptive SQP for dealing with a stochastic objective, where the gradient and Hessian are replaced by stochastic estimates but the stepsizes are deterministic and prespecified. Finally, we incorporate a recent stochastic line search procedure of Paquette and Scheinberg (SIAM J. Optim. 30(1): 349–376, 2020) into the non-adaptive stochastic SQP to adaptively select the random stepsizes, which leads to an adaptive stochastic SQP. Global “almost sure” convergence is established for both the non-adaptive and adaptive SQP methods. Numerical experiments on nonlinear problems in the CUTEst test set demonstrate the superiority of the adaptive algorithm.
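
To make the high-level description above concrete, the following is a schematic sketch of one iteration of the non-adaptive stochastic SQP variant, reconstructed from the step definitions (5) and (9) used in the appendix. It is not the authors' implementation: the Hessian approximation \(B_k\), the surrogate used in place of \(M_k\), and all function names are illustrative assumptions.

```python
# Schematic sketch of one non-adaptive stochastic SQP step for
#   min_x E[f(x; xi)]  s.t.  c(x) = 0,
# reconstructed from the step definitions (5)/(9) in the appendix.
# All names (grad_f_sample, hess_f_sample, c, jac_c) and the choices of
# B_k and the surrogate for M_k are illustrative assumptions.
import numpy as np

def stochastic_sqp_step(x, lam, batch, grad_f_sample, hess_f_sample, c, jac_c, alpha_k):
    """Return (x_{k+1}, lam_{k+1}) for a prespecified stepsize alpha_k."""
    g_bar = np.mean([grad_f_sample(x, xi) for xi in batch], axis=0)  # subsampled gradient
    H_bar = np.mean([hess_f_sample(x, xi) for xi in batch], axis=0)  # subsampled Hessian
    ck, Gk = c(x), jac_c(x)                                          # deterministic constraints
    grad_L_bar = g_bar + Gk.T @ lam                                  # stochastic Lagrangian gradient
    Bk = H_bar + 1e-3 * np.eye(x.size)                               # Hessian approximation (assumption)
    # Newton/KKT system:  [B  G^T; G  0] [dx; *] = -[grad_L_bar; c]
    K = np.block([[Bk, Gk.T], [Gk, np.zeros((ck.size, ck.size))]])
    dx = np.linalg.solve(K, -np.concatenate([grad_L_bar, ck]))[: x.size]
    # Dual step as in (9): dlam = -(G G^T)^{-1}(G grad_L_bar + M^T dx); the matrix
    # M_k is replaced here by a crude surrogate purely for illustration.
    M_surrogate = Bk @ Gk.T
    dlam = -np.linalg.solve(Gk @ Gk.T, Gk @ grad_L_bar + M_surrogate.T @ dx)
    return x + alpha_k * dx, lam + alpha_k * dlam
```

In the adaptive variant, the prespecified \(\alpha _k\) is replaced by a random stepsize selected through the stochastic line search on the exact augmented Lagrangian merit function.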


Notes

  1. The implicit cost of deriving the sequence of Lipschitz constants in [3] is not counted here; it nevertheless requires a non-negligible number of objective and constraint evaluations.

References

  1. Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24(3), 1238–1264 (2014). https://doi.org/10.1137/130915984

  2. Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software 35(4), 661–680 (2020). https://doi.org/10.1080/10556788.2020.1725751

  3. Berahas, A.S., Curtis, F.E., Robinson, D., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31(2), 1352–1379 (2021). https://doi.org/10.1137/20m1354556

  4. Bertsekas, D.: Constrained Optimization and Lagrange Multiplier Methods. Elsevier, Belmont, Mass, (1982). https://doi.org/10.1016/c2013-0-10366-2

  5. Bertsekas, D.P.: Projected Newton methods for optimization problems with simple constraints. SIAM J. Control. Optim. 20(2), 221–246 (1982). https://doi.org/10.1137/0320018

  6. Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization 1(2), 92–119 (2019). https://doi.org/10.1287/ijoo.2019.0016

  7. Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312–3343 (2018). https://doi.org/10.1137/17m1154679

  8. Bollapragada, R., Byrd, R.H., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39(2), 545–578 (2018). https://doi.org/10.1093/imanum/dry009

  9. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018). https://doi.org/10.1137/16m1080173

  10. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5

  11. Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2017). https://doi.org/10.1007/s10107-017-1137-4

  12. Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Math. Program. 169(2), 447–487 (2017). https://doi.org/10.1007/s10107-017-1141-8

  13. Curtis, F.E., Shi, R.: A fully stochastic second-order trust region method. Optimization Methods and Software 1–34, (2020). https://doi.org/10.1080/10556788.2020.1852403

  14. Curtis, F.E., Scheinberg, K., Shi, R.: A stochastic trust region algorithm based on careful step normalization. INFORMS Journal on Optimization 1(3), 200–220 (2019). https://doi.org/10.1287/ijoo.2018.0010

  15. De, S., Yadav, A., Jacobs, D., Goldstein, T.: Automated inference with adaptive batches. In: Proceedings of Machine Learning Research, vol 54, PMLR, Fort Lauderdale, FL, USA, 1504–1513 (2017). URL http://proceedings.mlr.press/v54/de17a.html

  16. Dupacova, J., Wets, R.: Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. Ann. Stat. 16(4), 1517–1549 (1988). https://doi.org/10.1214/aos/1176351052

  17. Durrett, R.: Probability: Theory and Examples, 5th edn. Cambridge Series in Statistical and Probabilistic Mathematics, vol 49. Cambridge University Press (2019). https://doi.org/10.1017/9781108591034

  18. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1380–A1405 (2012). https://doi.org/10.1137/110830629

  19. Gallager, R.G.: Stochastic Processes. Cambridge University Press (2013). https://doi.org/10.1017/cbo9781139626514

  20. Gill, P.E., Wong, E.: Sequential quadratic programming methods. In: Mixed Integer Nonlinear Programming, IMA Vol. Math. Appl., vol 154, Springer New York, 147–224 (2011). https://doi.org/10.1007/978-1-4614-1927-3_6

  21. Gill, P.E., Murray, W., Saunders, M.A.: SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Rev. 47(1), 99–131 (2005). https://doi.org/10.1137/s0036144504446096

  22. Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2014). https://doi.org/10.1007/s10589-014-9687-3

  23. Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2017). https://doi.org/10.1093/imanum/drx043

  24. Krejić, N., Krklec, N.: Line search methods with variable sample size for unconstrained optimization. J. Comput. Appl. Math. 245, 213–231 (2013). https://doi.org/10.1016/j.cam.2012.12.020

  25. Lucidi, S.: Recursive quadratic programming algorithm that uses an exact augmented Lagrangian function. J. Optim. Theory Appl. 67(2), 227–245 (1990). https://doi.org/10.1007/bf00940474

  26. Maratos, N.: Exact penalty function algorithms for finite dimensional and control optimization problems. Doctoral dissertation (1978). URL http://hdl.handle.net/10044/1/7283

  27. Nagaraj, N.K., Fuller, W.A.: Estimation of the parameters of linear time series models subject to nonlinear restrictions. Ann. Stat. 19(3), 1143–1154 (1991). https://doi.org/10.1214/aos/1176348242

  28. Nandwani, Y., Pathak, A., Mausam, Singla, P.: A primal dual formulation for deep learning with constraints. In: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 12157–12168 (2019). URL http://papers.nips.cc/paper/9385-a-primal-dual-formulation-for-deep-learning-with-constraints.pdf

  29. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009). https://doi.org/10.1137/070704277

  30. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering, Springer New York (2006). https://doi.org/10.1007/978-0-387-40065-5

  31. Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30(1), 349–376 (2020). https://doi.org/10.1137/18m1216250

  32. Pillo, G.: Exact penalty methods. In: Algorithms for Continuous Optimization, NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., vol 434, Springer Netherlands, 209–253 (1994). https://doi.org/10.1007/978-94-009-0369-2_8

  33. Pillo, G.D., Grippo, L.: A new class of augmented Lagrangians in nonlinear programming. SIAM J. Control. Optim. 17(5), 618–628 (1979). https://doi.org/10.1137/0317044

  34. Pillo, G.D., Grippo, L., Lampariello, F.: A method for solving equality constrained optimization problems by unconstrained minimization. In: Optimization Techniques, Springer-Verlag, Lecture Notes in Control and Information Sci. 23, 96–105 (1980). https://doi.org/10.1007/bfb0006592

  35. Prékopa, A.: Contributions to the theory of stochastic programming. Math. Program. 4(1), 202–221 (1973). https://doi.org/10.1007/bf01584661

  36. Ravi, S.N., Dinh, T., Lokhande, V.S., Singh, V.: Explicitly imposing constraints in deep networks via conditional gradients gives improved generalization and faster convergence. In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), vol 33, 4772–4779 (2019), https://doi.org/10.1609/aaai.v33i01.33014772

  37. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 174(1–2), 293–326 (2018). https://doi.org/10.1007/s10107-018-1346-5

  38. di Serafino, D., Krejić, N., Jerinkić, N.K., Viola, M.: LSOS: Line-search second-order stochastic optimization methods. (2020) arXiv preprint arXiv:2007.15966

  39. Shapiro, A.: On the asymptotics of constrained local $M$-estimators. Ann. Stat. 28(3), 948–960 (2000). https://doi.org/10.1214/aos/1015952006

  40. Siqueira, A.S., Orban, D.: CUTEst.jl (2020). https://github.com/JuliaSmoothOptimizers/CUTEst.jl, https://doi.org/10.5281/ZENODO.1188851

  41. Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in neural information processing systems, 2899–2908 (2018). https://doi.org/10.5555/3327144.3327213

  42. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2011). https://doi.org/10.1007/s10208-011-9099-z

  43. Tropp, J.A.: An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8(1-2), 1–230 (2015). https://doi.org/10.1561/2200000048

  44. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017). https://doi.org/10.1137/15m1053141

  45. Wets, R.: Stochastic programming: Solution techniques and approximation schemes. In: Mathematical Programming The State of the Art, Springer Berlin Heidelberg, 566–603 (1983). https://doi.org/10.1007/978-3-642-68874-4_22

Acknowledgements

We would like to thank the Associate Editor and two anonymous reviewers for their helpful and instructive comments. Sen Na would like to thank Nick Gould for helping with the implementation of CUTEst in Julia. This material was completed in part with resources provided by the University of Chicago Research Computing Center. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347 and by NSF through award CNS-1545046.

Author information

Corresponding author

Correspondence to Sen Na.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Government License: The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan.

Appendices

Proofs of Section 3

1.1 Proof of Lemma 1

The expectation is taken over the randomness of \(\xi _g^k\) and \(\xi _H^k\); that is, the k-th iterate \(({{\varvec{x}}}_k, {\varvec{\lambda }}_k)\) is held fixed. By the unbiasedness condition in Assumption 2, we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g}[{{\bar{\varDelta }}}{{\varvec{x}}}_k]&= {\mathbb {E}}[{{\bar{\varDelta }}}{{\varvec{x}}}_k \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k] {\mathop {=}\limits ^{(9)}} -{\mathbb {E}}\left[ \begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}\left( {\begin{matrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k\\ c_k \end{matrix}}\right) {\bigg \vert }{{\varvec{x}}}_k, {\varvec{\lambda }}_k\right] \\&= -\begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}{\mathbb {E}}\left[ \left( {\begin{matrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k\\ c_k \end{matrix}}\right) {\bigg \vert }{{\varvec{x}}}_k, {\varvec{\lambda }}_k \right] \\&{\mathop {=}\limits ^{(11)}} -\begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}\left( {\begin{matrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\\ c_k \end{matrix}}\right) {\mathop {=}\limits ^{{(5)}}} \varDelta {{\varvec{x}}}_k, \end{aligned}$$

and

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[{{\bar{\varDelta }}}{\varvec{\lambda }}_k]&= {\mathbb {E}}[{{\bar{\varDelta }}}{\varvec{\lambda }}_k\mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k] {\mathop {=}\limits ^{{(9)}}} -{\mathbb {E}}\left[ (G_kG_k^T)^{-1}G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k +(G_kG_k^T)^{-1}{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k\right] \\&= -(G_kG_k^T)^{-1}G_k{\mathbb {E}}[{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k] - (G_kG_k^T)^{-1}{\mathbb {E}}[{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k\mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k]\\&= -(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - (G_kG_k^T)^{-1}{\mathbb {E}}[{{\bar{M}}}_k^T \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k]{\mathbb {E}}[{{\bar{\varDelta }}}{{\varvec{x}}}_k\mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k]\\&{\mathop {=}\limits ^{{(11)}}} -(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - (G_kG_k^T)^{-1}M_k^T\varDelta {{\varvec{x}}}_k\\&{\mathop {=}\limits ^{{(5)}}} \varDelta {\varvec{\lambda }}_k, \end{aligned}$$

where the fourth equality is due to the independence between \(\xi _g^k\) and \(\xi _H^k\). Moreover, we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{\varDelta }}}{{\varvec{x}}}_k\Vert ^2\right]&= \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k\Vert ^2\right] \nonumber \\&{\mathop {=}\limits ^{{(9)}}}\Vert \varDelta {{\varvec{x}}}_k\Vert ^2+ {\mathbb {E}}\bigg [\bigg \Vert \begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\underbrace{\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}}_{=:{\mathcal {K}}}\left( {\begin{matrix} {{\bar{g}}}_k - \nabla f_k\\ {{\varvec{0}}}\end{matrix}}\right) \bigg \Vert ^2{\bigg \vert }{{\varvec{x}}}_k, {\varvec{\lambda }}_k \bigg ] \nonumber \\&{\mathop {\le }\limits ^{{(11)}}} \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \Vert {\mathcal {K}}\Vert ^2\psi . \end{aligned}$$
(A.1)

We now bound \(\Vert {\mathcal {K}}\Vert \). Suppose \(G_k^T = Y_kE_k\) where \(Y_k\) has orthonormal columns spanning \(\text {Image}(G_k^T)\) and \(E_k\) is a square matrix, and suppose \(Z_k\) has orthonormal columns spanning \(\text {Ker}(G_k)\). By Assumption 1, we know \(E_k\) is nonsingular and \(Z_k^TB_kZ_k \succeq \gamma _{RH} I\). By directly verifying \(\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) {\mathcal {K}}= I\), one can show

$$\begin{aligned} {\mathcal {K}}= \left( {\begin{matrix} Z_k(Z_k^TB_kZ_k)^{-1}Z_k^T &{} (I - Z_k(Z_k^TB_kZ_k)^{-1}Z_k^TB_k)Y_kE_k^{-1}\\ E_k^{-1}Y_k^T(I - B_kZ_k(Z_k^TB_kZ_k)^{-1}Z_k^T) &{} E_k^{-1}Y_k^T(B_kZ_k(Z_k^TB_kZ_k)^{-1}Z_k^TB_k - B_k)Y_kE_k^{-1} \end{matrix}}\right) . \end{aligned}$$

Noting that \(\Vert E_k^{-1}\Vert ^2 = \Vert E_k^{-1}(E_k^{-1})^T\Vert = \Vert (G_kG_k^T)^{-1}\Vert {\mathop {\le }\limits ^{(12)}} 1/\kappa _{1,G}\), and \(\Vert (Z_k^TB_kZ_k)^{-1}\Vert \le 1/\gamma _{RH}\), we can bound \(\Vert {\mathcal {K}}\Vert \) by summing the spectral norms of the blocks, and get

$$\begin{aligned} \Vert {\mathcal {K}}\Vert \le \frac{1}{\gamma _{RH}} + \frac{2}{\sqrt{\kappa _{1,G}}}\left( 1 + \frac{\kappa _B}{\gamma _{RH}}\right) + \frac{1}{\kappa _{1,G}}\left( \frac{\kappa _B^2}{\gamma _{RH}} + \kappa _B\right) . \end{aligned}$$

Since \(\kappa _B \ge 1\ge \gamma _{RH}\vee \kappa _{1,G}\), we can simplify the bound to

$$\begin{aligned} \Vert {\mathcal {K}}\Vert \le \frac{5\kappa _B}{\sqrt{\kappa _{1,G}}\gamma _{RH}} + \frac{2\kappa _B^2}{\kappa _{1,G}\gamma _{RH}} \le \frac{7\kappa _B^2}{\kappa _{1,G}\gamma _{RH}}. \end{aligned}$$

Plugging the above inequality into (A.1), we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{\varDelta }}}{{\varvec{x}}}_k\Vert ^2\right] \le \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \frac{49\kappa _B^4}{\kappa _{1,G}^2\gamma _{RH}^2}\psi . \end{aligned}$$
(A.2)

Similarly, we can bound \({\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[\Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2]\) as follows.

$$\begin{aligned}&{\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2\right] \nonumber \\&\quad = \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k - \varDelta {\varvec{\lambda }}_k\Vert ^2\right] \nonumber \\&\quad {\mathop {=}\limits ^{{(9)}}} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \left\| (G_kG_k^T)^{-1}\left( G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k -G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - M_k\varDelta {{\varvec{x}}}_k\right) \right\| ^2 \right] \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{2}{\kappa _{1,G}^2}\left\{ {\mathbb {E}}_{\xi ^k_g}\left[ \left\| G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k - G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\right\| ^2\right] + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \right\} \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}}\Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{2}{\kappa _{1,G}^2}\left\{ \kappa _{2,G} {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{g}}}_k - \nabla f_k\Vert ^2\right] + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \right\} \nonumber \\&\quad {\mathop {\le }\limits ^{{(11)}}} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{2}{\kappa _{1,G}^2}\left\{ \kappa _{2,G} \psi + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \right\} . \end{aligned}$$
(A.3)

For the last term, we have

$$\begin{aligned}&{\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \nonumber \\&\quad = {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert ({{\bar{M}}}_k - M_k)^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k) + M_k^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k) + ({{\bar{M}}}_k - M_k)^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \nonumber \\&\quad = {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert ({{\bar{M}}}_k - M_k)^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2 + \Vert M_k^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2 + \Vert ({{\bar{M}}}_k - M_k)^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}} {\mathbb {E}}_{\xi ^k_H}\left[ \Vert {{\bar{M}}}_k - M_k\Vert ^2\right] {\mathbb {E}}_{\xi ^k_g}\left[ \Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2\right] + \kappa _M^2{\mathbb {E}}_{\xi ^k_g}\left[ \Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2\right] \nonumber \\&\quad \quad \qquad + {\mathbb {E}}_{\xi ^k_H}\left[ \Vert {{\bar{M}}}_k - M_k\Vert ^2\right] \Vert \varDelta {{\varvec{x}}}_k\Vert ^2, \end{aligned}$$
(A.4)

where the second equality is due to the independence between \(\xi _g^k\) and \(\xi _H^k\). We note that

$$\begin{aligned} \Vert {{\bar{M}}}_k - M_k\Vert ^2&{\mathop {\le }\limits ^{(8)}} 2\Vert G_k\Vert ^2\Vert {{\bar{H}}}_k - \nabla ^2f_k\Vert ^2 + 2\Vert {{\bar{T}}}_k - T_k\Vert ^2 \nonumber \\&{\mathop {\le }\limits ^{(8)}} 2\Vert G_k\Vert ^2\Vert {{\bar{H}}}_k - \nabla ^2f_k\Vert ^2 + 2\Vert \nabla f({{\varvec{x}}}_k; \xi _H^k) - \nabla f_k\Vert ^2\sum _{i=1}^{m}\Vert \nabla ^2c_i({{\varvec{x}}}_k)\Vert ^2 \nonumber \\&{\mathop {\le }\limits ^{(12)}} 2\kappa _{2,G}\Vert {{\bar{H}}}_k - \nabla ^2f_k\Vert ^2 + 2\kappa _{\nabla _{{{\varvec{x}}}}^2c}^2\Vert \nabla f({{\varvec{x}}}^k; \xi _H^k) - \nabla f_k\Vert ^2. \end{aligned}$$

Taking expectation over \(\xi _H^k\) on both sides and using (11), we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_H}[\Vert {{\bar{M}}}_k - M_k\Vert ^2] \le 2(\kappa _{2,G}+\kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi . \end{aligned}$$
(A.5)

Combining (A.5) with (A.4) and (A.1), we obtain

$$\begin{aligned}&{\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \\&\quad \le \left\{ 2(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi +\kappa _M^2 \right\} \frac{49\kappa _B^4}{\kappa _{1,G}^2\gamma _{RH}^2}\psi + 2(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi \Vert \varDelta {{\varvec{x}}}_k\Vert ^2. \end{aligned}$$

Plugging the above inequality into (A.3),

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2\right]\le & {} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{4(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi }{\kappa _{1,G}^2}\Vert \varDelta {{\varvec{x}}}_k\Vert ^2 \nonumber \\&+ \left\{ \frac{2\kappa _{2,G}}{\kappa _{1,G}^2} +\frac{98\kappa _B^4}{\kappa _{1,G}^4\gamma _{RH}^2}\left( 2(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi +\kappa _M^2\right) \right\} \psi .\nonumber \\ \end{aligned}$$
(A.6)

Combining (A.6) with (A.2), we can define

$$\begin{aligned} \varUpsilon _0 :=1 + \frac{4(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi }{\kappa _{1,G}^2} \vee \frac{2\kappa _{2,G}}{\kappa _{1,G}^2} + \frac{49\kappa _B^4}{\kappa _{1,G}^2\gamma _{RH}^2} \cdot \left\{ 1+ \frac{4(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi +2\kappa _M^2}{\kappa _{1,G}^2}\right\} \end{aligned}$$
(A.7)

and have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}\right\| ^2\right] \le \varUpsilon _0\left( \left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}\right\| ^2+\psi \right) . \end{aligned}$$

This completes the proof.
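
As a sanity check on the explicit block inverse used above, the following sketch (ours, not the paper's code) assembles \({\mathcal {K}}\) from the displayed formula and verifies it numerically. We take \(E_k = (G_kG_k^T)^{1/2}\), a symmetric choice of the factor in \(G_k^T = Y_kE_k\); this particular factorization is our assumption.

```python
# Numerical sanity check of the explicit block inverse of the KKT matrix.
# Assumption: E = (G G^T)^{1/2} (symmetric), so Y = G^T E^{-1} has orthonormal
# columns spanning Image(G^T) and G^T = Y E.
import numpy as np

rng = np.random.default_rng(0)
n, m = 7, 3
G = rng.standard_normal((m, n))                    # full row rank (w.h.p.)
A = rng.standard_normal((n, n))
B = A @ A.T + np.eye(n)                            # SPD, hence SPD on Ker(G)

w, V = np.linalg.eigh(G @ G.T)
E = V @ np.diag(np.sqrt(w)) @ V.T                  # E = (G G^T)^{1/2}
Y = G.T @ np.linalg.inv(E)                         # orthonormal columns, G^T = Y E
Z = np.linalg.qr(G.T, mode="complete")[0][:, m:]   # orthonormal basis of Ker(G)

P = Z @ np.linalg.inv(Z.T @ B @ Z) @ Z.T           # Z (Z^T B Z)^{-1} Z^T
Einv = np.linalg.inv(E)
K = np.block([
    [P,                                (np.eye(n) - P @ B) @ Y @ Einv],
    [Einv @ Y.T @ (np.eye(n) - B @ P), Einv @ Y.T @ (B @ P @ B - B) @ Y @ Einv],
])
KKT = np.block([[B, G.T], [G, np.zeros((m, m))]])
print(np.allclose(KKT @ K, np.eye(n + m)))         # True: K is the inverse

# Since dx = -[I 0] K [grad_L; c] is linear in the gradient estimate, an
# unbiased gradient estimate yields an unbiased step -- the content of Lemma 1.
```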

1.2 Proof of Theorem 1

We require the following lemmas.

Lemma 10

Let \((\varDelta {{\varvec{x}}}_k, \varDelta {\varvec{\lambda }}_k)\) be solved by (5). Then for any \(\mu , \nu \),

$$\begin{aligned} \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{\mu , \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{\mu , \nu }^k \end{pmatrix}^T \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}=(\varDelta {{\varvec{x}}}_k)^T\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - \mu \Vert c_k\Vert ^2 + c_k^T\varDelta {\varvec{\lambda }}_k - \nu \Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2. \end{aligned}$$

Proof

We have

$$\begin{aligned} \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{\mu , \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{\mu , \nu }^k \end{pmatrix}^T \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}&{\mathop {=}\limits ^{(4)}} (\varDelta {{\varvec{x}}}_k)^T\left( I + \nu M_kG_k\right) \nabla _{{{\varvec{x}}}}{\mathcal {L}}_k + \mu (\varDelta {{\varvec{x}}}_k)^TG_k^Tc_k + (\varDelta {\varvec{\lambda }}_k)^Tc_k \\&\quad + \nu (\varDelta {\varvec{\lambda }}_k)^TG_kG_k^TG_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\\&{\mathop {=}\limits ^{(5)}} (\varDelta {{\varvec{x}}}_k)^T\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - \mu \Vert c_k\Vert ^2 + c_k^T\varDelta {\varvec{\lambda }}_k - \nu \Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2. \end{aligned}$$

This completes the proof. \(\square \)
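
The identity above can also be checked numerically. The sketch below (ours, under stated assumptions, not the paper's code) uses the Pillo–Grippo form \({\mathcal {L}}_{\mu ,\nu }({{\varvec{x}}},{\varvec{\lambda }}) = f + {\varvec{\lambda }}^Tc + \tfrac{\mu }{2}\Vert c\Vert ^2 + \tfrac{\nu }{2}\Vert G\nabla _{{{\varvec{x}}}}{\mathcal {L}}\Vert ^2\) and reconstructs \(M_k = \nabla ^2_{{{\varvec{x}}}}{\mathcal {L}}_kG_k^T + T_k\), with the i-th column of \(T_k\) equal to \(\nabla ^2c_i\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\); both are assumptions on the definitions (4), (5), and (8) stated earlier in the paper.

```python
# Finite-difference check of the Lemma 10 identity on a tiny example.
import numpy as np

rng = np.random.default_rng(1)
n, mu, nu = 3, 2.0, 0.5
Q = np.diag([2.0, 1.0, 3.0])
b = np.array([1.0, -1.0, 0.5])

f      = lambda x: 0.5 * x @ Q @ x - b @ x                 # smooth objective
gradf  = lambda x: Q @ x - b
c      = lambda x: np.array([x @ x - 1.0])                 # one nonlinear constraint
G      = lambda x: 2.0 * x.reshape(1, -1)                  # Jacobian of c
hess_c = 2.0 * np.eye(n)                                   # Hessian of c_1

def merit(x, lam):
    """Assumed exact augmented Lagrangian L_{mu,nu}(x, lam)."""
    gL = gradf(x) + G(x).T @ lam
    return (f(x) + lam @ c(x) + 0.5 * mu * c(x) @ c(x)
            + 0.5 * nu * np.sum((G(x) @ gL) ** 2))

x, lam = rng.standard_normal(n), rng.standard_normal(1)
gL  = gradf(x) + G(x).T @ lam                              # grad_x L
H_L = Q + lam[0] * hess_c                                  # Hessian of the Lagrangian
M   = H_L @ G(x).T + (hess_c @ gL).reshape(-1, 1)          # reconstructed M_k
Bk  = Q + np.eye(n)                                        # any nonsingular choice works here

KKT  = np.block([[Bk, G(x).T], [G(x), np.zeros((1, 1))]])
dx   = np.linalg.solve(KKT, -np.concatenate([gL, c(x)]))[:n]
dlam = -np.linalg.solve(G(x) @ G(x).T, G(x) @ gL + M.T @ dx)

t = 1e-6                                                   # directional finite difference
lhs = (merit(x + t * dx, lam + t * dlam) - merit(x - t * dx, lam - t * dlam)) / (2 * t)
rhs = dx @ gL - mu * c(x) @ c(x) + c(x) @ dlam - nu * np.sum((G(x) @ gL) ** 2)
print(lhs, rhs)                                            # agree up to O(t^2)
```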

Lemma 11

Under Assumption 1, we have

$$\begin{aligned} \left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}\right\| ^2 \le \frac{3\kappa _M^2}{\kappa _{1,G}^2}\left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2. \end{aligned}$$

Proof

By (5), we know

$$\begin{aligned} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 =&\Vert (G_kG_k^T)^{-1}(G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k+M_k^T\varDelta {{\varvec{x}}}_k)\Vert ^2{\mathop {\le }\limits ^{(12)}}\frac{2}{\kappa _{1,G}^2}(\Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2 + \kappa _M^2\Vert \varDelta {{\varvec{x}}}_k\Vert ^2)\\ \le&\,\, \frac{2\kappa _M^2}{\kappa _{1,G}^2}(\Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2 + \Vert \varDelta {{\varvec{x}}}_k\Vert ^2).\text {(since }\kappa _M\ge 1\ge \kappa _{1,G}) \end{aligned}$$

Adding \(\Vert \varDelta {{\varvec{x}}}_k\Vert ^2\) to both sides and noting that \(1 + 2\kappa _M^2/\kappa _{1,G}^2 \le 3\kappa _M^2/\kappa _{1,G}^2\) (since \(\kappa _M\ge 1\ge \kappa _{1,G}\)) gives the stated bound. This completes the proof. \(\square \)

Now we are ready to prove Theorem 1. We note that

$$\begin{aligned} \Vert \nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2&= \Vert (I - G_k^T(G_kG_k^T)^{-1}G_k)\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2 + \Vert G_k^T(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2\\&{\mathop {=}\limits ^{{(5)}}} \Vert (I - G_k^T(G_kG_k^T)^{-1}G_k)B_k\varDelta {{\varvec{x}}}_k\Vert ^2 + \Vert G_k^T(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2\\&{\mathop {\le }\limits ^{{(12)}}} \kappa _B^2\Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \frac{1}{\kappa _{1,G}}\Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2,\\ \Vert c_k\Vert ^2&{\mathop {=}\limits ^{{(5)}}}\Vert G_k\varDelta {{\varvec{x}}}_k\Vert ^2 {\mathop {\le }\limits ^{(12)}}\kappa _{2,G}\Vert \varDelta {{\varvec{x}}}_k\Vert ^2. \end{aligned}$$

Since \(\kappa _{2,G}\wedge \kappa _B \ge 1\ge \kappa _{1,G}\),

$$\begin{aligned} \Vert \nabla {\mathcal {L}}_k\Vert ^2\le \frac{\kappa _B^2+\kappa _{2,G}}{\kappa _{1,G}}\left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2. \end{aligned}$$

Plugging the above inequality into (15), we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[{\mathcal {L}}_{\mu , \nu }^{k+1}]\le {\mathcal {L}}_{\mu , \nu }^k - \frac{\alpha _k{\tilde{\delta }}\kappa _{1,G}}{2(\kappa _B^2+\kappa _{2,G})}\Vert \nabla {\mathcal {L}}_k\Vert ^2 + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\alpha _k^2. \end{aligned}$$
(A.8)

Case 1: constant stepsize. If \(\alpha _k = \alpha \), we take full expectation of (A.8), sum over \(k = 0,\ldots ,K\), and have

$$\begin{aligned} \min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu } - {\mathcal {L}}_{\mu , \nu }^0 \le - \frac{{\tilde{\delta }}\kappa _{1,G}}{2(\kappa _B^2+\kappa _{2,G})}\cdot \alpha \sum _{k=0}^{K}{\mathbb {E}}[\Vert \nabla {\mathcal {L}}_k\Vert ^2] + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi (K+1)}{2}\alpha ^2. \end{aligned}$$

Rearranging the above inequality leads to

$$\begin{aligned} \frac{1}{K+1}\sum _{k=0}^{K}{\mathbb {E}}[\Vert \nabla {\mathcal {L}}_k\Vert ^2]\le & {} \frac{2(\kappa _B^2+\kappa _{2,G})}{{\tilde{\delta }} \kappa _{1,G}}\cdot \frac{{\mathcal {L}}_{\mu , \nu }^0 - \min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu }}{(K+1)\alpha }\\&+ \frac{\varUpsilon _0(\kappa _B^2+\kappa _{2,G})\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{{\tilde{\delta }}\kappa _{1,G}}\alpha . \end{aligned}$$

Case 2: decaying stepsize. Let us define two sequences

$$\begin{aligned} s_k = {\mathcal {L}}_{\mu , \nu }^k + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\sum _{i=k}^{\infty }\alpha _i^2,\quad \quad e_k = \frac{\alpha _k{{\tilde{\delta }}}\kappa _{1,G}}{2(\kappa _B^2 +\kappa _{2,G})}\Vert \nabla {\mathcal {L}}_k\Vert ^2. \end{aligned}$$

From (A.8), we know

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[s_{k+1}]&= {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[{\mathcal {L}}_{\mu , \nu }^{k+1}] + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\sum _{i=k+1}^{\infty }\alpha _i^2\\&\le {\mathcal {L}}_{\mu , \nu }^k - \frac{\alpha _k{\tilde{\delta }}\kappa _{1,G}}{2(\kappa _B^2+\kappa _{2,G})}\Vert \nabla {\mathcal {L}}_k\Vert ^2 + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\sum _{i=k}^{\infty }\alpha _i^2 = s_k - e_k. \end{aligned}$$

Since \(e_k \ge 0\), \(\{s_k -\min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu }\}_k\) is a positive supermartingale. By [17, Theorem 4.2.12], we have \(s_k\rightarrow s\) almost surely for some random variable s with \({\mathbb {E}}[s]\le s_0<\infty \). Therefore, \({\mathbb {E}}[\sum _{k=0}^{\infty }e_k] = \sum _{k=0}^{\infty }{\mathbb {E}}\left[ e_k\right] \le \sum _{k=0}^{\infty }\left( {\mathbb {E}}[s_k] - {\mathbb {E}}[s_{k+1}]\right) <\infty \), where the first equality is due to the Fubini-Tonelli theorem [17, Theorem 1.7.2]. The above display implies that \(\sum _{k=0}^{\infty }e_k<\infty \) almost surely. Moreover, since \(\sum _{k=0}^{\infty }\alpha _k = \infty \), we obtain \(\sum _{k=0}^{\infty }e_k/\sum _{k=0}^{\infty }\alpha _k = 0\) and \(\liminf _{k\rightarrow \infty }\Vert \nabla {\mathcal {L}}_k\Vert = 0\) almost surely. This completes the proof.
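
For concreteness, one admissible decaying-stepsize schedule (an illustrative choice satisfying the two summability conditions used above, not one prescribed by the paper) is

$$\begin{aligned} \alpha _k = \frac{\alpha _0}{(k+1)^{p}} \quad \text {with } p\in (1/2, 1], \qquad \text {so that } \sum _{k=0}^{\infty }\alpha _k = \infty \quad \text {and}\quad \sum _{k=0}^{\infty }\alpha _k^2<\infty . \end{aligned}$$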

Proofs of Section 4

1.1 Proof of Lemma 3

Let \(P_{\xi _f^k}(\cdot ) = P(\cdot \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k,{{\bar{\varDelta }}}{{\varvec{x}}}_k, {{\bar{\varDelta }}}{\varvec{\lambda }}_k)\) be the conditional probability over randomness in \(\xi _f^k\). We have that

$$\begin{aligned}&\left| {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k \right| \nonumber \\&\quad {\mathop {\le }\limits ^{{(21)}}} \left| {{\bar{f}}}_k - f_k\right| + \frac{\nu }{2}\left| \Vert G_k({{\bar{\nabla }}}f_k + G_k^T{\varvec{\lambda }}_k) \Vert ^2 - \Vert G_k(\nabla f_k + G_k^T{\varvec{\lambda }}_k) \Vert ^2\right| \nonumber \\&\quad \le \left| {{\bar{f}}}_k - f_k\right| + \frac{\nu }{2}\Vert G_k\Vert ^2\left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| ^2 + \nu \Vert G_k(\nabla f_k + G_k^T{\varvec{\lambda }}_k)\Vert \Vert G_k\Vert \Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}} \left| {{\bar{f}}}_k - f_k\right| + \nu \kappa _{2,G}\left( \left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| + \kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) \Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert . \end{aligned}$$
(B.1)

Let us denote

$$\begin{aligned} m_k = - \kappa _f{{\bar{\alpha }}}_k^2\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}, \end{aligned}$$

then the above display implies that \(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k |\le m_k\) by requiring

$$\begin{aligned} \left| {{\bar{f}}}_k - f_k\right| \le \frac{m_k}{2} \quad \text { and}\quad \left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| \le \frac{m_k}{4\nu \kappa _{2,G}\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}}\wedge 1. \end{aligned}$$

The above condition is further implied by requiring

$$\begin{aligned} \left| {{\bar{f}}}_k - f_k\right| \vee \left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| \le \frac{m_k\wedge 1}{4\nu \kappa _{2,G}\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}} \vee 2} . \end{aligned}$$

Thus, we apply the Bernstein inequality [43, Theorem 6.1.1] and have that if

$$\begin{aligned} |\xi _f^k| \ge \frac{16\max \{\varOmega _0, \varOmega _1\}^2 (4\nu ^2\kappa _{2,G}^2\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}^2\vee 1) }{m_k^2\wedge 1 }\log \left( \frac{8d}{p_f}\right) , \end{aligned}$$
(B.2)

then

$$\begin{aligned} P_{\xi _f^k}(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k|> m_k)\le p_f/2. \end{aligned}$$

Under (B.2), we also have \(P_{\xi _f^k}(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^{s_k}|> m_k )\le p_f/2\), which implies the condition (23) holds. As for the condition (24), we apply (B.1) to obtain

$$\begin{aligned}&{\mathbb {E}}_{\xi _f^k}\left[ \left| {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k \right| ^2\right] \\&\quad \le 2{\mathbb {E}}_{\xi _f^k}[|{{\bar{f}}}_k - f_k|^2] + 2\nu ^2\kappa _{2,G}^2\left( \varOmega _1+\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) ^2 {\mathbb {E}}_{\xi _f^k}[\Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert ^2]\\&\quad \le \frac{2\varOmega _0^2 + 2\nu ^2\kappa _{2,G}^2\varOmega _1^2\left( \varOmega _1+\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) ^2}{|\xi _f^k|}, \end{aligned}$$

where the last inequality applies the variance bound for the sample average, that is \({\mathbb {E}}[|{{\bar{f}}}_k - f_k|^2] = {\mathbb {E}}[|f({{\varvec{x}}}_k; \xi ) - f_k|^2]/|\xi _f^k| \le \varOmega _0^2/|\xi _f^k|\) (and similar for \({\mathbb {E}}[\Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert ^2]\)). Following the same calculation, we can show the same bound for \(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^{s_k} |^2\). Thus, the condition (24) holds if

$$\begin{aligned} |\xi _f^k| \ge \frac{2\varOmega _0^2 + 2\nu ^2\kappa _{2,G}^2\varOmega _1^2\left( \varOmega _1+\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) ^2}{{{\bar{\epsilon }}}_k^2}. \end{aligned}$$
(B.3)

Combining (B.2) and (B.3) and taking the maximum of the right-hand sides, the conditions (23) and (24) are satisfied simultaneously when

$$\begin{aligned} |\xi _f^k| \ge \frac{C_f\log \left( \frac{8d}{p_f}\right) }{m_k^2\wedge {{\bar{\epsilon }}}_k^2\wedge 1}, \end{aligned}$$
(B.4)

with \(C_f = 16\max \{\varOmega _0,\varOmega _1\}^2\{1\vee 4\nu ^2\kappa _{2,G}^2(\varOmega _1\vee \kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}})^2\}\). Finally, we note that \(m_k\ne 0\). Otherwise, by (19) we know \({{\bar{\varDelta }}}{{\varvec{x}}}_k = {{\varvec{0}}}\) and \(G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k = {{\varvec{0}}}\). Then, by (9) we have \({{\bar{\varDelta }}}{\varvec{\lambda }}_k = {{\varvec{0}}}\), which contradicts \(\Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k, {{\bar{\varDelta }}}{\varvec{\lambda }}_k)\Vert >0\) in Assumption 3. This completes the proof.
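
To illustrate how the sample-size condition (B.4) behaves, the following small sketch (a hypothetical helper, not part of the paper's code) evaluates the required batch size for a few values of \(m_k\) and \({{\bar{\epsilon }}}_k\); the constants used are arbitrary.

```python
# Evaluate the sample-size rule (B.4):
#   |xi_f^k| >= C_f * log(8 d / p_f) / min(m_k^2, eps_k^2, 1).
# Hypothetical helper for illustration; the constants below are arbitrary.
import math

def batch_size_rule(C_f, d, p_f, m_k, eps_k):
    """Smallest integer batch size satisfying (B.4)."""
    return math.ceil(C_f * math.log(8 * d / p_f) / min(m_k**2, eps_k**2, 1.0))

for m_k, eps_k in [(1.0, 1.0), (0.1, 0.5), (0.01, 0.05)]:
    print(m_k, eps_k, batch_size_rule(C_f=10.0, d=50, p_f=0.1, m_k=m_k, eps_k=eps_k))
```

As the sketch shows, the batch size grows rapidly as either \(m_k\) or \({{\bar{\epsilon }}}_k\) shrinks, which is the mechanism by which the adaptive scheme tightens its function estimates near stationarity.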

1.2 Proof of Proposition 2

Note that

$$\begin{aligned}&{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + \nu {{\bar{M}}}_k G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + G_k^Tc_k \\&\quad = (I - G_k^T(G_kG_k^T)^{-1}G_k){{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + G_k^T(G_kG_k^T)^{-1}G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + \nu {{\bar{M}}}_k G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}^k + G_k^Tc_k\\&\quad {\mathop {=}\limits ^{(9)}} - (I - G_k^T(G_kG_k^T)^{-1}G_k)B_k{{\bar{\varDelta }}}{{\varvec{x}}}_k \\&\qquad + \left( G_k^T(G_kG_k^T)^{-1} + \nu {{\bar{M}}}_k\right) G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k - G_k^TG_k{{\bar{\varDelta }}}{{\varvec{x}}}_k. \end{aligned}$$

Thus, we know

$$\begin{aligned}&\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + \nu {{\bar{M}}}_k G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + G_k^Tc_k\\ \nu G_kG_k^TG_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix} \nonumber \\&\quad = \begin{pmatrix} -(I - G_k^T(G_kG_k^T)^{-1}G_k)B_k - G_k^TG_k &{} G_k^T(G_kG_k^T)^{-1} + \nu {{\bar{M}}}_k\\ {{\varvec{0}}}&{} \nu G_kG_k^T \end{pmatrix}\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}.\nonumber \\ \end{aligned}$$
(B.5)

Using (12) and (36), we upper bound the spectral norm of each block of the matrix on the right-hand side, and define

$$\begin{aligned} \varUpsilon _1 = \kappa _B+\kappa _{2,G} + \frac{1}{\sqrt{\kappa _{1,G}}} + \nu (\kappa _{{{\bar{M}}}} + \kappa _{2,G}) \end{aligned}$$

with \(\kappa _{{{\bar{M}}}}\) given in (36); then the first inequality in (39) is satisfied. Furthermore, we know

$$\begin{aligned} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix} {\mathop {=}\limits ^{(9)}} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ -{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k - G_kG_k^T{{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} = \begin{pmatrix} I &{} {{\varvec{0}}}\\ -{{\bar{M}}}_k^T &{} -G_kG_k^T \end{pmatrix}\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}. \end{aligned}$$

Thus, we let

$$\begin{aligned} \varUpsilon _2 = \frac{1}{1 + \kappa _{{{\bar{M}}}} + \kappa _{2,G}} \end{aligned}$$

and the second inequality in (39) is satisfied. Moreover,

$$\begin{aligned} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}&{\mathop {=}\limits ^{(9)}} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ -(G_kG_k^T)^{-1}{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k - (G_kG_k^T)^{-1}G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix} \\&= \begin{pmatrix} I &{} {{\varvec{0}}}\\ -(G_kG_k^T)^{-1}{{\bar{M}}}_k^T &{} - (G_kG_k^T)^{-1} \end{pmatrix}\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}. \end{aligned}$$

Thus, we let

$$\begin{aligned} \varUpsilon _3 = 1 + \frac{\kappa _{{{\bar{M}}}}+1}{\kappa _{1,G}} \end{aligned}$$

and the third inequality in (39) is satisfied. Finally,

$$\begin{aligned} \begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}&{\mathop {=}\limits ^{(20)}} \begin{pmatrix} \left( I+\nu {{\bar{M}}}_kG_k\right) {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + {{\bar{\mu }}}_{{{\bar{K}}}}G_k^Tc_k\\ c_k + \nu G_kG_k^TG_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\\&{\mathop {=}\limits ^{(B.5)}}\begin{pmatrix} -(I - G_k^T(G_kG_k^T)^{-1}G_k)B_k - {{\bar{\mu }}}_{{{\bar{K}}}}G_k^TG_k &{} G_k^T(G_kG_k^T)^{-1} + \nu {{\bar{M}}}_k\\ -G_k &{} \nu G_kG_k^T \end{pmatrix}\\&\quad \qquad \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}. \end{aligned}$$

By Lemma 4, we know \({{\bar{\mu }}}_{{{\bar{K}}}} \le {{\tilde{\mu }}}\) with a deterministic \({{\tilde{\mu }}}\). Thus, we let

$$\begin{aligned} \varUpsilon _4 = \kappa _B + ({{\tilde{\mu }}}+1)\kappa _{2,G} +\frac{1}{\sqrt{\kappa _{1,G}}} + \nu (\kappa _{{{\bar{M}}}}+\kappa _{2,G}) \end{aligned}$$

and the fourth inequality in (39) is satisfied. This completes the proof.

1.3 Proof of Lemma 8

We consider the following three cases.

Case 1: reliable step. For the reliable step, we apply Lemma 6, and (43) becomes

$$\begin{aligned} {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k&\le \frac{{{\bar{\alpha }}}_k\beta }{2}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}\nonumber \\&{\mathop {\le }\limits ^{(26)}} \frac{{{\bar{\alpha }}}_k\beta }{4}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} - \frac{{{\bar{\epsilon }}}_k}{4} \nonumber \\&{\mathop {\le }\limits ^{(19)}} -\frac{{{\bar{\alpha }}}_k\beta (\gamma _{RH}\wedge \nu )}{8}\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{k} \end{pmatrix}\right\| ^2 - \frac{{{\bar{\epsilon }}}_k}{4}. \end{aligned}$$
(B.6)

By Line 13 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = (\rho -1){{\bar{\epsilon }}}_k\), while (44) remains unchanged. Thus, by the condition on \(\omega \) in (41) we obtain

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 - \frac{\omega {{\bar{\epsilon }}}_k}{8}\nonumber \\&+ \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2. \end{aligned}$$
(B.7)

Case 2: unreliable step. For the unreliable step, (B.6) becomes

$$\begin{aligned} {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \le \frac{{{\bar{\alpha }}}_k\beta }{2}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} {\mathop {\le }\limits ^{(19)}} -\frac{\beta (\gamma _{RH}\wedge \nu )}{4}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2. \end{aligned}$$

By Line 15 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = -(1-1/\rho ){{\bar{\epsilon }}}_k\), while (44) remains unchanged. Thus, under the condition on \(\omega \) in (41),

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{k} \end{pmatrix}\right\| ^2\nonumber \\&-\frac{1}{2}(1-\omega )\left( 1-\frac{1}{\rho }\right) {{\bar{\epsilon }}}_k + \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2.\nonumber \\ \end{aligned}$$
(B.8)

Case 3: unsuccessful step. The result in (45) holds.

Combining (B.7), (B.8), and (45), we have

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k \le \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2, \end{aligned}$$

which completes the proof.

1.4 Proof of Lemma 9

We consider the following three cases.

Case 1: reliable step. We have

$$\begin{aligned}&{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \nonumber \\&\quad \le |{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&\quad {\mathop {\le }\limits ^{(25)}} {{\bar{\alpha }}}_k\beta \begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&\quad {\mathop {\le }\limits ^{(26)}} \frac{{{\bar{\alpha }}}_k\beta }{2}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} -\frac{{{\bar{\epsilon }}}_k}{2} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&\quad {\mathop {\le }\limits ^{(19)}} -\frac{\beta {{\bar{\alpha }}}_k(\gamma _{RH}\wedge \nu )}{4}\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 - \frac{{{\bar{\epsilon }}}_k}{2} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k|. \end{aligned}$$
(B.9)

By Line 13 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = (\rho -1){{\bar{\epsilon }}}_k\), and (44) holds as well. By (41), we have

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k&\le -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 \nonumber \\&\quad - \frac{\omega {{\bar{\epsilon }}}_k}{8} + \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2 \nonumber \\&\quad +\omega ( |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| ). \end{aligned}$$
(B.10)

Case 2: unreliable step. The inequality in (B.9) becomes

$$\begin{aligned} {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k&\le |{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&{\mathop {\le }\limits ^{(25)}} {{\bar{\alpha }}}_k\beta \begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&{\mathop {\le }\limits ^{(19)}} -\frac{\beta {{\bar{\alpha }}}_k(\gamma _{RH}\wedge \nu )}{2}\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}-{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}|\\&\quad \quad + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k|. \end{aligned}$$

We further have \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = -(1-1/\rho ){{\bar{\epsilon }}}_k\) and (44) holds as well. Using (41), we get

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 - \frac{1}{2}(1-\omega )\left( 1-\frac{1}{\rho }\right) {{\bar{\epsilon }}}_k\nonumber \\&+ \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2 +\omega ( |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| \nonumber \\&+ |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| ). \end{aligned}$$
(B.11)

Case 3: unsuccessful step. The result in (45) holds.

Combining (B.10), (B.11), and (45), we obtain

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2\\&+ \omega \left( |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \right) , \end{aligned}$$

which completes the proof.

Additional simulation results

We list the detailed results for each selected CUTEst problem. For each problem and each method, we consider five noise levels. For all four methods, we report the best result among the different setups. In particular, for NonAdapSQP and \(\ell _1\) SQP, we take the minimum over all six setups of stepsize sequences (four constant sequences and two decaying sequences); for AdapSQP and \(\ell _1\) AdapSQP, we take the minimum over the four setups of constants \(C_{grad}, C_f \in \{1,5,10,50\}\). The reported quantity is the KKT residual of the last iterate, averaged over the convergent runs among five independent runs. The convergence results are summarized in Tables 1, 2, and 3.

In the tables, the logR column shows \(\log (\Vert \nabla {\mathcal {L}}_k\Vert )\) and the logStd column shows the logarithm of the standard deviation. For each problem, the first row gives the result of AdapSQP, the second row of \(\ell _1\) AdapSQP, the third row of \(\ell _1\) SQP, and the fourth row of NonAdapSQP. The entry ‘\(\diagup \)’ means that the algorithm did not converge within the given budget.
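
The aggregation just described can be summarized by the following sketch (illustrative only; names and data are hypothetical, and this is not the paper's experiment pipeline).

```python
# Hypothetical sketch of the aggregation described above: average the final KKT
# residual over convergent runs for each setup, then report the best setup on a
# log scale.  NaN marks a run that did not converge within the budget.
import numpy as np

def summarize(kkt_by_setup):
    """kkt_by_setup: list of 1-D arrays of final KKT residuals, one array per setup."""
    best_logR, best_logStd = np.inf, np.inf
    for runs in kkt_by_setup:
        ok = runs[~np.isnan(runs)]                 # keep convergent runs only
        if ok.size == 0:
            continue                               # this setup never converged
        logR = np.log(ok.mean())
        if logR < best_logR:
            best_logR = logR
            best_logStd = np.log(ok.std()) if ok.size > 1 else float("nan")
    return best_logR, best_logStd                  # (inf, inf) if no setup converged
```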

Table 1 KKT results summary
Table 2 KKT results summary
Table 3 KKT results summary

About this article

Cite this article

Na, S., Anitescu, M. & Kolar, M. An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians. Math. Program. 199, 721–791 (2023). https://doi.org/10.1007/s10107-022-01846-z

  • DOI: https://doi.org/10.1007/s10107-022-01846-z
