Abstract
This paper is devoted to studying an augmented Lagrangian method for solving a class of manifold optimization problems, which have nonsmooth objective functions and nonlinear constraints. Under the constant positive linear dependence condition on manifolds, we show that the proposed method converges to a stationary point of the nonsmooth manifold optimization problem. Moreover, we propose a globalized semismooth Newton method to solve the augmented Lagrangian subproblem on manifolds efficiently. The local superlinear convergence of the manifold semismooth Newton method is also established under some suitable conditions. We also prove that the semismoothness on submanifolds can be inherited from that in the ambient manifold. Finally, numerical experiments on compressed modes and (constrained) sparse principal component analysis illustrate the advantages of the proposed method.
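For illustration, a schematic and purely toy sketch of an augmented Lagrangian outer loop on a manifold is given below, with a smooth objective on the unit sphere and a crude first-order inner solver in place of the semismooth Newton subproblem solver; the functions, step sizes and parameters are arbitrary choices for exposition and are not those used in the paper.

```python
import numpy as np

# Toy sketch (not the authors' algorithm): augmented Lagrangian loop for
#   min_{x in S^2} x^T A x   s.t.   c^T x = 0,
# where the subproblem is solved approximately by Riemannian gradient descent.

def proj_tangent(x, u):            # projection onto the tangent space of S^2 at x
    return u - (x @ u) * x

def retract(x, v):                 # projection retraction back onto the sphere
    y = x + v
    return y / np.linalg.norm(y)

def alm(f_grad, h, h_jac, x0, lam0, rho=10.0, outer=20, inner=200, lr=1e-2):
    x, lam = x0.copy(), lam0.copy()
    for _ in range(outer):
        for _ in range(inner):     # approximate augmented Lagrangian subproblem solve
            g = f_grad(x) + h_jac(x).T @ (lam + rho * h(x))
            x = retract(x, -lr * proj_tangent(x, g))
        lam = lam + rho * h(x)     # first-order multiplier update
    return x, lam

A = np.diag([3.0, 1.0, 2.0])
c = np.array([0.0, 1.0, 0.0])
f_grad = lambda x: 2.0 * A @ x
h = lambda x: np.array([c @ x])
h_jac = lambda x: c.reshape(1, -1)
x0 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
x, lam = alm(f_grad, h, h_jac, x0, np.zeros(1))
print(x, h(x))                     # x is (approximately) feasible and near (0, 0, 1)
```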
Notes
The subscript p in \(\left\langle \cdot , \cdot \right\rangle _p\) is usually omitted for simplicity.
An affine connection is \(\mathcal {D}(\mathcal {M})\)-linear w.r.t. X, \(\mathbb {R}\)-linear w.r.t. Y, and satisfies the product rule; see Chapter 5 in [2].
We say a function f on a manifold is locally Lipschitz if \(f \circ \varphi ^{-1}\) is locally Lipschitz on \(\varphi (U)\) for every chart \((U, \varphi )\).
This is well-defined in a small neighborhood of q [19, Proposition 3.2.9].
Let \(g: \mathcal {M}\rightarrow \mathbb {R}^m\) and \(f: \mathbb {R}^m \rightarrow \mathbb {R}\) be two smooth maps; we can use Definition 2.4 to compute the gradient of \(f \circ g\). For \(p \in \mathcal {M}\), \(\xi _p \in T_p\mathcal {M}\), choosing any curve \(\gamma : (-1, 1) \rightarrow \mathcal {M}\) such that \(\gamma (0) = p\) and \({\dot{\gamma }}(0) = \xi _p\), we have \(\left\langle \xi _p, {{\,\mathrm{grad}\,}}\,(f\circ g)(p) \right\rangle = \xi _p (f \circ g) = \left. \frac{\mathrm {d}(f \circ g \circ \gamma )(t)}{\mathrm {d}t}\right| _{t=0} = \sum _{i=1}^m\alpha _i\left. \frac{\mathrm {d}(g_i \circ \gamma )(t)}{\mathrm {d}t}\right| _{t=0} = \sum _{i=1}^m \alpha _i \cdot (\xi _p g_i) = \sum _{i=1}^m \alpha _i \left\langle \xi _p, {{\,\mathrm{grad}\,}}\,[g(p)]_i \right\rangle \), where \(\alpha := \nabla f(g(p))\) and \(g_i\) is the i-th component of g. Thus, \({{\,\mathrm{grad}\,}}\, (f \circ g)(p) = \sum _{i=1}^m \alpha _i {{\,\mathrm{grad}\,}}\, [g(p)]_i\).
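For a concrete (toy) check of this identity, consider the unit sphere embedded in \(\mathbb {R}^n\), where the Riemannian gradient of a restricted function is the tangential projection of its Euclidean gradient; the maps f and g below are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Toy check of grad(f∘g)(p) = sum_i alpha_i grad g_i(p) on the unit sphere S^{n-1},
# using grad_M h(p) = P_p(∇h(p)) with P_p = I - p p^T (induced metric).

np.random.seed(0)
n, m = 5, 3
A = np.random.randn(m, n)          # g(p) = A p, with components g_i
w = np.random.randn(m)             # f(y) = 0.5 * (w . y)^2

p = np.random.randn(n); p /= np.linalg.norm(p)    # a point on the sphere
P = np.eye(n) - np.outer(p, p)                    # projector onto T_p S^{n-1}

alpha = (w @ (A @ p)) * w          # Euclidean gradient ∇f(g(p))

grad_comp = P @ (A.T @ alpha)                          # grad (f∘g)(p)
grad_sum = sum(alpha[i] * (P @ A[i]) for i in range(m))  # sum_i alpha_i grad g_i(p)
print(np.allclose(grad_comp, grad_sum))                # True
```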
We say the map \(\mathcal {K}\) is upper semicontinuous if for every \(\varepsilon > 0\) there exists \( \delta >0\) such that for all \(q \in B_\delta (p)\) we have \(P_{qp}\mathcal {K}(q) \subset \mathcal {K}(p) + {\hat{B}}_\varepsilon (0)\), where \({\hat{B}}_\varepsilon (0) := \{ V \in \mathcal {L}(T_p\mathcal {M}) : \left\| V \right\| < \varepsilon \}\).
Since \(\gamma \) is a geodesic, its velocity field is parallel along \(\gamma \), i.e., \(\nabla _{{\dot{\gamma }}} {\dot{\gamma }} = 0\), so \(\frac{\mathrm {d}}{\mathrm {d}t} \Vert {\dot{\gamma }}(t) \Vert ^2 = 2\left\langle \nabla _{{\dot{\gamma }}}{\dot{\gamma }}, {\dot{\gamma }} \right\rangle \equiv 0\). Therefore, we have \(\Vert {\dot{\gamma }}(t)\Vert = \Vert {\dot{\gamma }}(0) \Vert = \Vert V \Vert \) and \(\ell (\gamma |_{[s,t]}) = \int _s^t \Vert {\dot{\gamma }}(u) \Vert \mathrm {d}u = |t - s| \Vert {\dot{\gamma }}(0)\Vert \).
We only consider our method in this setting since the other algorithms mentioned in Table 2 have difficulty reaching these stopping conditions. These observations remain valid, though less pronounced, under the original termination conditions.
References
Absil, P.-A., Hosseini, S.: A collection of nonsmooth Riemannian optimization problems. In: Nonsmooth Optimization and its Applications, Springer, pp. 1–15 (2019)
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)
Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22, 135–158 (2012)
Adler, R.L., Dedieu, J.-P., Margulies, J.Y., Martens, M., Shub, M.: Newton’s method on Riemannian manifolds and a geometric model for the human spine. IMA J. Numer. Anal. 22, 359–390 (2002)
Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18, 1286–1309 (2008)
Andreani, R., Haeser, G., Schuverdt, M.L., Silva, P.J.: A relaxed constant positive linear dependence constraint qualification and applications. Math. Program. 135, 255–273 (2012)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)
Azagra, D., Ferrera, J., López-Mesas, F.: Nonsmooth analysis and Hamilton-Jacobi equations on Riemannian manifolds. J. Funct. Anal. 220, 304–361 (2005)
Bacák, M., Bergmann, R., Steidl, G., Weinmann, A.: A second order nonsmooth variational model for restoring manifold-valued images. SIAM J. Sci. Comput. 38, A567–A597 (2016)
Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (1982)
Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48, 334–334 (1997)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)
Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer, Berlin (2000)
Borckmans, P.B., Selvan, S.E., Boumal, N., Absil, P.-A.: A Riemannian subgradient algorithm for economic dispatch with valve-point effect. J. Comput. Appl. Math. 255, 848–866 (2014)
Boumal, N., Absil, P.-A.: RTRMC: a Riemannian trust-region method for low-rank matrix completion. In: Advances in Neural Information Processing Systems, vol. 24 (2011)
Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011)
Cai, J.-F., Liu, H., Wang, Y.: Fast rank-one alternating minimization algorithm for phase retrieval. J. Sci. Comput. 79, 128–147 (2019)
do Carmo, M.P.: Riemannian Geometry. Birkhäuser, Boston (1992)
Chen, S., Ma, S., Man-Cho So, A., Zhang, T.: Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 30, 210–239 (2020)
Chen, S., Ma, S., Xue, L., Zou, H.: An alternating manifold proximal gradient method for sparse principal component analysis and sparse canonical correlation analysis. INFORMS J. Optim. 2, 192–208 (2020)
Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _1\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38, B570–B592 (2016)
Chen, X., Guo, L., Lu, Z., Ye, J.J.: An augmented Lagrangian method for non-Lipschitz nonconvex programming. SIAM J. Numer. Anal. 55, 168–193 (2017)
Cho, M., Lee, J.: Riemannian approach to batch normalization. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Clarke, F.H.: Optimization and Nonsmooth Analysis. Classics in Applied Mathematics, vol. 5. SIAM, Philadelphia (1990)
Curtis, F.E., Jiang, H., Robinson, D.P.: An adaptive augmented Lagrangian method for large-scale constrained optimization. Math. Program. 152, 201–245 (2015)
Daniilidis, A., Deville, R., Durand-Cartagena, E., Rifford, L.: Self-contracted curves in Riemannian manifolds. J. Math. Anal. Appl. 457, 1333–1352 (2018)
de Oliveira, F.R., Ferreira, O.P.: Newton method for finding a singularity of a special class of locally Lipschitz continuous vector fields on Riemannian manifolds. J. Optim. Theory Appl. 185, 522–539 (2020)
de Oliveira, F.R., Oliveira, F.R.: A global Newton method for the nonsmooth vector fields on Riemannian manifolds. J. Optim. Theory Appl. 2, 1–15 (2021)
Deng, K., Peng, Z.: An inexact augmented Lagrangian method for nonsmooth optimization on Riemannian manifold, arXiv preprint arXiv:1911.09900 (2019)
Dirr, G., Helmke, U., Lageman, C.: Nonsmooth Riemannian optimization with applications to sphere packing and grasping. In: Lagrangian and Hamiltonian Methods for Nonlinear Control, Springer, vol. 2007, pp. 29–45 (2006)
Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, Berlin (2007)
Ferreira, O., Oliveira, P.: Proximal point algorithm on Riemannian manifolds. Optimization 51, 257–270 (2002)
Gabay, D.: Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl. 37, 177–219 (1982)
Ghahraei, E., Hosseini, S., Pouryayevali, M.R.: Pseudo-Jacobian and characterization of monotone vector fields on Riemannian manifolds. J. Convex Anal. 24, 149–168 (2017)
Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms, vol. 2. Springer, Berlin (1993)
Hiriart-Urruty, J.-B., Strodiot, J.-J., Nguyen, V.H.: Generalized Hessian matrix and second-order optimality conditions for problems with \(C^{1,1}\) data. Appl. Math. Optim. 11, 43–56 (1984)
Hosseini, S., Pouryayevali, M.: Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal. Theory Methods Appl. 74, 3884–3895 (2011)
Hu, J., Liu, X., Wen, Z.-W., Yuan, Y.-X.: A brief introduction to manifold optimization. J. Oper. Res. Soc. China 8, 199–248 (2020)
Hu, J., Milzarek, A., Wen, Z., Yuan, Y.: Adaptive quadratically regularized Newton method for Riemannian optimization. SIAM J. Matrix Anal. Appl. 39, 1181–1207 (2018)
Huang, W., Absil, P.-A., Gallivan, K.A.: A Riemannian symmetric rank-one trust-region method. Math. Program. 150, 179–216 (2015)
Huang, W., Wei, K.: Extending FISTA to Riemannian optimization for sparse PCA, arXiv preprint arXiv:1909.05485 (2019)
Huang, W., Wei, K.: Riemannian proximal gradient methods. Math. Program. 5, 1–43 (2021)
Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: A generic algorithm for non-smooth optimization on manifolds. In: European Conference on Computer Vision, Springer, pp. 680–696 (2016)
Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58, 431–449 (2014)
Lai, R., Wen, Z., Yin, W., Gu, X., Lui, L.M.: Folding-free global conformal mapping for genus-0 surfaces by harmonic energy minimization. J. Sci. Comput. 58, 705–725 (2014)
Lee, J.M.: Introduction to Smooth Manifolds. Springer, Berlin (2012)
Lee, J.M.: Introduction to Riemannian Manifolds. Springer, Berlin (2018)
Li, X., Sun, D., Toh, K.-C.: A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM J. Optim. 28, 433–458 (2018)
Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135, 149–193 (2012)
Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control. Optim. 15, 959–972 (1977)
Montanari, A., Richard, E.: Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Trans. Inf. Theory 62, 1458–1484 (2015)
Ozoliņš, V., Lai, R., Caflisch, R., Osher, S.: Compressed modes for variational problems in mathematics and physics. Proc. Natl. Acad. Sci. U.S.A. 110, 18368–18373 (2013)
Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)
Qi, L., Wei, Z.: On the constant positive linear dependence condition and its application to SQP methods. SIAM J. Optim. 10, 963–981 (2000)
Rampazzo, F., Sussmann, H.J.: Commutators of flow maps of nonsmooth vector fields. J. Differ. Equ. 232, 134–175 (2007)
Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1970)
Spivak, M.D.: A Comprehensive Introduction to Differential Geometry, vol. 2, 3rd edn. Publish or Perish, New York (1999)
Sun, D., Sun, J.: Semismooth matrix-valued functions. Math. Oper. Res. 27, 150–169 (2002)
Sun, D., Sun, J., Zhang, L.: The rate of convergence of the augmented Lagrangian method for nonlinear semidefinite programming. Math. Program. 114, 349–391 (2008)
Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23, 1214–1236 (2013)
Waldspurger, I., d’Aspremont, A., Mallat, S.: Phase recovery, maxcut and complex semidefinite programming. Math. Program. 149, 47–81 (2015)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)
Xiao, X., Li, Y., Wen, Z., Zhang, L.: A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput. 76, 364–389 (2018)
Yang, L., Sun, D., Toh, K.-C.: SDPNAL\(+\): A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Math. Program. Comput. 7, 331–366 (2015)
Yang, W.H., Zhang, L.-H., Song, R.: Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pac. J. Optim. 10, 415–434 (2014)
Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4592–4600 (2016)
Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory, pp. 1617–1638 (2016)
Zhao, X.-Y., Sun, D., Toh, K.-C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20, 1737–1765 (2010)
Zhu, H., Zhang, X., Chu, D., Liao, L.-Z.: Nonconvex and nonsmooth optimization with generalized orthogonality constraints: an approximate augmented Lagrangian method. J. Sci. Comput. 72, 331–372 (2017)
Zhu, X., Sato, H.: Riemannian conjugate gradient methods with inverse retraction. Comput. Optim. Appl. 77, 779–810 (2020)
Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15, 265–286 (2006)
Acknowledgements
The authors would like to thank the two anonymous referees and the associate editor for their valuable comments and constructive suggestions, which have greatly improved the quality and the presentation of this paper.
The work was supported in part by the National Natural Science Foundation of China (11901338, 61620106010) and the Tsinghua University Initiative Scientific Research Program. The research of the third author was supported in part by the National Natural Science Foundation of China (12071464) and the Beijing Natural Science Foundation (Z190002).
Appendices
Appendix A: The deferred proofs in Sects. 4.1 and 4.2
A.1 Proof of Lemma 4.2
Proof
From the compactness of U, there exist \(C, K, r_0 > 0\) such that Lemma 4.1 holds for every \(q \in U\). By shrinking the constant \(r_0\) if necessary, we may also assume that for every \(q \in U\), \(R_q v\) is defined for every \(v \in T_q\mathcal {M}\) with \(\Vert v \Vert < r_0\). For a fixed point \(q \in \mathcal {M}\), let \(\{ e_i \}_{i \in [n]}\) be an orthonormal basis of \(T_q \mathcal {M}\). We define the isomorphism \(\psi _q: \mathbb {R}^n \rightarrow T_q \mathcal {M}\) as \(\psi _q(w_1, \ldots , w_n) = \sum _{i=1}^n w_i e_i\), and the chart \(\varphi _q: B_{r_0}(q) \rightarrow \mathbb {R}^n\) as \(\varphi _q\left( \exp _q \psi _q(w) \right) = w\) for \(\Vert w\Vert _{\mathbb {R}^n} < r_0\) (see footnote 9), which implies that \((\psi _q \circ \varphi _q )(p) = \exp _q^{-1} p\) for every \(p \in B_{r_0}(q)\).
For \(\Vert v \Vert < r_0\), we define the curve \(\gamma (t) := R_q (tv)\), then
where \({\tilde{C}}_q := \sup _{\Vert v\Vert < r_0} \Vert \mathrm {d}R_q|_v \Vert \). Since \(R_q v\) is \(C^2\) with respect to q and v and U is compact, we know \({\tilde{C}} := \sup _{q \in U} {\tilde{C}}_q < \infty \). Thus, \(\exp ^{-1}_q R_q v\) is defined for \(q \in U\) and \(\Vert v \Vert < r := r_0 / {\tilde{C}}\).
Let \({\hat{B}}_{r} := \{ w \in \mathbb {R}^n : \Vert w \Vert _{\mathbb {R}^n} < r \}\), and \({\hat{R}}_q, {\hat{E}}_q: {\hat{B}}_{r} \rightarrow \mathbb {R}^n\) be the maps \(\varphi _q \circ R_q \circ \psi _q|_{{\hat{B}}_r}\), \(\varphi _q \circ \exp _q \circ \psi _q|_{{\hat{B}}_r}\), respectively. Since the differentials of \(\exp _q\) and \(R_q\) coincide at 0, the differentials of \({\hat{E}}_q\) and \({\hat{R}}_q\) also coincide at 0. Let \({\hat{D}}_q := {\hat{E}}_q - {\hat{R}}_q\); then Taylor’s theorem states that for any \(w \in {\hat{B}}_r\), there exists \(t_w \in (0, 1)\) such that \(\Vert {\hat{D}}_q(w) \Vert _{\mathbb {R}^n} = \frac{1}{2} \Vert \nabla ^2 {\hat{D}}_q(t_w w)[w, w] \Vert _{\mathbb {R}^n} \le {\hat{C}}_q \Vert w \Vert ^2_{\mathbb {R}^n}\), where \({\hat{C}}_q := \frac{1}{2}\sup _{ \Vert w \Vert _{\mathbb {R}^n} < r} \Vert \nabla ^2 {\hat{D}}_q(w) \Vert \).
Since \(\psi _q\) is a linear isomorphism, we have \((\psi _q \circ {\hat{D}}_q \circ \psi _q^{-1})(v) = (\psi _q \circ (\varphi _q \circ \exp _q - \varphi _q \circ R_q))(v) = (\psi _q \circ \varphi _q \circ \exp _q)(v) - (\psi _q \circ \varphi _q \circ R_q)(v) = v - \exp ^{-1}_q R_q v\). Therefore, by Lemma 4.1, we have
for every \(v \in T_q \mathcal {M}\) with \(\Vert v \Vert < r\), where the last inequality follows from (A.1) and \(\Vert \exp ^{-1}_q R_q v \Vert = d(R_q v, q)\). Finally, since \(R_q v\) and \(\exp _q v\) are both \(C^2\) with respect to q and v, \({\hat{C}}_q\) is uniformly bounded on the compact set U. Thus, the required constant C exists and is finite. \(\square \)
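The resulting second-order agreement \(d(R_q v, \exp _q v) \le C \Vert v \Vert ^2\) can be observed numerically in a simple setting; the sketch below uses the unit sphere with the projection retraction, an illustrative choice that is not tied to the retractions used in the paper. On this particular example the agreement is even of third order, so the printed ratio decays, which is still consistent with the quadratic bound.

```python
import numpy as np

# Toy comparison of the exponential map and the projection retraction on S^2.

def exp_map(q, v):
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return q
    return np.cos(nv) * q + np.sin(nv) * (v / nv)

def retraction(q, v):               # projection retraction on the sphere
    y = q + v
    return y / np.linalg.norm(y)

def dist(x, y):                     # geodesic distance on the unit sphere
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

q = np.array([0.0, 0.0, 1.0])
v = np.array([1.0, 0.5, 0.0])       # tangent vector at q
for t in (1e-1, 1e-2, 1e-3):
    gap = dist(exp_map(q, t * v), retraction(q, t * v))
    print(t, gap, gap / (t * np.linalg.norm(v)) ** 2)   # ratio stays bounded
```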
A.2 Proof of Lemma 4.4
Proof
Define \({\hat{\varphi }} = \varphi \circ \gamma \). We know that \(\hat{\varphi }\) is continuously differentiable with a Lipschitz continuous derivative on [0, 1]. By Theorem 2.3 in [37], there exist \(\xi \in (0, 1), {\hat{M}}_\xi \in \partial ^2 {\hat{\varphi }}(\xi )\) such that \({\hat{\varphi }}(1) - {\hat{\varphi }}(0) = {\hat{\varphi }}^\prime (0) + \frac{{\hat{M}}_\xi }{2}\). Note that due to \({\hat{\varphi }}^\prime (t) = \left\langle {{\,\mathrm{grad}\,}}\varphi (\gamma (t)), {\dot{\gamma }}(t) \right\rangle \) and \(\nabla _{\dot{\gamma }}{\dot{\gamma }} = 0\), we find \({\hat{\varphi }}^{\prime \prime }(t) = \left\langle \nabla _{{\dot{\gamma }}(t)} {{\,\mathrm{grad}\,}}\varphi (\gamma (t)), {\dot{\gamma }}(t) \right\rangle \) whenever \({\hat{\varphi }}^\prime \) is differentiable at t. From the linearity of the parallel transport, it suffices to consider the case \({\hat{M}}_\xi \in \partial ^2_B {\hat{\varphi }}(\xi )\), i.e., there exists \(\{ t_k \} \rightarrow \xi \) such that \({\hat{\varphi }}^{\prime \prime }\) exists at \(t_k\) and \({\hat{\varphi }}^{\prime \prime }(t_k) \rightarrow {\hat{M}}_\xi \). By the isometry property of the parallel transport, \(\hat{\varphi }^{\prime \prime }(t_k) = \langle \nabla _{ {\dot{\gamma }}(t_k)} {{\,\mathrm{grad}\,}}\varphi (\gamma (t_k)) , {\dot{\gamma }}(t_k) \rangle = \langle P_{\gamma }^{t_k \rightarrow \xi } \nabla _{P_\gamma ^{\xi \rightarrow t_k} \dot{\gamma }(\xi )} {{\,\mathrm{grad}\,}}\varphi (\gamma (t_k)) , {\dot{\gamma }}(\xi ) \rangle \), and hence there is \(M_\xi \in \partial {{\,\mathrm{grad}\,}}\varphi (\gamma (\xi ))\) such that \({\hat{M}}_\xi = \left\langle M_\xi {\dot{\gamma }}(\xi ), \dot{\gamma }(\xi ) \right\rangle \). Since \({\dot{\gamma }}(\xi ) = P_\gamma ^{0\rightarrow \xi }{\dot{\gamma }}(0)\), the conclusion holds. \(\square \)
A.3 Proof of Lemma 4.5
Proof
Suppose that the conclusion does not hold; then there exist \(\delta >0\) and a sequence \(v_k \in T_p\mathcal {M}\) with \(v_k \rightarrow 0\) such that
By Lemma 4.1 and the locally Lipschitz condition on X, there exists \(r_0>0\) such that for every \(p_1,p_2\in B_{r_0}(p)\) there exists a unique shortest geodesic joining \(p_1\) and \(p_2\), and X is L-Lipschitz in \(B_{r_0}(p)\). Let \(0<r<r_0\). By passing to a subsequence of \(\{v_k\}\), we may assume that \({\bar{v}}_k := \frac{rv_k}{\left\| v_k \right\| } \rightarrow v \in T_p\mathcal {M}\) with \(\left\| v \right\| = r\). We define \(t_k = \left\| v_k \right\| /r\), \(q_k = \exp _p v_k\), \(r_k = \exp _p \bar{v}_k\) and \(s_k = \exp _p t_k v\); then it holds that \(\{q_k,r_k,s_k\}\subset B_{r_0}(p)\), and hence we have
From Lemma 10 in [41], there exists \(C>0\) such that
since \(\Vert X(q_k)\Vert \le L \Vert v_k\Vert \rightarrow 0\) as \(v_k\rightarrow 0\). Moreover, since X is locally Lipschitz and directionally differentiable at p, using Lemma 4.1, as \(k\rightarrow \infty \) we have
where the first estimate follows from Definition 2.10 and \(\ell (\gamma ) = d(q_k, s_k)\). Let \(q_k^t = \exp _p t{\bar{v}}_k\) and \(q_v^t = \exp _p tv\). From Lemma 10 in [41] and Lemma 4.1, as \(t \rightarrow 0\) we have
where the constants in the big-O notation depend on p, v and \(\mathcal {M}\). An intermediate result of the above estimate is
Thus, the last term in (A.2) is
Combining (A.3), (A.4) and (A.6), we find that \(\Vert P_{q_k,p}X(q_k) - \nabla X(p;v_k)\Vert = o(t_k) = o(\Vert v_k\Vert )\), which contradicts our assumption. \(\square \)
A.4 Proof of Lemma 4.6
Proof
Define \(q_{k + 1} := R_{p_k} V_k\). Let \(B_r(p), K\) be the quantities in Lemma 4.1. Without loss of generality, we may assume \(p_k \in B_r(p)\) and \(\Vert V_k\Vert < r\) for every k, and shrink the ball \(B_r(p)\) such that (4.10) holds in \(B_r(p)\). Using Lemma 4.1, we know the following inequality holds for every geodesic triangle in \(B_r(p)\) whose edges are a, b, c:
where A is the angle between b and c. Let \(a = d(p_k, q_{k+1})\), \(b = d(p_k, p)\), \(c = d(q_{k+1}, p)\), then we have
Define \(a = d(p_{k}, p)\), \(b = d(p_k, q_{k+1})\), \(c = d(q_{k + 1}, p)\). Noting that \(d(q_{k+1},p) = o(d(p_k, p))\), for any \(\varepsilon > 0\) there exists \(k_\varepsilon > 0\) such that for any \(k \ge k_\varepsilon \), it holds that \(d(q_{k + 1}, p)^2 \le \varepsilon d(p_k, p)^2 \le 2\varepsilon d(q_{k + 1}, p)^2 + 2\varepsilon d(p_{k}, q_{k+1})^2 + \varepsilon K d(q_{k+1}, p)^2 d(p_k, q_{k+1})^2\). From (A.7) and the fact that \(p_k \rightarrow p\) as \(k \rightarrow \infty \), there exists \(K_0 > 0\) such that \(K d(p_k, q_{k+1})^2 \le 2K d(p_k, p)^2 \le 2\) for \(k \ge K_0\). Thus, for all \(\varepsilon \in (0, 1/8)\) and \(k \ge \max \{ k_\varepsilon , K_0 \}\), it holds that
i.e., \(d(q_{k+1}, p)^2 = o( d(p_k, q_{k+1})^2)\). From (4.10), we have
and hence it holds \(d(q_{k+1},p) = o( d(p_k, q_{k+1})) = o( \Vert V_k \Vert )\).
On the other hand,
Combining with (A.7), we have \({\mathop {\lim }\limits _{k \rightarrow \infty }} \frac{d(p_k, p)}{d(p_k, q_{k+1})} = 1\). \(\square \)
Appendix B: The deferred proofs in Sect. 4.3
B.1 Proof of Theorem 4.4
We prove Theorem 4.4 in this section. First, we give a simple result that will be frequently used in our proof: for \(p \in {\bar{\mathcal {M}}}\), \(\xi \in T_p {\bar{\mathcal {M}}}\), and a curve \(\gamma : [0, 1] \rightarrow {\bar{\mathcal {M}}}\) with \(\gamma (0) = p\), the vector field \(Z(t) := {\bar{P}}_\gamma ^{0 \rightarrow t} \xi \) is parallel along \(\gamma \), and hence by the definition of parallel transport we have
Similarly, for \(p \in \mathcal {M}, \zeta \in T_p \mathcal {M}\) and \(\gamma : [0, 1] \rightarrow \mathcal {M}\), it holds that \(\nabla _{{\dot{\gamma }}} P^{0 \rightarrow t}_\gamma \zeta = 0\).
The following lemma compares the parallel transports of the manifold \(\mathcal {M}\) and its ambient space \({\bar{\mathcal {M}}}\).
Lemma B.1
There exists \(C > 0\) such that for every \(p \in \mathcal {M}\), and every smooth curve \(\gamma : [-1, 1] \rightarrow \mathcal {M}\) with \(\gamma (0) = p\), and every \(v \in T_p \mathcal {M}\), \(t \in [0, 1]\), it holds that
Moreover, there exists \(C^\prime > 0\) such that for every \(p \in \mathcal {M}\), \(w \in T_p \mathcal {M}\) with \(\Vert w\Vert \le 2\), and every \(t \in [0, 1]\), the following inequality holds for \(\gamma (t) := \exp _p(tw)\).
Proof
Let \(Y: [0, 1] \rightarrow T\mathcal {M}\) be a smooth vector field along \(\gamma \). For any fixed \(t \in [0, 1]\) and \(\xi \in T_{\gamma (t)} {\bar{\mathcal {M}}}\), it holds that
Therefore, \(\frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}_\gamma ^{s \rightarrow t}Y(s) = \bar{P}_{\gamma }^{s \rightarrow t} {\bar{\nabla }}_{{\dot{\gamma }}} Y(s)\).
Setting \(Y(t) = P_\gamma ^{0 \rightarrow t} v\) for \(v \in T_p \mathcal {M}\), we know
and thus,
Since \(\mathrm {II}\) smoothly depends on the Riemannian metric of \({\bar{\mathcal {M}}}\) and \(\mathcal {M}\) is compact, the constant \(C := \sup \{ \Vert \mathrm {II}(u, v) \Vert : p \in \mathcal {M}, u, v \in T_p\mathcal {M}, \Vert u\Vert = \Vert v\Vert = 1 \}\) is finite. As the parallel transport is an isometry, it holds that \(\Vert Y(t) \Vert = \Vert Y(0)\Vert = \Vert v\Vert \) and
Moreover, we have
When \(\gamma (t) := \exp _p (tw)\) for \(w \in T_p \mathcal {M}\) with \(\Vert w \Vert \le 2\), the above integrand is continuous with respect to v, w, t, u, p, and thus, the constant \(C^\prime := 2\sup \{ \Vert {\bar{\nabla }}_{{\dot{\gamma }}} \mathrm {II}({\dot{\gamma }}, P_\gamma ^{0 \rightarrow t} v) \Vert : p \in \mathcal {M}\text { and } w, v \in T_p\mathcal {M}, \Vert w \Vert \le 2, \Vert v\Vert \le 2, t \in [0, 1] \}\) is finite. Therefore, (B.3) holds. \(\square \)
The following lemma controls the error of parallelly transporting a vector along the shortest geodesic on \(\mathcal {M}\) and then moving back along the shortest geodesic on \({\bar{\mathcal {M}}}\).
Lemma B.2
Fix \(p \in \mathcal {M}\), then there exists \(C > 0\) such that for every \(v \in T_p \mathcal {M}\) with \(\Vert v \Vert \le 2\), \(t \in [0, 1]\) and \(w, \xi \in T_p {\bar{\mathcal {M}}}\), the following inequality holds
where \(\gamma (t) := \exp _p (tv)\). Consequently,
Proof
Let \(w, \xi \in T_p {\bar{\mathcal {M}}}\) and \(v \in T_p \mathcal {M}\) with \(\Vert v\Vert \le 2\) and \(\gamma (t) := \exp _p(tv)\), then
Let \(Y(q) := {\bar{P}}_{pq} \xi \) for \(q \in {\bar{\mathcal {M}}}\), then \(Y(\gamma (t)) = {\bar{P}}_{p, \gamma (t)} \xi \). From [19, Proposition 2.2(c)], we know \({\bar{\nabla }}_{{\dot{\gamma }}} Y = {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \). By the definition of Y, \({\bar{\nabla }}_v Y(p) = 0\) for every \(v \in T_p {\bar{\mathcal {M}}}\). Therefore, the right-hand side of (B.7) vanishes at \(t = 0\). Since \(\bar{\nabla }_{{\dot{\gamma }}}{\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \) is continuous with respect to t, v and is linear with respect to \(\xi \), then
Applying Taylor’s theorem at \(t = 0\), we know for every \(t \in [0, 1]\) and \(w, \xi \in T_p {\bar{\mathcal {M}}}\),
Since C is independent of v, the above holds for every \(\Vert v \Vert \le 2\). Finally, (B.6) follows from
This completes the proof. \(\square \)
The lemma below bounds the difference between the inverse exponential maps of \(\mathcal {M}\) and \({\bar{\mathcal {M}}}\); it is proved by extending \(\exp _p\) to a retraction on \({\bar{\mathcal {M}}}\) using Fermi coordinates [48, p. 135] and applying [71, Lemma 3] to bound the difference of two inverse retractions.
Lemma B.3
Fix \(p_0 \in \mathcal {M}\), there exist \(C, r > 0\) such that the following inequality holds for every \(p, q \in B_{r}(p_0)\).
Proof
From Lemma 4.1, there exists \(r > 0\) such that \(\exp _p\) is a diffeomorphism from \(\{ v \in T_p \mathcal {M}: \Vert v \Vert < 4r \}\) to \(B_{4r}(p)\) for every \(p \in B_{4r}(p_0)\). According to Theorem 5.25 in [48], there exists \(r^\prime > 0\) such that \({\overline{\exp }}\) is a diffeomorphism from \(V := \{ (p, v) \in N\mathcal {M}: p \in B_{2r}(p_0), \Vert v \Vert < r^\prime \}\) to its image, where \(N\mathcal {M}:= \bigcup _{p \in \mathcal {M}} (T_p \mathcal {M})^{\perp }\) is the normal bundle. Let \(E_1, \ldots , E_{d}\) be vector fields on \(B_{4r}(p_0)\) such that \(\{ E_1(p), \ldots , E_{n}(p) \}\) and \(\{ E_{n+1}(p), \ldots , E_{d}(p) \}\) are orthonormal bases of \(T_p\mathcal {M}\) and \((T_p\mathcal {M})^\perp \), respectively. For \(p \in B_r(p_0)\) and \(v := \sum _{i=1}^d w_i E_i(p) \in T_p {\bar{\mathcal {M}}}\) with \(\Vert v_\top \Vert < r\) and \(\Vert v_\perp \Vert < r^\prime \), we define \(q(p, v) := \exp _p (v_\top )\) and
The above discussion shows that E is well-defined, and \(E(p, v) = \exp _p v\) if \(v \in T_p \mathcal {M}\) and \(\Vert v\Vert < r\), and \(E(p, v) = {\overline{\exp }}_p v\) if \(v \in (T_p \mathcal {M})^\perp \) and \(\Vert v\Vert < r^\prime \). Thus, E can be regarded as a retraction on \({\bar{\mathcal {M}}}\) defined near \(p_0\), and (B.9) follows from [71, Lemma 3]. \(\square \)
Proof of Theorem 4.4
(i) First, we verify the local Lipschitz property of X. Since \({\bar{X}}\) is locally Lipschitz at \(p \in \mathcal {M}\), there exist \(L_p > 0\) and a neighborhood \({\bar{U}}_p \subset {\bar{\mathcal {M}}}\) of p such that \({\bar{X}}\) is \(L_p\)-Lipschitz in \({\bar{U}}_p\). By Lemma B.1, we know that for every \(p, q \in \mathcal {M}\cap {\bar{U}}_p\) and every geodesic \(\gamma \) joining p and q, it holds that
where C is the constant in Lemma B.1. Since \(\mathcal {M}\) is compact and X is continuous, we have \(M := \sup _{p \in \mathcal {M}} \Vert X(p) \Vert < \infty \), and hence X is Lipschitz on \(\mathcal {M}\cap {\bar{U}}_p\), i.e., X is locally Lipschitz at p.
(ii) Next, we show that X is directionally differentiable at p in the Hadamard sense. Fix \(v \in T_p \mathcal {M}\) with \(\Vert v \Vert = 1\) and let \(\gamma (t) := \exp _p(tv^\prime )\) for \(t \in (-1, 1)\) and \(v^\prime \in T_p \mathcal {M}\) with \(\Vert v^\prime \Vert \le 2\), and decompose \(P_\gamma ^{t \rightarrow 0} X(\exp _p(tv^\prime )) - X(p) \) into
where \(S(t) := \mathrm {II}(-{\dot{\gamma }}(0), P_\gamma ^{t \rightarrow 0} X(\gamma (t)))\).
From Lemma B.1 and B.2, there exists \(C > 0\) (independent of \(v^\prime \)) such that
When \(t \rightarrow 0^{+}\) and \(v^\prime \rightarrow v\), we know \(|(\mathrm A)| + |(\mathrm B)| = O(t^2)\) and \(v^\prime _t := t^{-1}\overline{\exp }^{-1}_p \gamma (t) \rightarrow v\), and hence the continuity of S(t) and the Hadamard differentiability (4.24) imply that
Since \(\mathrm {II}(v, X(p)) \in (T_p\mathcal {M})^\perp \), we find \(\nabla X(p; v) = {\bar{\nabla }} X(p; v) - \mathrm {II}(v, X(p)) = ({\bar{\nabla }} X(p; v))_\top \) for every \(\Vert v\Vert = 1\). Since \(\nabla X(p; tv) = t\nabla X(p; v)\) for all \(t > 0\), we know X is directionally differentiable at p in the Hadamard sense.
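The relation \(\nabla X(p; v) = ({\bar{\nabla }} X(p; v))_\top \) can be sanity-checked numerically for a sphere embedded in Euclidean space; in the sketch below the vector field, the base point and the finite-difference scheme are toy assumptions made only for illustration and are not taken from the paper.

```python
import numpy as np

# Toy check on S^2 in R^3: the Riemannian directional derivative of a tangent
# vector field X(x) = b - (x.b) x equals the tangential part of the ambient
# directional derivative (the normal part is the second fundamental form term).

def proj(x, u):                      # projection onto the tangent space at x
    return u - (x @ u) * x

def exp_map(x, v):
    nv = np.linalg.norm(v)
    return x if nv < 1e-16 else np.cos(nv) * x + np.sin(nv) * (v / nv)

b = np.array([0.3, -1.2, 0.7])
X = lambda x: b - (x @ b) * x        # tangential projection of the constant field b

p = np.array([0.0, 0.0, 1.0])
v = proj(p, np.array([1.0, -0.5, 0.2]))

t = 1e-6
q = exp_map(p, t * v)
ambient_dd = (X(q) - X(p)) / t       # ambient difference quotient (O(t) error)
riemannian_dd = proj(p, ambient_dd)  # tangential part, approximates nabla_v X(p)

print(riemannian_dd, -(p @ b) * v)   # the two agree up to O(t)
```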
(iii.b) It remains to verify the inequality (4.27). Suppose first that the assumption (iii.b) holds. Since the parallel transport is an isometry, it follows from (4.26) that there exist \(C > 0, \delta \in (0, 1)\) such that for every \(q \in \mathcal {M}\) with \(d_{{\bar{\mathcal {M}}}}(p, q) < \delta \) and \({\bar{H}}_q \in {\bar{\mathcal {K}}}(q)\), it holds that
Without loss of generality, we may assume that \(B_\delta (p) \subset V_p\), where \(V_p\) is the neighborhood of p in the assumption (iii.b). For every \(q \in \mathcal {M}\) and \(H_q \in \mathcal {K}(q)\) such that \(d_\mathcal {M}(q, p) < \delta \), the assumption (iii.b) states that there exists \({\bar{H}}_q \in {\bar{\mathcal {K}}}(q)\) satisfying \(({\bar{H}}_q \exp ^{-1}_qp)_\perp = \mathrm {II}(\exp ^{-1}_qp, X(q))\) and \(({\bar{H}}_q \exp ^{-1}_qp)_\top = H_q \exp ^{-1}_qp\). Let \(\alpha = d_\mathcal {M}(p, q)\), \(\xi := \alpha ^{-1} \exp ^{-1}_p q\) and \(\gamma (t) := \exp _p( t \xi )\); then it holds that
Let \(r, {\tilde{C}} > 0\) be constants such that Lemmas B.1, B.2 and B.3 hold with \(p_0 = p\), and let \({\tilde{L}}_p > 0\) be the Lipschitz constant of X at p; by the compactness of \(\mathcal {M}\) and the continuity of \(\mathrm {II}(v, w)\), we may further assume that \(\sup \{ \Vert \mathrm {II}(v, w) \Vert : p \in \mathcal {M}, v, w \in T_p\mathcal {M}, \Vert v\Vert =\Vert w\Vert =1 \}< {\tilde{C}} < \infty \). Noting that \({\dot{\gamma }}(\alpha ) = -\alpha ^{-1} \exp ^{-1}_q p\) and \(\mathrm {II}(\exp ^{-1}_qp, X(q)) = -\alpha \mathrm {II}({\dot{\gamma }}(\alpha ), X(q))\), we have
where the last inequality follows from the Lipschitzness of X at p. From (B.6) we find that \(|(\mathrm B)| \le {\tilde{C}}M\alpha ^2\). Since \(\mathcal {K}\) is locally bounded, there exists \({\tilde{M}} > 0\) such that all elements in \(\bigcup _{d(p, q) \le \delta } {\bar{\mathcal {K}}}(q)\) are bounded by \(\tilde{M}\), and thus (B.9) yields \(|(\mathrm D)| \le {\tilde{C}} {\tilde{M}} \alpha ^2\). Combining with (B.11) and letting \(C_0 := {\tilde{C}}(M + {\tilde{M}} + {\tilde{L}}_p + 1)\), it holds that
Thus, the inequality (4.27) holds for \(q \in \mathcal {M}\) with \(d_\mathcal {M}(p, q) < {\tilde{\delta }} := \min \{ 1, r, \delta \}\) and \({\hat{C}} := C + C_0\). Moreover, when \(\mu \in [0, 1)\), the same inequality holds for \(q \in \mathcal {M}\) with \(d_\mathcal {M}(p, q) < {\tilde{\delta }} := \min \{ r, \delta , (C/C_0)^{\frac{1}{1 - \mu }} \}\) and \({\hat{C}} := 2C\).
(iii.a) When \(X(p) = 0\), we know
The above two terms can be controlled by (4.26) and (B.9), respectively, and therefore the conclusion follows from a similar discussion as in (iii.b).
(iv) When \(\mu = 0\), we know that for every \(C > 0\), if (4.26) holds, then (4.27) holds with \({\hat{C}} = 2C\), which implies the semismoothness in Definition 4.1. The conclusion for the semismoothness with order \(\mu \in (0, 1]\) directly follows from (i), (ii) and (iii). \(\square \)
B.2 Proof of Proposition 4.2
First, we briefly review how to represent quantities on manifolds in local coordinates. Let \((U, \varphi )\) be a chart near \(p \in \mathcal {M}\), and let \(\{ e_i \}\) be the standard basis of \(\mathbb {R}^n\), i.e., the j-th component of \(e_i\) is \(\delta _{ij}\). Let \(E_i := (\mathrm {d}\varphi )^{-1}[e_i]\); then \(E_i\) is a smooth vector field near p and \(\{ E_i(q) \}\) is a basis of \(T_q\mathcal {M}\) for every \(q \in U\). Let \(g_{ij} := \left\langle E_i, E_j \right\rangle \), \(G := (g_{ij})_{i,j\in [n]}\), let \(\varGamma _{ij}^k\) be the Christoffel symbols such that \(\nabla _{E_i}{E_j} = \sum _{k=1}^n \varGamma _{ij}^k E_k\), and let \(g^{ij}\) be such that \(( g^{ij} )_{i,j\in [n]} = G^{-1}\); we call G the metric matrix. The gradient of a smooth function \(f:\mathcal {M}\rightarrow \mathbb {R}\) is \({{\,\mathrm{grad}\,}}f = \sum _{i, j=1}^n (g^{ij} E_i f)E_j\) (see [48, p. 27]). Two smooth vector fields X, Y defined near p can be represented as \(X = \sum _{i=1}^n X^i E_i\) and \(Y = \sum _{i=1}^n Y^i E_i\), and their inner product is \(\left\langle X, Y \right\rangle = \sum _{i,j=1}^n g_{ij} X^i Y^j\). The covariant derivative \(\nabla _XY\) can be written as \(\nabla _XY = \sum _{k=1}^n \bigl ( XY^k + \sum _{i,j=1}^n \varGamma _{ij}^k X^i Y^j \bigr ) E_k\) (see [48, p. 92]).
When \((U, \varphi )\) is the chart constructed in the proof of Lemma 4.2, \(\varphi ^{-1}\) is the normal coordinate at \(p \in \mathcal {M}\) (see [48, p. 132]). Proposition 5.24 in [48] shows that \(g_{ij}(p) = \delta _{ij}\), \(\varGamma _{ij}^k(p) = 0\), and all partial derivatives of \(g_{ij}\) vanish at p. Thus, \({{\,\mathrm{grad}\,}}f(p) = \sum _{i=1}^n (E_i f)(p) E_i(p)\), \(\nabla _XY(p) = \sum _{i=1}^n (X Y^i)(p) E_i(p)\) and \(\left\langle X(p), Y(p) \right\rangle = \sum _{i=1}^n X^i(p) Y^i(p)\).
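As a concrete example of these identities (a standard computation, not taken from the paper), consider the unit sphere \(\mathbb {S}^2\), whose round metric \(\mathrm {d}r^2 + \sin ^2 r \, \mathrm {d}\theta ^2\) in Cartesian normal coordinates \(w = (w_1, w_2)\) centered at p expands as

```latex
g_{ij}(w) \;=\; \delta_{ij} \;-\; \tfrac{1}{3}\bigl(\Vert w\Vert^2\,\delta_{ij} - w_i w_j\bigr) \;+\; O(\Vert w\Vert^4),
```

so that \(g_{ij}(0) = \delta _{ij}\), all first-order partial derivatives of \(g_{ij}\) vanish at \(w = 0\), and hence \(\varGamma _{ij}^k(0) = 0\), in accordance with Proposition 5.24 in [48].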
The following two lemmas are some chain rules on manifolds, which are proved by pulling functions on manifolds back to Euclidean spaces and then applying the Euclidean chain rules.
Lemma B.4
Let \(\mathcal {M}, \mathcal {N}\) be two Riemannian manifolds, and \(F: \mathcal {M}\rightarrow \mathcal {N}\) be a smooth map, \(g: \mathcal {N}\rightarrow \mathbb {R}\) be a map that is locally Lipschitz at \(F(p) \in \mathcal {N}\), then the following chain rule holds at \(p \in \mathcal {M}\)
where the inclusion means that for every \(\xi \in \partial (g \circ F)(p)\), there exists \(\zeta \in \partial g(F(p))\) such that \(\left\langle \xi , v \right\rangle = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \) for every \(v \in T_p\mathcal {M}\).
Proof
Let \((U, \varphi )\) and \((V, \psi )\) be charts near \(p \in \mathcal {M}\) and \(F(p) \in \mathcal {N}\) with \(\varphi (p) = 0\) and \(\psi (F(p)) = 0\), respectively. We denote by \(G_\mathcal {N}, G_\mathcal {M}\) the metric matrices on \(\mathcal {N}, \mathcal {M}\) defined above, respectively. Define \({\hat{g}} := g \circ \psi ^{-1}\) and \({\hat{F}} := \psi \circ F \circ \varphi ^{-1}\). Then, Theorem 2.3.10 in [25] yields that
Fix \(\xi \in \partial (g \circ F)(p)\), then from (2.2) we know \({\hat{\xi }} := G_\mathcal {M}\mathrm {d}\varphi |_p[\xi ] \in \partial ({\hat{g}} \circ {\hat{F}})(0)\). The above display shows that there exists \({\hat{\zeta }} \in \partial {\hat{g}}({\hat{F}}(0))\) such that for every \({\hat{v}} \in \mathbb {R}^n\), it holds that \( \hat{\xi }^\top {\hat{v}} = {\hat{\zeta }}^\top (\mathrm {d}{\hat{F}}(0)[{\hat{v}}]) \). Let \(v := (\mathrm {d}\varphi |_p)^{-1}[{\hat{v}}]\), then \({\hat{\xi }}^\top {\hat{v}} = (\mathrm {d}\varphi |_p[\xi ])^\top G_\mathcal {M}(\mathrm {d}\varphi |_p[v]) = \left\langle \xi , v \right\rangle \). Since \(\zeta := (\mathrm {d}\psi |_{F(p)})^{-1} G_\mathcal {N}^{-1} {\hat{\zeta }} \in \partial g(F(p))\) and \(\mathrm {d}{\hat{F}} = \mathrm {d}\psi \circ \mathrm {d}F \circ (\mathrm {d}\varphi )^{-1}\), we find that \({\hat{\zeta }}^\top (\mathrm {d}{\hat{F}}(0)[{\hat{v}}]) = (\mathrm {d}\psi |_{F(p)}[\zeta ])^\top G_\mathcal {N}( \mathrm {d}\psi |_{F(p)} [\mathrm {d}F|_p[v]]) = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \). Therefore, \(\left\langle \xi , v \right\rangle = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \) for every \(v \in T_p \mathcal {M}\), and hence \(\xi \in \partial g(F(p)) \mathrm {d}F(p)\). \(\square \)
Lemma B.5
Let \(\mathcal {M}\) be a Riemannian manifold, X be a vector field on \(\mathcal {M}\) that is locally Lipschitz at \(p \in \mathcal {M}\), and Y be a smooth vector field on \(\mathcal {M}\). Define \(f := \left\langle X, Y \right\rangle \), then the following chain rule holds.
where \(\xi \in \partial f(p)\) is regarded as the operator \(\zeta \mapsto \left\langle \xi , \zeta \right\rangle \) for \(\zeta \in T_p\mathcal {M}\), and the right-hand side is understood as the set of operators \(\zeta \mapsto \left\langle H\zeta , Y(p) \right\rangle + \left\langle X(p), \nabla _\zeta Y(p) \right\rangle \) for \(\zeta \in T_p\mathcal {M}\) and \(H \in \partial X(p)\).
Proof
Let \((U_p, \varphi )\) be a chart such that \(\varphi ^{-1}\) is the normal coordinate. Writing \(X = \sum _{i=1}^n X^i E_i\) and \(Y = \sum _{i=1}^n Y^i E_i\), we define \({\hat{f}} := f \circ \varphi ^{-1}\), \({\hat{X}}^j := X^j \circ \varphi ^{-1}\), \({\hat{X}} := ({\hat{X}}^1, \ldots , {\hat{X}}^n)\), and \({\hat{Y}}\) likewise. Thus, \({\hat{f}}(x) = {\hat{X}}(x)^\top G(x) {\hat{Y}}(x)\) for \(x \in {\hat{U}}_p := \varphi (U_p)\), and \({\hat{f}}\) can be regarded as the composition \(\mathcal {A}\circ {\tilde{X}}\), where \(\mathcal {A}(x, v) := v^\top G(x) {\hat{Y}}(x)\) and \({\tilde{X}}(x) := (x, {\hat{X}}(x))\) for \(x \in {\hat{U}}_p, v \in \mathbb {R}^n\). Since \(\mathcal {A}\) is smooth and \(G(0) = I_n, \nabla G(0) = 0\), we know \(\nabla \mathcal {A}(0, v)[w, \xi ] = v^\top \nabla {\hat{Y}}(0)[w] + \xi ^\top {\hat{Y}}(0)\) for \(w, \xi \in \mathbb {R}^n\). Since \(\partial {\tilde{X}} = (\mathrm {Id}, \partial {\hat{X}})\), the chain rule in Euclidean spaces [25, Theorem 2.6.6] implies that
We define the linear bijection \(J: \mathcal {L}(\mathbb {R}^n) \rightarrow \mathcal {L}(T_p\mathcal {M})\) as
where \({\hat{H}} \in \mathcal {L}(\mathbb {R}^n)\), \(v \in T_p \mathcal {M}\) and \({\hat{v}} := \mathrm {d}\varphi |_p[v]\). Below we show that \(J(\partial {\hat{X}}(0)) = \partial X(p)\).
Suppose \({\hat{H}} \in \partial _B {\hat{X}}(0)\); then there exists a sequence \(x_k \rightarrow 0\) such that \({\hat{X}}\) is differentiable at \(x_k\) and \(\nabla {\hat{X}}(x_k) \rightarrow {\hat{H}}\) as \(k\rightarrow \infty \). Let \(H := J{\hat{H}}\), \({\hat{v}} \in \mathbb {R}^n\), \(v := (\mathrm {d}\varphi |_p)^{-1}[{\hat{v}}]\), \(p_k := \varphi ^{-1}(x_k)\) and \(v_k := (\mathrm {d}\varphi |_{p_k})^{-1}[{\hat{v}}]\); then \(\nabla {\hat{X}}^i(x_k)[{\hat{v}}] = \mathrm {d}{\hat{X}}^i|_{x_k}[{\hat{v}}] = \mathrm {d}X^i|_{p_k}[v_k] = v_kX^i(p_k)\), and therefore
Letting \(k \rightarrow \infty \) and noting \(\varGamma _{ij}^k(p) = 0\), we find \(\nabla _{v_k} X(p_k) \rightarrow \sum _{i=1}^n [{\hat{H}} {\hat{v}}]_i E_i(p) = Hv\). Since \({\hat{v}}\) is arbitrary, we conclude that \(\nabla X(p_k) \rightarrow H \in \partial _B X(p)\). By reversing this argument, we can show that \(J^{-1}H \in \partial {\hat{X}}(0)\) for every \(H \in \partial X(p)\). Then, \(J(\partial _B {\hat{X}}(0)) = \partial _B X(p)\), and by the linearity of J, we know \(J(\partial {\hat{X}}(0)) = \partial X(p)\).
Thus, \({\hat{Y}}(0)^\top ({\hat{H}}{\hat{v}}) = \sum _{i=1}^n Y^i(p) \langle J{\hat{H}}v, E_i(p) \rangle = \langle J{\hat{H}}v, Y(p) \rangle \) for every \({\hat{H}} \in \partial {\hat{X}}(0)\), \(v \in T_p \mathcal {M}\) and \({\hat{v}} := \mathrm {d}\varphi |_p[v]\). Similarly, it holds that \({\hat{X}}(0)^\top \nabla {\hat{Y}}(0)[ {\hat{v}}] = \left\langle X(p), \nabla Y(p)[v] \right\rangle \). Besides, (2.2) yields \(\partial {\hat{f}}(0)[{\hat{v}}] = \partial f(p)[v]\). Therefore, the proof is completed by plugging these equations into (B.15). \(\square \)
Proof of Proposition 4.2
Since Theorem 4.4 shows that X is also locally Lipschitz at p, \(\partial X(p)\) is well-defined. Let \(U_p \subset \mathcal {M}\) be a neighborhood of p, \(\xi \in T_p\mathcal {M}\), and let V be a smooth vector field on \({\bar{\mathcal {M}}}\) such that \(V_\top (q) = P_{pq}\xi \) for \(q \in U_p\), and \(\iota : \mathcal {M}\hookrightarrow \bar{\mathcal {M}}\) be the embedding map. Define \({\bar{f}}_V := \left\langle {\bar{X}}, V \right\rangle \) and \(f_V := {\bar{f}}_V \circ \iota \); then Lemma B.4 shows that \(\partial f_V(p) \subseteq {\bar{\partial }} {\bar{f}}_V(p) \mathrm {d}\iota (p) = \bar{\partial } {\bar{f}}_V(p)|_{T_p \mathcal {M}}\).
Let \(v \in T_p \mathcal {M}\), then Lemma B.5 gives that \({\bar{\partial }} {\bar{f}}_V(p)[v] = \left\langle {\bar{\partial }} \bar{X}(p)[v], V(p) \right\rangle + \left\langle {\bar{X}}(p), {\bar{\nabla }}_v V(p) \right\rangle \). Since \(v, {\bar{X}}(p) \in T_p \mathcal {M}\), Proposition 8.4 in [48] shows that
Since \(\nabla _v V_\top (p) = \nabla _v (P_{pq}\xi )(p) = 0\), we have \(\left\langle {\bar{X}}(p), {\bar{\nabla }}_v V_\top (p) \right\rangle = \left\langle X(p), \nabla _v V_\top (p) \right\rangle = 0\), and thus,
where the last equation follows from \(\left\langle \mathrm {II}(v, X(p)), V_\top (p) \right\rangle = 0\).
Combining these arguments, we know \(\partial f_V(p)[v]\! \subseteq \! \langle {\bar{\partial }} {\bar{X}}(p)[v] \!-\! \mathrm {II}(v, X(p)), V(p) \rangle \). Since \({\bar{X}}(q) \in T_q \mathcal {M}\) for \(q \in U_p\), we have \(f_V(q) = \left\langle X(q), V_\top (q) \right\rangle \) for \(q \in U_p\). Since \(\nabla _v V_\top (p) = 0\) and \(\partial X(p)[v] \subset T_p\mathcal {M}\), Lemma B.5 yields \(\partial f_V(p)[v] = \left\langle \partial X(p)[v], V_\top (p) \right\rangle = \left\langle \partial X(p)[v], V(p) \right\rangle \). Therefore, we know
Since \(V(p)_\top = \xi \) and \(V(p)_\perp \) are arbitrary, and \(\partial X(p)[v], {\bar{\partial }} {\bar{X}}(p)[v]\) are compact convex sets, the inclusion relationship (4.28) follows from [57, Corollary 13.1.1]. \(\square \)
Keywords
- Nonsmooth manifold optimization
- Semismooth Newton method
- Augmented Lagrangian method
- Riemannian manifold