
A semismooth Newton based augmented Lagrangian method for nonsmooth optimization on matrix manifolds


Abstract

This paper is devoted to studying an augmented Lagrangian method for solving a class of manifold optimization problems, which have nonsmooth objective functions and nonlinear constraints. Under the constant positive linear dependence condition on manifolds, we show that the proposed method converges to a stationary point of the nonsmooth manifold optimization problem. Moreover, we propose a globalized semismooth Newton method to solve the augmented Lagrangian subproblem on manifolds efficiently. The local superlinear convergence of the manifold semismooth Newton method is also established under some suitable conditions. We also prove that the semismoothness on submanifolds can be inherited from that in the ambient manifold. Finally, numerical experiments on compressed modes and (constrained) sparse principal component analysis illustrate the advantages of the proposed method.


Notes

  1. The subscript p in \(\left\langle \cdot , \cdot \right\rangle _p\) is usually omitted for simplicity.

  2. An affine connection is \(\mathcal {D}(\mathcal {M})\)-linear w.r.t. X, \(\mathbb {R}\)-linear w.r.t. Y, and satisfies the product rule; see Chapter 5 in [2].

  3. We say a function f on a manifold is locally Lipschitz if \(f \circ \varphi ^{-1}\) is locally Lipschitz on \(\varphi (U)\) for every chart \((U, \varphi )\).

  4. This is well-defined in a small neighborhood of q [19, Proposition 3.2.9].

  5. Let \(g: \mathcal {M}\rightarrow \mathbb {R}^m\) and \(f: \mathbb {R}^m \rightarrow \mathbb {R}\) be two smooth maps. We can use Definition 2.4 to compute the gradient of \(f \circ g\). For \(p \in \mathcal {M}\) and \(\xi _p \in T_p\mathcal {M}\), choosing any curve \(\gamma : (-1, 1) \rightarrow \mathcal {M}\) such that \(\gamma (0) = p\) and \({\dot{\gamma }}(0) = \xi _p\), we have \(\left\langle \xi _p, {{\,\mathrm{grad}\,}}\,(f\circ g)(p) \right\rangle = \xi _p (f \circ g) = \left. \frac{\mathrm {d}(f \circ g \circ \gamma )(t)}{\mathrm {d}t}\right| _{t=0} = \sum _{i=1}^m\alpha _i\left. \frac{\mathrm {d}(g_i \circ \gamma )(t)}{\mathrm {d}t}\right| _{t=0} = \sum _{i=1}^m \alpha _i \cdot (\xi _p g_i) = \sum _{i=1}^m \alpha _i \left\langle \xi _p, {{\,\mathrm{grad}\,}}\,[g(p)]_i \right\rangle \), where \(\alpha := \nabla f(g(p))\) and \(g_i\) is the i-th component of g. Thus, \({{\,\mathrm{grad}\,}}\, (f \circ g)(p) = \sum _{i=1}^m \alpha _i {{\,\mathrm{grad}\,}}\, [g(p)]_i\).

  6. We say the map \(\mathcal {K}\) is upper semicontinuous if for every \(\varepsilon > 0\) there exists \( \delta >0\) such that for all \(q \in B_\delta (p)\) we have \(P_{qp}\mathcal {K}(q) \subset \mathcal {K}(p) + {\hat{B}}_\varepsilon (0)\), where \({\hat{B}}_\varepsilon (0) := \{ V \in \mathcal {L}(T_p\mathcal {M}) : \left\| V \right\| < \varepsilon \}\).

  7. Since \(\gamma \) is a geodesic, its velocity field is parallel along \(\gamma \), i.e., \(\nabla _{{\dot{\gamma }}} {\dot{\gamma }} = 0\), so \(\frac{\mathrm {d}}{\mathrm {d}t} \Vert {\dot{\gamma }}(t) \Vert ^2 = 2\left\langle \nabla _{{\dot{\gamma }}}{\dot{\gamma }}, {\dot{\gamma }} \right\rangle \equiv 0\). Therefore, we have \(\Vert {\dot{\gamma }}(t)\Vert = \Vert {\dot{\gamma }}(0) \Vert = \Vert V \Vert \) and \(\ell (\gamma |_{[s,t]}) = \int _s^t \Vert {\dot{\gamma }}(u) \Vert \mathrm {d}u = |t - s| \Vert {\dot{\gamma }}(0)\Vert \).

  8. We only consider our method in this setting since the other algorithms mentioned in Table 2 have difficulty reaching these stopping conditions. These observations remain valid, though less pronounced, under the original termination conditions.

  9. These are the so-called normal coordinates (see, e.g., [19, p. 86] and [48, p. 131]).

References

  1. Absil, P.-A., Hosseini, S.: A collection of nonsmooth Riemannian optimization problems. In: Nonsmooth Optimization and its Applications, Springer, pp. 1–15 (2019)

  2. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)


  3. Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22, 135–158 (2012)


  4. Adler, R.L., Dedieu, J.-P., Margulies, J.Y., Martens, M., Shub, M.: Newton’s method on Riemannian manifolds and a geometric model for the human spine. IMA J. Numer. Anal. 22, 359–390 (2002)


  5. Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18, 1286–1309 (2008)


  6. Andreani, R., Haeser, G., Schuverdt, M.L., Silva, P.J.: A relaxed constant positive linear dependence constraint qualification and applications. Math. Program. 135, 255–273 (2012)


  7. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)


  8. Azagra, D., Ferrera, J., López-Mesas, F.: Nonsmooth analysis and Hamilton-Jacobi equations on Riemannian manifolds. J. Funct. Anal. 220, 304–361 (2005)


  9. Bacák, M., Bergmann, R., Steidl, G., Weinmann, A.: A second order nonsmooth variational model for restoring manifold-valued images. SIAM J. Sci. Comput. 38, A567–A597 (2016)


  10. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (1982)


  11. Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48, 334–334 (1997)


  12. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)


  13. Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer, Berlin (2000)


  14. Borckmans, P.B., Selvan, S.E., Boumal, N., Absil, P.-A.: A Riemannian subgradient algorithm for economic dispatch with valve-point effect. J. Comput. Appl. Math. 255, 848–866 (2014)


  15. Boumal, N., Absil, P.-A.: RTRMC: a Riemannian trust-region method for low-rank matrix completion. In: Advances in Neural Information Processing Systems, vol. 24 (2011)

  16. Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)


  17. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011)


  18. Cai, J.-F., Liu, H., Wang, Y.: Fast rank-one alternating minimization algorithm for phase retrieval. J. Sci. Comput. 79, 128–147 (2019)


  19. Carmo, M.P.: Riemannian Geometry. Birkhäuser, London (1992)


  20. Chen, S., Ma, S., Man-Cho So, A., Zhang, T.: Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 30, 210–239 (2020)


  21. Chen, S., Ma, S., Xue, L., Zou, H.: An alternating manifold proximal gradient method for sparse principal component analysis and sparse canonical correlation analysis. INFORMS J. Optim. 2, 192–208 (2020)


  22. Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _1\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38, B570–B592 (2016)


  23. Chen, X., Guo, L., Lu, Z., Ye, J.J.: An augmented Lagrangian method for non-Lipschitz nonconvex programming. SIAM J. Numer. Anal. 55, 168–193 (2017)


  24. Cho, M., Lee, J.: Riemannian approach to batch normalization. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  25. Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1990)


  26. Curtis, F.E., Jiang, H., Robinson, D.P.: An adaptive augmented Lagrangian method for large-scale constrained optimization. Math. Program. 152, 201–245 (2015)


  27. Daniilidis, A., Deville, R., Durand-Cartagena, E., Rifford, L.: Self-contracted curves in Riemannian manifolds. J. Math. Anal. Appl. 457, 1333–1352 (2018)


  28. de Oliveira, F.R., Ferreira, O.P.: Newton method for finding a singularity of a special class of locally Lipschitz continuous vector fields on Riemannian manifolds. J. Optim. Theory Appl. 185, 522–539 (2020)


  29. de Oliveira, F.R., Oliveira, F.R.: A global Newton method for the nonsmooth vector fields on Riemannian manifolds. J. Optim. Theory Appl. 2, 1–15 (2021)


  30. Deng, K., Peng, Z.: An inexact augmented Lagrangian method for nonsmooth optimization on Riemannian manifold, arXiv preprint arXiv:1911.09900 (2019)

  31. Dirr, G., Helmke, U., Lageman, C.: Nonsmooth Riemannian optimization with applications to sphere packing and grasping. In: Lagrangian and Hamiltonian Methods for Nonlinear Control, Springer, vol. 2007, pp. 29–45 (2006)

  32. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, Berlin (2007)


  33. Ferreira, O., Oliveira, P.: Proximal point algorithm on Riemannian manifolds. Optimization 51, 257–270 (2002)


  34. Gabay, D.: Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl. 37, 177–219 (1982)


  35. Ghahraei, E., Hosseini, S., Pouryayevali, M.R.: Pseudo-Jacobian and characterization of monotone vector fields on Riemannian manifolds. J. Convex Anal. 24, 149–168 (2017)


  36. Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms, vol. 2. Springer, Berlin (1993)


  37. Hiriart-Urruty, J.-B., Strodiot, J.-J., Nguyen, V.H.: Generalized Hessian matrix and second-order optimality conditions for problems with \(C^{1,1}\) data. Appl. Math. Optim. 11, 43–56 (1984)


  38. Hosseini, S., Pouryayevali, M.: Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal. Theory Methods Appl. 74, 3884–3895 (2011)


  39. Hu, J., Liu, X., Wen, Z.-W., Yuan, Y.-X.: A brief introduction to manifold optimization. J. Oper. Res. Soc. China 8, 199–248 (2020)


  40. Hu, J., Milzarek, A., Wen, Z., Yuan, Y.: Adaptive quadratically regularized Newton method for Riemannian optimization. SIAM J. Matrix Anal. Appl. 39, 1181–1207 (2018)


  41. Huang, W., Absil, P.-A., Gallivan, K.A.: A Riemannian symmetric rank-one trust-region method. Math. Program. 150, 179–216 (2015)


  42. Huang, W., Wei, K.: Extending FISTA to Riemannian optimization for sparse PCA, arXiv preprint arXiv:1909.05485 (2019)

  43. Huang, W., Wei, K.: Riemannian proximal gradient methods. Math. Program. 5, 1–43 (2021)


  44. Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: A generic algorithm for non-smooth optimization on manifolds. In: European Conference on Computer Vision, Springer, pp. 680–696 (2016)

  45. Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58, 431–449 (2014)


  46. Lai, R., Wen, Z., Yin, W., Gu, X., Lui, L.M.: Folding-free global conformal mapping for genus-0 surfaces by harmonic energy minimization. J. Sci. Comput. 58, 705–725 (2014)


  47. Lee, J.M.: Introduction to Smooth Manifolds. Springer, Berlin (2012)


  48. Lee, J.M.: Introduction to Riemannian Manifolds. Springer, Berlin (2018)


  49. Li, X., Sun, D., Toh, K.-C.: A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM J. Optim. 28, 433–458 (2018)


  50. Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135, 149–193 (2012)


  51. Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control. Optim. 15, 959–972 (1977)


  52. Montanari, A., Richard, E.: Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Trans. Inf. Theory 62, 1458–1484 (2015)


  53. Ozoliņš, V., Lai, R., Caflisch, R., Osher, S.: Compressed modes for variational problems in mathematics and physics. Proc. Natl. Acad. Sci. U.S.A. 110, 18368–18373 (2013)


  54. Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)


  55. Qi, L., Wei, Z.: On the constant positive linear dependence condition and its application to SQP methods. SIAM J. Optim. 10, 963–981 (2000)


  56. Rampazzo, F., Sussmann, H.J.: Commutators of flow maps of nonsmooth vector fields. J. Differ. Equ. 232, 134–175 (2007)


  57. Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1970)


  58. Spivak, M.D.: A Comprehensive Introduction to Differential Geometry, vol. 2, 3rd edn. Publish or Perish, New York (1999)


  59. Sun, D., Sun, J.: Semismooth matrix-valued functions. Math. Oper. Res. 27, 150–169 (2002)


  60. Sun, D., Sun, J., Zhang, L.: The rate of convergence of the augmented Lagrangian method for nonlinear semidefinite programming. Math. Program. 114, 349–391 (2008)


  61. Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23, 1214–1236 (2013)


  62. Waldspurger, I., d’Aspremont, A., Mallat, S.: Phase recovery, maxcut and complex semidefinite programming. Math. Program. 149, 47–81 (2015)


  63. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)


  64. Xiao, X., Li, Y., Wen, Z., Zhang, L.: A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput. 76, 364–389 (2018)


  65. Yang, L., Sun, D., Toh, K.-C.: SDPNAL\(+\): A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Math. Program. Comput. 7, 331–366 (2015)


  66. Yang, W.H., Zhang, L.-H., Song, R.: Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pac. J. Optim. 10, 415–434 (2014)


  67. Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4592–4600 (2016)

  68. Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory, pp. 1617–1638 (2016)

  69. Zhao, X.-Y., Sun, D., Toh, K.-C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20, 1737–1765 (2010)


  70. Zhu, H., Zhang, X., Chu, D., Liao, L.-Z.: Nonconvex and nonsmooth optimization with generalized orthogonality constraints: an approximate augmented Lagrangian method. J. Sci. Comput. 72, 331–372 (2017)


  71. Zhu, X., Sato, H.: Riemannian conjugate gradient methods with inverse retraction. Comput. Optim. Appl. 77, 779–810 (2020)


  72. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Gr. Stat. 15, 265–286 (2006)



Acknowledgements

The authors would like to thank the two anonymous referees and the associate editor for their valuable comments and constructive suggestions, which have greatly improved the quality and the presentation of this paper.


Corresponding author

Correspondence to Chenglong Bao.


The work was supported in part by the National Natural Science Foundation of China (11901338, 61620106010) and the Tsinghua University Initiative Scientific Research Program. The research of the third author was supported in part by the National Natural Science Foundation of China (12071464) and the Beijing Natural Science Foundation (Z190002).

Appendices

Appendix A: The deferred proofs in Sects. 4.1 and 4.2

A.1 Proof of Lemma 4.2

Proof

From the compactness of U, there exist \(C, K, r_0 > 0\) such that Lemma 4.1 holds for every \(q \in U\). By shrinking the constant \(r_0\) if necessary, we may also assume that for every \(q \in U\), \(R_q v\) is defined for every \(v \in T_q\mathcal {M}\) with \(\Vert v \Vert < r_0\). For a fixed point \(q \in \mathcal {M}\), let \(\{ e_i \}_{i \in [n]}\) be an orthonormal basis of \(T_q \mathcal {M}\). We define the isomorphism \(\psi _q: \mathbb {R}^n \rightarrow T_q \mathcal {M}\) as \(\psi _q(w_1, \ldots , w_n) = \sum _{i=1}^n w_i e_i\), and the chart \(\varphi _q: B_{r_0}(q) \rightarrow \mathbb {R}^n\) as \(\varphi _q\left( \exp _q \psi _q(w) \right) = w\), where \(\Vert w\Vert _{\mathbb {R}^n} < r_0\) (see footnote 9), which implies that \((\psi _q \circ \varphi _q )(p) = \exp _q^{-1} p\) for every \(p \in B_{r_0}(q)\).

For \(\Vert v \Vert < r_0\), we define the curve \(\gamma (t) := R_q (tv)\), then

$$\begin{aligned} d(R_q v, q) \le \ell (\gamma |_{[0, 1]}) = \int _0^1 \Vert {\dot{\gamma }}(t) \Vert \mathrm {d}t = \int _0^1 \Vert \mathrm {d}R_q|_{tv} [v] \Vert \mathrm {d}t \le {\tilde{C}}_q \Vert v \Vert , \end{aligned}$$
(A.1)

where \({\tilde{C}}_q := \sup _{\Vert v\Vert < r_0} \Vert \mathrm {d}R_q|_v \Vert \). Since \(R_q v\) is \(C^2\) with respect to q and v and U is compact, we know \({\tilde{C}} := \sup _{q \in U} {\tilde{C}}_q < \infty \). Thus, \(\exp ^{-1}_q R_q v\) is defined for \(q \in U\) and \(\Vert v \Vert < r := r_0 / {\tilde{C}}\).

Let \({\hat{B}}_{r} := \{ w \in \mathbb {R}^n : \Vert w \Vert _{\mathbb {R}^n} < r \}\), and \({\hat{R}}_q, {\hat{E}}_q: {\hat{B}}_{r} \rightarrow \mathbb {R}^n\) be the maps \(\varphi _q \circ R_q \circ \psi _q|_{{\hat{B}}_r}\), \(\varphi _q \circ \exp _q \circ \psi _q|_{{\hat{B}}_r}\), respectively. Since the differentials of \(\exp _q\) and \(R_q\) are the same at 0, then the differentials of \({\hat{E}}_q\) and \({\hat{R}}_q\) also coincide at 0. Let \({\hat{D}}_q := {\hat{E}}_q - {\hat{R}}_q\), then Taylor’s theorem states that for any \(w \in {\hat{B}}_r\), there exists \(t_w \in (0, 1)\) such that \(\Vert {\hat{D}}_q(w) \Vert _{\mathbb {R}^n} = \frac{1}{2} \Vert \nabla ^2 {\hat{D}}_q(t_w w)[w, w] \Vert _{\mathbb {R}^n} \le {\hat{C}}_q \Vert w \Vert ^2_{\mathbb {R}^n}\), where \({\hat{C}}_q := \frac{1}{2}\sup _{ \Vert w \Vert _{\mathbb {R}^n} < r} \Vert \nabla ^2 {\hat{D}}_q(w) \Vert \).

Since \(\psi _q\) is a linear isomorphism, then \((\psi _q \circ {\hat{D}}_q \circ \psi _q^{-1})(v) = (\psi _q \circ (\varphi _q \circ \exp _q - \varphi _q \circ R_q))(v) = (\psi _q \circ \varphi _q \circ \exp _q)(v) - (\psi _q \circ \varphi _q \circ R_q)(v) = v - \exp ^{-1}_q R_q v\). Therefore, by Lemma 4.1, we have

$$\begin{aligned} d(R_q v, \exp _q v)^2 \le \Vert v - \exp ^{-1}_q R_q v \Vert ^2 + K \Vert \exp ^{-1}_q R_q v \Vert ^2 \Vert v \Vert ^2 \le ({\hat{C}}_q^2+ {\tilde{C}}^2 K ) \Vert v \Vert ^4 \end{aligned}$$

for every \(v \in T_q \mathcal {M}\) with \(\Vert v \Vert < r\), where the last inequality follows from (A.1) and \(\Vert \exp ^{-1}_q R_q v \Vert = d(R_q v, q)\). Finally, since \(R_q v\) and \(\exp _q v\) are both \(C^2\) with respect to q and v, \({\hat{C}}_q\) is uniformly bounded on the compact set U. Thus, the required constant C exists and is finite. \(\square \)
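
As a quick numerical illustration of the second-order agreement established above (our own sketch, not part of the paper), consider the unit sphere \(S^2 \subset \mathbb {R}^3\) with the closed-form exponential map and the projection-type retraction \(R_q(v) = (q + v)/\Vert q + v\Vert \); the ratio \(d(R_q v, \exp _q v)/\Vert v\Vert ^2\) stays bounded as \(v \rightarrow 0\), as the lemma predicts:

import numpy as np

def sphere_exp(q, v):
    # exponential map on the unit sphere: exp_q(v) = cos(||v||) q + sin(||v||) v / ||v||
    nv = np.linalg.norm(v)
    return np.cos(nv) * q + np.sin(nv) * v / nv

def sphere_retr(q, v):
    # projection-type retraction: R_q(v) = (q + v) / ||q + v||
    return (q + v) / np.linalg.norm(q + v)

def sphere_dist(a, b):
    # geodesic distance on the unit sphere
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

q = np.array([0.0, 0.0, 1.0])
w = np.array([0.6, 0.8, 0.0])      # unit tangent vector at q
for s in [0.5, 0.1, 0.02, 0.004]:
    v = s * w                      # ||v|| = s
    ratio = sphere_dist(sphere_retr(q, v), sphere_exp(q, v)) / s**2
    print(s, ratio)                # ratios stay bounded as s -> 0

For this particular retraction the printed ratios even decay, i.e., the agreement happens to be of higher than second order, which is consistent with (but stronger than) the bound in the lemma.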

A.2 Proof of Lemma 4.4

Proof

Define \({\hat{\varphi }} = \varphi \circ \gamma \). We know that \(\hat{\varphi }\) is continuously differentiable with Lipschitz gradient on [0, 1]. By Theorem 2.3 in [37], there exist \(\xi \in (0, 1)\) and \({\hat{M}}_\xi \in \partial ^2 {\hat{\varphi }}(\xi )\) such that \({\hat{\varphi }}(1) - {\hat{\varphi }}(0) = {\hat{\varphi }}^\prime (0) + \frac{{\hat{M}}_\xi }{2}\). Since \({\hat{\varphi }}^\prime (t) = \left\langle {{\,\mathrm{grad}\,}}\varphi (\gamma (t)), {\dot{\gamma }}(t) \right\rangle \) and \(\nabla _{\dot{\gamma }}{\dot{\gamma }} = 0\), we find \({\hat{\varphi }}^{\prime \prime }(t) = \left\langle \nabla _{{\dot{\gamma }}(t)} {{\,\mathrm{grad}\,}}\varphi (\gamma (t)), {\dot{\gamma }}(t) \right\rangle \) whenever \({\hat{\varphi }}^\prime \) is differentiable at t. From the linearity of the parallel transport, it suffices to consider the case \({\hat{M}}_\xi \in \partial ^2_B {\hat{\varphi }}(\xi )\), i.e., there exists a sequence \(\{ t_k \} \rightarrow \xi \) such that \({\hat{\varphi }}^{\prime \prime }\) exists at each \(t_k\) and \({\hat{\varphi }}^{\prime \prime }(t_k) \rightarrow {\hat{M}}_\xi \). By the isometry property of the parallel transport, \(\hat{\varphi }^{\prime \prime }(t_k) = \langle \nabla _{ {\dot{\gamma }}(t_k)} {{\,\mathrm{grad}\,}}\varphi (\gamma (t_k)) , {\dot{\gamma }}(t_k) \rangle = \langle P_{\gamma }^{t_k \rightarrow \xi } \nabla _{P_\gamma ^{\xi \rightarrow t_k} \dot{\gamma }(\xi )} {{\,\mathrm{grad}\,}}\varphi (\gamma (t_k)) , {\dot{\gamma }}(\xi ) \rangle \), so there is \(M_\xi \in \partial {{\,\mathrm{grad}\,}}\varphi (\gamma (\xi ))\) such that \({\hat{M}}_\xi = \left\langle M_\xi {\dot{\gamma }}(\xi ), \dot{\gamma }(\xi ) \right\rangle \). Since \({\dot{\gamma }}(\xi ) = P_\gamma ^{0\rightarrow \xi }{\dot{\gamma }}(0)\), the conclusion holds. \(\square \)

A.3 Proof of Lemma 4.5

Proof

Suppose the conclusion does not hold. Then there exist \(\delta >0\) and a sequence \(v_k \in T_p\mathcal {M}\) with \(v_k \rightarrow 0\) such that

$$\begin{aligned} \Vert P_{\exp _p v_k, p}X(\exp _p v_k) - \nabla X(p; v_k) \Vert \ge \delta \left\| v_k \right\| . \end{aligned}$$

By Lemma 4.1 and the locally Lipschitz condition on X, there exists \(r_0>0\) such that for every \(p_1,p_2\in B_{r_0}(p)\), there exists a unique shortest geodesic joining \(p_1\) and \(p_2\), and X is L-Lipschitz in \(B_{r_0}(p)\). Fix \(0<r<r_0\); by passing to a subsequence of \(\{v_k\}\), we may assume that \({\bar{v}}_k := \frac{rv_k}{\left\| v_k \right\| } \rightarrow v \in T_p\mathcal {M}\) with \(\left\| v \right\| = r\). We define \(t_k = \left\| v_k \right\| /r\), \(q_k = \exp _p v_k\), \(r_k = \exp _p \bar{v}_k\) and \(s_k = \exp _p t_k v\). Then it holds that \(\{q_k,r_k,s_k\}\subset B_{r_0}(p)\), and hence we have

$$\begin{aligned}&\Vert P_{q_k,p}X(q_k) - \nabla X(p;v_k)\Vert \nonumber \\&\quad \le \Vert P_{q_k,p} X(q_k)-P_{s_k,p} X(s_k)\Vert + \Vert P_{s_k,p} X(s_k) - \nabla X(p;v_k)\Vert \nonumber \\&\quad \le \underbrace{\Vert P_{p,s_k}P_{q_k,p} X(q_k) - P_{q_k,s_k} X(q_k)\Vert }_{\mathrm{(A)}} + \underbrace{\Vert P_{q_k,s_k} X(q_k) - X(s_k)\Vert }_{\mathrm{(B)}}\nonumber \\&\qquad + \underbrace{\Vert P_{s_k,p} X(s_k) - t_k \nabla X(p;v)\Vert }_\mathrm{(C)} + \underbrace{\Vert \nabla X(p;v_k) - t_k\nabla X(p;v)\Vert }_\mathrm{(D)}. \end{aligned}$$
(A.2)

From Lemma 10 in [41], there exists \(C>0\) such that

$$\begin{aligned} \mathrm{(A)} \le \Vert P_{s_k,q_k}P_{p,s_k}P_{q_k,p} - {\mathrm{id}}\Vert \Vert X(q_k)\Vert \le C\max (d(p,q_k), d(q_k,s_k)) \Vert X(q_k)\Vert = o(t_k),\nonumber \\ \end{aligned}$$
(A.3)

since \(\Vert X(q_k)\Vert \le L \Vert v_k\Vert \rightarrow 0\) as \(v_k\rightarrow 0\). Moreover, since X is locally Lipschitz and directionally differentiable at p, using Lemma 4.1, as \(k\rightarrow \infty \), we have

$$\begin{aligned} \mathrm{(B)} \le L d(q_k,s_k) = o(t_k), \quad \mathrm{(C)} = \Vert P_{s_k,p} X(s_k) - X(p) - t_k\nabla X(p;v)\Vert = o(t_k),\nonumber \\ \end{aligned}$$
(A.4)

where the first estimate follows from Definition 2.10 and \(\ell (\gamma ) = d(q_k, s_k)\). Let \(q_k^t = \exp _p t{\bar{v}}_k\) and \(q_v^t = \exp _p tv\); from Lemma 10 in [41] and Lemma 4.1, as \(t \rightarrow 0\) we have

$$\begin{aligned} \begin{aligned} \Vert P_{q_k^t,p}X(q_k^t)-P_{q_v^t,p}X(q_v^t)\Vert&\le \Vert P_{p,q_v^t}P_{q_k^t,p}X(q_k^t)- P_{q_k^t,q_v^t}X(q_k^t)\Vert \\&\quad + \Vert P_{q_k^t,q_v^t}X(q_k^t)-X(q_v^t)\Vert \\&\le \Vert P_{q_v^t,q_k^t}P_{p,q_v^t}P_{q_k^t,p}-{\mathrm{id}}\Vert \Vert X(q_k^t)\Vert + L d(q_k^t,q_v^t) \\&= O(t^2) + O(t)\Vert {\bar{v}}_k - v\Vert , \end{aligned} \end{aligned}$$

where the constants in the big-O notation depend on p, v and \(\mathcal {M}\). An intermediate result of the above estimate is

$$\begin{aligned} \Vert \nabla X(p;{\bar{v}}_k) - \nabla X(p;v)\Vert \le O(\Vert {\bar{v}}_k - v\Vert ) = o(1). \end{aligned}$$
(A.5)

Thus, the last term in (A.2) is

$$\begin{aligned} \mathrm{(D)} = t_k\Vert \nabla X(p;{\bar{v}}_k) - \nabla X(p;v)\Vert = o(t_k) \end{aligned}$$
(A.6)

Combining (A.3), (A.4) and (A.6), we find that \(\Vert P_{q_k,p}X(q_k) - \nabla X(p;v_k)\Vert = o(t_k) = o(\Vert v_k\Vert )\), which contradicts our assumption. \(\square \)

A.4 Proof of Lemma 4.6

Proof

Define \(q_{k + 1} := R_{p_k} V_k\). Let \(B_r(p), K\) be the quantities in Lemma 4.1. Without loss of generality, we may assume \(p_k \in B_r(p)\) and \(\Vert V_k\Vert < r\) for every k, and shrink the ball \(B_r(p)\) such that (4.10) holds in \(B_r(p)\). Using Lemma 4.1, we know the following inequality holds for every geodesic triangle in \(B_r(p)\) with edge lengths a, b, c:

$$\begin{aligned} a^2 \le b^2 + c^2 - 2bc \cos A + Kb^2c^2 \le (b + c)^2 + Kb^2c^2, \end{aligned}$$

where A is the angle between b and c. Let \(a = d(p_k, q_{k+1})\), \(b = d(p_k, p)\), \(c = d(q_{k+1}, p)\), then we have

$$\begin{aligned} \limsup _{k \rightarrow \infty } \frac{d(p_k,q_{k+1})^2}{d(p_k,p)^2} \le K \limsup _{k \rightarrow \infty } d(q_{k+1}, p)^2 + \limsup _{k \rightarrow \infty } \left( 1 + \frac{d(q_{k+1},p)}{d(p_k,p)} \right) ^2 = 1.\nonumber \\ \end{aligned}$$
(A.7)

Define \(a = d(p_{k}, p)\), \(b = d(p_k, q_{k+1})\), \(c = d(q_{k + 1}, p)\), note that \(d(q_{k+1},p) = o(d(p_k, p))\), then for any \(\varepsilon > 0\) there exists \(k_\varepsilon > 0\) such that for any \(k \ge k_\varepsilon \), it holds that \(d(q_{k + 1}, p)^2 \le \varepsilon d(p_k, p)^2 \le 2\varepsilon d(q_{k + 1}, p)^2 + 2\varepsilon d(p_{k}, q_{k+1})^2 + \varepsilon K d(q_{k+1}, p)^2 d(p_k, q_{k+1})^2\). From (A.7) and the fact that \(p_k \rightarrow p\) as \(k \rightarrow \infty \), there exists \(K_0 > 0\) such that \(K d(p_k, q_{k+1})^2 \le 2K d(p_k, p)^2 \le 2\) for \(k \ge K_0\). Thus, for all \(\varepsilon \in (0, 1/8)\) and \(k \ge \max \{ k_\varepsilon , K_0 \}\), it holds

$$\begin{aligned} \begin{aligned} d(q_{k + 1}, p)^2 \le \frac{2\varepsilon }{1 - 2\varepsilon - \varepsilon K d(p_k, q_{k+1})^2} d(p_k, q_{k+1})^2 \le 4 \varepsilon d(p_k, q_{k+1})^2, \end{aligned} \end{aligned}$$

i.e., \(d(q_{k+1}, p)^2 = o( d(p_k, q_{k+1})^2)\). From (4.10), we have

$$\begin{aligned}d(p_k, q_{k+1}) \le d(p_k, \exp _{p_k} V_k) + d(\exp _{p_k} V_k, q_{k+1}) \le \Vert V_k \Vert + C \Vert V_k \Vert ^2,\end{aligned}$$

and hence it holds \(d(q_{k+1},p) = o( d(p_k, q_{k+1})) = o( \Vert V_k \Vert )\).

On the other hand,

$$\begin{aligned} \limsup _{k \rightarrow \infty } \frac{d(p_k,p)^2}{d(p_k,q_{k+1})^2} \le K \limsup _{k \rightarrow \infty } d(q_{k+1}, p)^2 + \limsup _{k \rightarrow \infty } \left( 1 + \frac{d(q_{k+1},p)}{d(p_k,q_{k+1})} \right) ^2 = 1. \end{aligned}$$

Combining with (A.7), we have \({\mathop {\lim }\limits _{k \rightarrow \infty }} \frac{d(p_k, p)}{d(p_k, q_{k+1})} = 1\). \(\square \)

Appendix B: The deferred proofs in Sect. 4.3

B.1 Proof of Theorem 4.4

We prove Theorem 4.4 in this section. First, we give a simple result that will be frequently used in our proof: For \(p \in {\bar{\mathcal {M}}}\), \(\xi \in T_p {\bar{\mathcal {M}}}\), and a curve \(\gamma : [0, 1] \rightarrow {\bar{\mathcal {M}}}\) with \(\gamma (0) = p\), the vector field \(Z(t) := {\bar{P}}_\gamma ^{0 \rightarrow t} \xi \) is parallel along \(\gamma \), and hence by the definition of parallel transport we have

$$\begin{aligned} {\bar{\nabla }}_{{\dot{\gamma }}} \bar{P}_\gamma ^{0 \rightarrow t} \xi = 0. \end{aligned}$$
(B.1)

Similarly, for \(p \in \mathcal {M}, \zeta \in T_p \mathcal {M}\) and \(\gamma : [0, 1] \rightarrow \mathcal {M}\), it holds that \(\nabla _{{\dot{\gamma }}} P^{0 \rightarrow t}_\gamma \zeta = 0\).

The following lemma compares the parallel transports of the manifold \(\mathcal {M}\) and its ambient space \({\bar{\mathcal {M}}}\).

Lemma B.1

There exists \(C > 0\) such that for every \(p \in \mathcal {M}\), and every smooth curve \(\gamma : [-1, 1] \rightarrow \mathcal {M}\) with \(\gamma (0) = p\), and every \(v \in T_p \mathcal {M}\), \(t \in [0, 1]\), it holds that

$$\begin{aligned} \big \Vert P_\gamma ^{0 \rightarrow t} v - \bar{P}_{\gamma }^{0 \rightarrow t} v \big \Vert \le C \ell (\gamma |_{[0,t]}) \Vert v \Vert . \end{aligned}$$
(B.2)

Moreover, there exists \(C^\prime > 0\) such that for every \(p \in \mathcal {M}\), \(w \in T_p \mathcal {M}\) with \(\Vert w\Vert \le 2\), and every \(t \in [0, 1]\), the following inequality holds for \(\gamma (t) := \exp _p(tw)\).

$$\begin{aligned} \big \Vert P_\gamma ^{0 \rightarrow t} v - \bar{P}_{\gamma }^{0 \rightarrow t} v - t \mathrm {II}({\dot{\gamma }}(t), P^{0 \rightarrow t}_\gamma v) \big \Vert \le C^\prime t^2. \end{aligned}$$
(B.3)

Proof

Let \(Y: [0, 1] \rightarrow T\mathcal {M}\) be a smooth vector field along \(\gamma \). For any fixed \(t \in [0, 1]\) and \(\xi \in T_{\gamma (t)} {\bar{\mathcal {M}}}\), it holds that

$$\begin{aligned} \begin{aligned} \left\langle \frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}^{s \rightarrow t}_\gamma Y(s), \xi \right\rangle&= \frac{\mathrm {d}}{\mathrm {d}s} \left\langle {\bar{P}}^{s \rightarrow t}_\gamma Y(s), \xi \right\rangle = \frac{\mathrm {d}}{\mathrm {d}s} \left\langle Y(s), {\bar{P}}^{t \rightarrow s}_\gamma \xi \right\rangle \\&= \left\langle {\bar{\nabla }}_{{\dot{\gamma }}} Y(s), {\bar{P}}^{t \rightarrow s}_\gamma \xi \right\rangle + \left\langle Y(s), {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}^{t \rightarrow s}_\gamma \xi \right\rangle \\&\overset{(\mathrm{B.1})}{=} \left\langle {\bar{P}}^{s \rightarrow t}_\gamma {\bar{\nabla }}_{{\dot{\gamma }}} Y(s), \xi \right\rangle , \end{aligned} \end{aligned}$$

Therefore, \(\frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}_\gamma ^{s \rightarrow t}Y(s) = \bar{P}_{\gamma }^{s \rightarrow t} {\bar{\nabla }}_{{\dot{\gamma }}} Y(s)\).

Setting \(Y(t) = P_\gamma ^{0 \rightarrow t} v\) for \(v \in T_p \mathcal {M}\), we know

$$\begin{aligned} {\bar{\nabla }}_{{\dot{\gamma }}} Y \overset{(4.22)}{=} \nabla _{{\dot{\gamma }}} Y + \mathrm {II}({\dot{\gamma }}, Y) \overset{(\mathrm{B.1})}{=} \mathrm {II}({\dot{\gamma }}, Y), \end{aligned}$$

and thus,

$$\begin{aligned} Y(t) - {\bar{P}}_{\gamma }^{0 \rightarrow t}Y(0) = \int _0^t \frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}_{\gamma }^{s \rightarrow t}Y(s) \mathrm {d}s = \int _0^t {\bar{P}}_{\gamma }^{s \rightarrow t}\mathrm {II}({\dot{\gamma }}(s), Y(s)) \mathrm {d}s. \end{aligned}$$
(B.4)

Since \(\mathrm {II}\) smoothly depends on the Riemannian metric of \({\bar{\mathcal {M}}}\) and \(\mathcal {M}\) is compact, the constant \(C := \sup \{ \Vert \mathrm {II}(u, v) \Vert : p \in \mathcal {M}, u, v \in T_p\mathcal {M}, \Vert u\Vert = \Vert v\Vert = 1 \}\) is finite. As the parallel transport is an isometry, it holds that \(\Vert Y(t) \Vert = \Vert Y(0)\Vert = \Vert v\Vert \) and

$$\begin{aligned} \begin{aligned} \big \Vert Y(t) - {\bar{P}}_{\gamma }^{0 \rightarrow t}Y(0) \big \Vert \le \int _0^t C\Vert v \Vert \Vert {\dot{\gamma }}(s) \Vert \mathrm {d}s = C\Vert v \Vert \ell (\gamma |_{[0,t]}). \end{aligned} \end{aligned}$$

Moreover, we have

$$\begin{aligned} \begin{aligned} t \mathrm {II}({\dot{\gamma }}(t), Y(t)) - [Y(t) - {\bar{P}}_{\gamma }^{0 \rightarrow t} Y(0)]&\overset{(\mathrm{B.4})}{=} \int _0^t \mathrm {II}({\dot{\gamma }}(t), Y(t)) - {\bar{P}}_{\gamma }^{s \rightarrow t}\mathrm {II}({\dot{\gamma }}(s), Y(s)) \mathrm {d}s \\&= \int _0^t \int _s^t \frac{\mathrm {d}}{\mathrm {d}u} \bar{P}_{\gamma }^{u \rightarrow t}\mathrm {II}({\dot{\gamma }}(u), Y(u)) \mathrm {d}u \mathrm {d}s\\&= \int _0^t \int _s^t {\bar{P}}_{\gamma }^{u \rightarrow t} {\bar{\nabla }}_{{\dot{\gamma }}} \mathrm {II}({\dot{\gamma }}(u), Y(u)) \mathrm {d}u \mathrm {d}s. \end{aligned} \end{aligned}$$

When \(\gamma (t) := \exp _p (tw)\) for \(w \in T_p \mathcal {M}\) with \(\Vert w \Vert \le 2\), the above integrand is continuous with respect to v, w, t, u and p, and thus, the constant \(C^\prime := 2\sup \{ \Vert {\bar{\nabla }}_{{\dot{\gamma }}} \mathrm {II}({\dot{\gamma }}, P_\gamma ^{0 \rightarrow t} v) \Vert : p \in \mathcal {M}\text { and } w, v \in T_p\mathcal {M}, \Vert w \Vert \le 2, \Vert v\Vert \le 2, t \in [0, 1] \}\) is finite. Therefore, (B.3) holds. \(\square \)
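
A small sanity check of (B.2), ours and not part of the original argument: take \(\mathcal {M}= S^2\) embedded in \({\bar{\mathcal {M}}} = \mathbb {R}^3\), where the ambient parallel transport along any curve is the identity (under the canonical identification of tangent spaces), so (B.2) reduces to \(\Vert P_\gamma ^{0 \rightarrow t} v - v \Vert \le C \ell (\gamma |_{[0,t]}) \Vert v \Vert \). With the closed-form transport along a great circle:

import numpy as np

def sphere_transport(p, u, v, t):
    # parallel transport of v in T_p S^2 along gamma(s) = cos(s) p + sin(s) u
    # (u a unit tangent at p), from s = 0 to s = t
    a = np.dot(v, u)                            # component along the geodesic direction
    gamma_dot = -np.sin(t) * p + np.cos(t) * u  # gamma'(t)
    return a * gamma_dot + (v - a * u)

p = np.array([0.0, 0.0, 1.0])
u = np.array([1.0, 0.0, 0.0])                   # unit tangent at p
v = np.array([0.3, 0.4, 0.0])                   # another tangent vector at p
for t in [0.5, 0.1, 0.02]:
    err = np.linalg.norm(sphere_transport(p, u, v, t) - v)
    print(t, err / (t * np.linalg.norm(v)))     # bounded ratio, consistent with (B.2)

Here \(\ell (\gamma |_{[0,t]}) = t\) since the great circle is traversed at unit speed.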

The following lemma controls the error of parallelly transporting a vector along the shortest geodesic on \(\mathcal {M}\) and then moving back along the shortest geodesic on \({\bar{\mathcal {M}}}\).

Lemma B.2

Fix \(p \in \mathcal {M}\), then there exists \(C > 0\) such that for every \(v \in T_p \mathcal {M}\) with \(\Vert v \Vert \le 2\), \(t \in [0, 1]\) and \(w, \xi \in T_p {\bar{\mathcal {M}}}\), the following inequality holds

$$\begin{aligned} |\left\langle {\bar{P}}_{\gamma (t),p} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle - \left\langle w, \xi \right\rangle | \le C t^2 \Vert w \Vert \Vert \xi \Vert , \end{aligned}$$
(B.5)

where \(\gamma (t) := \exp _p (tv)\). Consequently,

$$\begin{aligned} \big \Vert {\bar{P}}_{p,\gamma (t)} - {\bar{P}}^{0 \rightarrow t}_\gamma \big \Vert \le Ct^2. \end{aligned}$$
(B.6)

Proof

Let \(w, \xi \in T_p {\bar{\mathcal {M}}}\) and \(v \in T_p \mathcal {M}\) with \(\Vert v\Vert \le 2\) and \(\gamma (t) := \exp _p(tv)\), then

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} \left\langle {\bar{P}}_{\gamma (t),p} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle&= \frac{\mathrm {d}}{\mathrm {d}t} \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&= \left\langle {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\quad + \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\overset{(\mathrm{B.1})}{=} \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle , \end{aligned}$$
(B.7)
$$\begin{aligned} \frac{\mathrm {d}^2}{\mathrm {d}t^2} \left\langle {\bar{P}}_{\gamma (t),p} \bar{P}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle&= \left\langle {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\quad + \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\quad \overset{(\mathrm{B.1})}{=} \left\langle \bar{P}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}} \bar{\nabla }_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle . \end{aligned}$$
(B.8)

Let \(Y(q) := {\bar{P}}_{pq} \xi \) for \(q \in {\bar{\mathcal {M}}}\); then \(Y(\gamma (t)) = {\bar{P}}_{p, \gamma (t)} \xi \). From [19, Proposition 2.2(c)], we know \({\bar{\nabla }}_{{\dot{\gamma }}} Y = {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \). By the definition of Y, \({\bar{\nabla }}_v Y(p) = 0\) for every \(v \in T_p {\bar{\mathcal {M}}}\). Therefore, the right-hand side of (B.7) vanishes at \(t = 0\). Since \(\bar{\nabla }_{{\dot{\gamma }}}{\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \) is continuous with respect to t and v and linear with respect to \(\xi \), the constant

$$\begin{aligned} C := \sup \left\{ \Vert {\bar{\nabla }}_{{\dot{\gamma }}}\bar{\nabla }_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \Vert : t \in [0, 1], \Vert v \Vert \le 2, \Vert \xi \Vert \le 1 \right\} < \infty . \end{aligned}$$

Applying Taylor’s theorem at \(t = 0\), we know for every \(t \in [0, 1]\) and \(w, \xi \in T_p {\bar{\mathcal {M}}}\),

$$\begin{aligned} |\left\langle {\bar{P}}_{\gamma (t),p} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle - \left\langle w, \xi \right\rangle | \le C t^2 \Vert w \Vert \Vert \xi \Vert . \end{aligned}$$

Since C is independent of v, the above holds for every \(\Vert v \Vert \le 2\). Finally, (B.6) follows from

$$\begin{aligned} \big \Vert {\bar{P}}_{p,\gamma (t)} - {\bar{P}}^{0 \rightarrow t}_\gamma \big \Vert = \big \Vert \bar{P}_{\gamma (t),p} {\bar{P}}_\gamma ^{0 \rightarrow t} - \mathrm {Id}\big \Vert \le Ct^2. \end{aligned}$$

This completes the proof. \(\square \)

The lemma below bounds the difference between the inverse exponential maps of \(\mathcal {M}\) and \({\bar{\mathcal {M}}}\); it is proved by extending \(\exp _p\) to a retraction on \({\bar{\mathcal {M}}}\) using Fermi coordinates [48, p. 135] and applying [71, Lemma 3] to bound the difference of two inverse retractions.

Lemma B.3

Fix \(p_0 \in \mathcal {M}\). Then there exist \(C, r > 0\) such that the following inequality holds for every \(p, q \in B_{r}(p_0)\).

$$\begin{aligned} \big \Vert \exp ^{-1}_p q - \overline{\exp }^{-1}_p q \big \Vert \le C d_{{\bar{\mathcal {M}}}}(p, q)^2. \end{aligned}$$
(B.9)

Proof

From Lemma 4.1, there exists \(r > 0\) such that \(\exp _p\) is a diffeomorphism from \(\{ v \in T_p \mathcal {M}: \Vert v \Vert < 4r \}\) to \(B_{4r}(p)\) for every \(p \in B_{4r}(p_0)\). According to Theorem 5.25 in [48], there exists \(r^\prime > 0\) such that \({\overline{\exp }}\) is a diffeomorphism from \(V := \{ (p, v) \in N\mathcal {M}: p \in B_{2r}(p_0), \Vert v \Vert < r^\prime \}\) onto its image, where \(N\mathcal {M}:= \bigcup _{p \in \mathcal {M}} (T_p \mathcal {M})^{\perp }\) is the normal bundle. Let \(E_1, \ldots , E_{d}\) be vector fields on \(B_{4r}(p_0)\) such that \(\{ E_1(p), \ldots , E_{n}(p) \}\) and \(\{ E_{n+1}(p), \ldots , E_{d}(p) \}\) are orthonormal bases of \(T_p\mathcal {M}\) and \((T_p\mathcal {M})^\perp \), respectively. For \(p \in B_r(p_0)\) and \(v := \sum _{i=1}^d w_i E_i(p) \in T_p {\bar{\mathcal {M}}}\) with \(\Vert v_\top \Vert < r\) and \(\Vert v_\perp \Vert < r^\prime \), we define \(q(p, v) := \exp _p (v_\top )\) and

$$\begin{aligned} E(p, v) := {\overline{\exp }}_{q(p, v)} \left( \sum _{i=n+1}^d w_i E_i(q(p, v)) \right) . \end{aligned}$$

The above discussion shows that E is well-defined, and \(E(p, v) = \exp _p v\) if \(v \in T_p \mathcal {M}\) and \(\Vert v\Vert < r\), and \(E(p, v) = {\overline{\exp }}_p v\) if \(v \in (T_p \mathcal {M})^\perp \) and \(\Vert v\Vert < r^\prime \). Thus, E can be regarded as a retraction on \({\bar{\mathcal {M}}}\) defined near \(p_0\), and (B.9) follows from [71, Lemma 3]. \(\square \)
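
A numerical sanity check of (B.9), again ours and not part of the proof: on the unit sphere in \(\mathbb {R}^3\) the ambient inverse exponential is simply \({\overline{\exp }}^{-1}_p q = q - p\), and the ratio \(\Vert \exp ^{-1}_p q - (q - p)\Vert / d_{{\bar{\mathcal {M}}}}(p, q)^2\) stays bounded:

import numpy as np

def sphere_log(p, q):
    # inverse exponential map on the unit sphere: exp_p^{-1}(q) in T_p S^2
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    u = q - np.dot(p, q) * p
    return theta * u / np.linalg.norm(u)

p = np.array([0.0, 0.0, 1.0])
u = np.array([1.0, 0.0, 0.0])
for theta in [0.5, 0.1, 0.02]:
    q = np.cos(theta) * p + np.sin(theta) * u
    diff = np.linalg.norm(sphere_log(p, q) - (q - p))   # Euclidean log is q - p
    print(theta, diff / np.linalg.norm(q - p) ** 2)     # ratio stays bounded (about 1/2)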

Proof of Theorem 4.4

(i) First, we verify the local Lipschitz property of X. Since \({\bar{X}}\) is locally Lipschitz at \(p \in \mathcal {M}\), there exist \(L_p > 0\) and a neighborhood \({\bar{U}}_p \subset {\bar{\mathcal {M}}}\) of p such that \({\bar{X}}\) is \(L_p\)-Lipschitz in \({\bar{U}}_p\). By Lemma B.1, we know for every \(p, q \in \mathcal {M}\cap {\bar{U}}_p\) and every geodesic \(\gamma \) joining p and q, it holds that

$$\begin{aligned} \begin{aligned} \big \Vert P_{\gamma }^{0 \rightarrow 1} X(p) - X(q) \big \Vert&\le \big \Vert P_{\gamma }^{0 \rightarrow 1} X(p) - {\bar{P}}_{\gamma }^{0 \rightarrow 1}X(p) \big \Vert + \big \Vert {\bar{P}}_{\gamma }^{0 \rightarrow 1}X(p) - X(q) \big \Vert \\&\le \left( C \sup _{p \in \mathcal {M}}\Vert X(p) \Vert + L_p \right) \ell (\gamma ), \end{aligned} \end{aligned}$$

where C is the constant in Lemma B.1. Since \(\mathcal {M}\) is compact and X is continuous, \(M := \sup _{p \in \mathcal {M}} \Vert X(p) \Vert < \infty \), and hence X is Lipschitz on \(\mathcal {M}\cap {\bar{U}}_p\), i.e., X is locally Lipschitz at p.

(ii) Next, we show that X is directionally differentiable at p in the Hadamard sense. Fix \(v \in T_p \mathcal {M}\) with \(\Vert v \Vert = 1\) and let \(\gamma (t) := \exp _p(tv^\prime )\) for \(t \in (-1, 1)\) and \(v^\prime \in T_p \mathcal {M}\) with \(\Vert v^\prime \Vert \le 2\), and decompose \(P_\gamma ^{t \rightarrow 0} X(\exp _p(tv^\prime )) - X(p) \) into

$$\begin{aligned} \begin{aligned}&\underbrace{ (P_\gamma ^{t \rightarrow 0} - {\bar{P}}_\gamma ^{t \rightarrow 0})X(\gamma (t)) - t S(t) }_{\text {(A)}} + \underbrace{ (\bar{P}_\gamma ^{t \rightarrow 0} - {\bar{P}}_{\gamma (t),p}) X(\gamma (t)) }_{\text {(B)}} \\&\qquad + \underbrace{ {\bar{P}}_{\gamma (t),p} X(\gamma (t)) - X(p) + tS(t) }_{\text {(C)}}, \end{aligned} \end{aligned}$$
(B.10)

where \(S(t) := \mathrm {II}(-{\dot{\gamma }}(0), P_\gamma ^{t \rightarrow 0} X(\gamma (t)))\).

From Lemmas B.1 and B.2, there exists \(C > 0\) (independent of \(v^\prime \)) such that

$$\begin{aligned} \begin{aligned}&| (\mathrm A) | \overset{(\mathrm{B.3})}{\le } Ct^2 \quad \text { and } \quad |(\mathrm B)| \le \Vert \bar{P}_\gamma ^{t \rightarrow 0} - {\bar{P}}_{\gamma (t), p} \Vert \Vert X(\gamma (t)) \Vert \overset{(\mathrm{B.6})}{\le } CM t^2. \end{aligned} \end{aligned}$$

When \(t \rightarrow 0^{+}\) and \(v^\prime \rightarrow v\), we know \(|(\mathrm A)| + |(\mathrm B)| = O(t^2)\) and \(v^\prime _t := t^{-1}\overline{\exp }^{-1}_p \gamma (t) \rightarrow v\), and hence the continuity of S(t) and the Hadamard differentiability (4.24) imply that

$$\begin{aligned} \begin{aligned}&\lim _{\begin{array}{c} t \downarrow 0 \\ v^\prime \rightarrow v \end{array}} \frac{1}{t} \left[ P_{\exp _p(tv^\prime ),p} X(\exp _p(tv^\prime )) - X(p) \right] = \lim _{\begin{array}{c} t \downarrow 0 \\ v^\prime \rightarrow v \end{array}} \frac{1}{t}(\mathrm C) \\&\quad = \lim _{t \downarrow 0} \frac{1}{t} \left[ {\bar{P}}_{\overline{\exp }_p(tv_t^\prime ),p} X({\overline{\exp }}_p(tv_t^\prime )) - X(p) \right] \\&\qquad + \lim _{\begin{array}{c} t \downarrow 0 \\ v^\prime \rightarrow v \end{array}} S(t) \overset{(4.24)}{=} \bar{\nabla } X(p; v) - \mathrm {II}(v, X(p)). \end{aligned} \end{aligned}$$

Note that \(\mathrm {II}(v, X(p)) \in (T_p\mathcal {M})^\perp \), we find \(\nabla X(p; v) = {\bar{\nabla }} X(p; v) - \mathrm {II}(v, X(p)) = ({\bar{\nabla }} X(p; v))_\top \) for every \(\Vert v\Vert = 1\). Since \(\nabla X(p; tv) = t\nabla X(p; v)\) for all \(t > 0\), we know X is directionally differentiable at p in the Hadamard sense.

(iii.b) Finally, we need to verify the inequality (4.27). Suppose that the assumption (iii.b) holds. Since the parallel transport is an isometry, from (4.26) there exist \(C > 0, \delta \in (0, 1)\) such that for every \(q \in \mathcal {M}\) with \(d_{{\bar{\mathcal {M}}}}(p, q) < \delta \) and \({\bar{H}}_q \in {\bar{\mathcal {K}}}(q)\), it holds that

$$\begin{aligned} \Vert {\bar{P}}_{pq} X(p) - X(q) - {\bar{H}}_q {\overline{\exp }}_q^{-1} p \Vert \le C d_{{\bar{\mathcal {M}}}}(p, q)^{1 + \mu } \le C d_\mathcal {M}(p, q)^{1 + \mu }.\nonumber \\ \end{aligned}$$
(B.11)

Without loss of generality, we may assume that \(B_\delta (p) \subset V_p\), where \(V_p\) is the neighborhood of p in the assumption (iii.b). For every \(q \in \mathcal {M}\) and \(H_q \in \mathcal {K}(q)\) such that \(d_\mathcal {M}(q, p) < \delta \), the assumption (iii.b) states that there exists \({\bar{H}}_q \in {\bar{\mathcal {K}}}(q)\) satisfying \(({\bar{H}}_q \exp ^{-1}_qp)_\perp = \mathrm {II}(\exp ^{-1}_qp, X(q))\) and \(({\bar{H}}_q \exp ^{-1}_qp)_\top = H_q \exp ^{-1}_qp\). Let \(\alpha = d_\mathcal {M}(p, q)\), \(\xi := \alpha ^{-1} \exp ^{-1}_p q\) and \(\gamma (t) := \exp _p( t \xi )\); then it holds that

$$\begin{aligned} \begin{aligned}&\Vert P_{pq} X(p) - X(q) - H_q \exp ^{-1}_q p \Vert \\&\quad \le \underbrace{ \Vert P_{pq} X(p) - {\bar{P}}_{\gamma }^{0 \rightarrow \alpha } X(p) + \mathrm {II}(\exp ^{-1}_q p, X(q)) \Vert }_{\text {(A)}} \\&\qquad + \underbrace{\Vert ( {\bar{P}}_{\gamma }^{0 \rightarrow \alpha } - {\bar{P}}_{pq} )X(p) \Vert }_{\text {(B)}} + \underbrace{\Vert \bar{P}_{pq}X(p) - X(q) - {\bar{H}}_q {\overline{\exp }}^{-1}_qp \Vert }_{\text {(C)}} \\&\qquad + \underbrace{\Vert {\bar{H}}_q ({\overline{\exp }}^{-1}_qp - \exp ^{-1}_q p) \Vert }_{\text {(D)}}. \end{aligned} \end{aligned}$$

Let \(r, {\tilde{C}} > 0\) be constants such that Lemmas B.1, B.2 and B.3 hold with \(p_0 = p\), and let \({\tilde{L}}_p > 0\) be the Lipschitz constant of X at p; by the compactness of \(\mathcal {M}\) and the continuity of \(\mathrm {II}(v, w)\), we further assume that \(\sup \{ \Vert \mathrm {II}(v, w) \Vert : p \in \mathcal {M}, v, w \in T_p\mathcal {M}, \Vert v\Vert =\Vert w\Vert =1 \}< {\tilde{C}} < \infty \). Since \({\dot{\gamma }}(\alpha ) = -\alpha ^{-1} \exp ^{-1}_q p\) and \(\mathrm {II}(\exp ^{-1}_qp, X(q)) = -\alpha \mathrm {II}({\dot{\gamma }}(\alpha ), X(q))\), we have

$$\begin{aligned} \begin{aligned} |(\mathrm A)|&\le \Vert P_{pq} X(p) - {\bar{P}}_{\gamma }^{0 \rightarrow \alpha } X(p) - \alpha \mathrm {II}({\dot{\gamma }}(\alpha ), P_\gamma ^{0 \rightarrow \alpha }X(p)) \Vert \\&\qquad + \alpha \Vert \mathrm {II}({\dot{\gamma }}(\alpha ), P_\gamma ^{0 \rightarrow \alpha }X(p) - X(q)) \Vert \\&\quad \overset{(\mathrm{B.3})}{\le }{\tilde{C}} \alpha ^2 + {\tilde{C}} \alpha \Vert {\dot{\gamma }}(\alpha ) \Vert \Vert P_\gamma ^{0 \rightarrow \alpha }X(p) - X(q) \Vert \le {\tilde{C}}(1 + {\tilde{L}}_p) \alpha ^2, \end{aligned} \end{aligned}$$

where the last inequality follows from the Lipschitzness of X at p. From (B.6) we find that \(|(\mathrm B)| \le {\tilde{C}}M\alpha ^2\). Since \(\mathcal {K}\) is locally bounded, there exists \({\tilde{M}} > 0\) such that all elements in \(\bigcup _{d(p, q) \le \delta } {\bar{\mathcal {K}}}(q)\) are bounded by \(\tilde{M}\), and thus (B.9) yields \(|(\mathrm D)| \le {\tilde{C}} {\tilde{M}} \alpha ^2\). Combining with (B.11) and letting \(C_0 := {\tilde{C}}(M + {\tilde{M}} + {\tilde{L}}_p + 1)\), it holds that

$$\begin{aligned} \begin{aligned} \Vert P_{pq} X(p) - X(q) - H_q \exp ^{-1}_q p \Vert \le C d_\mathcal {M}(p, q)^{1 + \mu } + C_0 d_\mathcal {M}(p, q)^2. \end{aligned} \end{aligned}$$

Thus, the inequality (4.27) holds for \(q \in \mathcal {M}\) with \(d_\mathcal {M}(p, q) < {\tilde{\delta }} := \min \{ 1, r, \delta \}\) and \({\hat{C}} := C + C_0\). Moreover, when \(\mu \in [0, 1)\), the same inequality holds for \(q \in \mathcal {M}\) with \(d_\mathcal {M}(p, q) < {\tilde{\delta }} := \min \{ r, \delta , (C/C_0)^{\frac{1}{1 - \mu }} \}\) and \({\hat{C}} := 2C\).

(iii.a) When \(X(p) = 0\), we know

$$\begin{aligned} \begin{aligned} \Vert X(q) + H_q \exp ^{-1}_q p \Vert&= \Vert ({\bar{X}}(q) + {\bar{H}}_q \exp ^{-1}_q p)_\top \Vert \le \Vert {\bar{X}}(q) + {\bar{H}}_q \exp ^{-1}_q p\Vert \\&\le \Vert {\bar{X}}(q) + {\bar{H}}_q {\overline{\exp }}^{-1}_q p \Vert + \Vert \bar{H}_q ({\overline{\exp }}^{-1}_qp - \exp ^{-1}_q p) \Vert . \end{aligned} \end{aligned}$$

The above two terms can be controlled by (4.26) and (B.9), respectively, and therefore the conclusion follows from a similar discussion as in (iii.b).

(iv) When \(\mu = 0\), for every \(C > 0\), if (4.26) holds, then (4.27) holds with \({\hat{C}} = 2C\), which implies the semismoothness in Definition 4.1. The conclusion for the semismoothness with order \(\mu \in (0, 1]\) directly follows from (i), (ii) and (iii). \(\square \)
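
To make the identity \(\nabla X(p; v) = ({\bar{\nabla }} X(p; v))_\top \) from part (ii) concrete, here is a small numerical sketch of ours (not part of the paper): on \(S^2 \subset \mathbb {R}^3\) we take the smooth tangent field \({\bar{X}}(x) = Ax - \left\langle Ax, x \right\rangle x\) for a fixed symmetric matrix A of our choosing, approximate the intrinsic directional derivative by a finite difference along a geodesic with parallel transport back to p, and compare it with the tangential projection of the ambient directional derivative:

import numpy as np

def transport_back(p, u, t, w):
    # parallel transport of w in T_{gamma(t)} S^2 back to T_p S^2 along
    # gamma(s) = cos(s) p + sin(s) u, with u a unit tangent at p
    gdot = -np.sin(t) * p + np.cos(t) * u      # unit tangent gamma'(t)
    a = np.dot(w, gdot)
    return a * u + (w - a * gdot)

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 1.0]])

def X_bar(x):
    # a smooth vector field on R^3 that is tangent to the unit sphere
    return A @ x - np.dot(A @ x, x) * x

p = np.array([0.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 0.0])                  # unit tangent direction at p
t = 1e-6

# tangential part of the ambient (Euclidean) directional derivative
d_bar = (X_bar(p + t * v) - X_bar(p)) / t
d_bar_top = d_bar - np.dot(d_bar, p) * p

# intrinsic directional derivative along the geodesic exp_p(t v)
gamma_t = np.cos(t) * p + np.sin(t) * v
d_intr = (transport_back(p, v, t, X_bar(gamma_t)) - X_bar(p)) / t

print(np.linalg.norm(d_intr - d_bar_top))      # small (of the order of the step size t)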

B.2 Proof of Proposition 4.2

First, we briefly review how to represent quantities on manifolds in local coordinates. Let \((U, \varphi )\) be a chart near \(p \in \mathcal {M}\), and \(\{ e_i \}\) be the standard basis of \(\mathbb {R}^n\), i.e., the j-th component of \(e_i\) is \(\delta _{ij}\), and let \(E_i := (\mathrm {d}\varphi )^{-1}[e_i]\); then \(E_i\) is a smooth vector field near p and \(\{ E_i(q) \}\) is a basis of \(T_q\mathcal {M}\) for every \(q \in U\). Let \(g_{ij} := \left\langle E_i, E_j \right\rangle \), \(G := (g_{ij})_{i,j\in [n]}\) and \(\varGamma _{ij}^k\) be the Christoffel symbols such that \(\nabla _{E_i}{E_j} = \sum _{k=1}^n \varGamma _{ij}^k E_k\), and \(g^{ij}\) be such that \(( g^{ij} )_{i,j\in [n]} = G^{-1}\); we call G the metric matrix. The gradient of a smooth function \(f:\mathcal {M}\rightarrow \mathbb {R}\) is \({{\,\mathrm{grad}\,}}f = \sum _{i, j=1}^n (g^{ij} E_i f)E_j\) (see [48, p. 27]). Two smooth vector fields X, Y defined near p can be represented as \(X = \sum _{i=1}^n X^i E_i\) and \(Y = \sum _{i=1}^n Y^i E_i\), and their inner product is \(\left\langle X, Y \right\rangle = \sum _{i,j=1}^n g_{ij} X^i Y^j\). The covariant derivative \(\nabla _XY\) can be written as (see [48, p. 92])

$$\begin{aligned} \nabla _{X}Y = \sum _{i=1}^n (X Y^i) E_i + \sum _{i, j, k=1}^n X^iY^j \varGamma _{ij}^k E_k. \end{aligned}$$
(B.12)

When \((U, \varphi )\) is the chart constructed in the proof of Lemma 4.2, \(\varphi ^{-1}\) is the normal coordinate at \(p \in \mathcal {M}\) (see [48, p. 132]). Proposition 5.24 in [48] shows that \(g_{ij}(p) = \delta _{ij}\), \(\varGamma _{ij}^k(p) = 0\), and all partial derivatives of \(g_{ij}\) vanish at p. Thus, \({{\,\mathrm{grad}\,}}f(p) = \sum _{i=1}^n (E_i f)(p) E_i(p)\), \(\nabla _XY(p) = \sum _{i=1}^n (X Y^i)(p) E_i(p)\) and \(\left\langle X(p), Y(p) \right\rangle = \sum _{i=1}^n X^i(p) Y^i(p)\).
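
For concreteness (our own illustration, not part of the paper), the coordinate formulas above can be checked symbolically for the round metric on \(S^2\) in spherical coordinates \((\theta , \phi )\), where \(G = \mathrm {diag}(1, \sin ^2\theta )\); the snippet below computes the Christoffel symbols via the standard formula \(\varGamma _{ij}^k = \frac{1}{2}\sum _l g^{kl}(\partial _i g_{jl} + \partial _j g_{il} - \partial _l g_{ij})\) and the coordinate expression \({{\,\mathrm{grad}\,}}f = \sum _{i, j} (g^{ij} E_i f)E_j\):

import sympy as sp

theta, phi = sp.symbols('theta phi', positive=True)
x = [theta, phi]
g = sp.Matrix([[1, 0], [0, sp.sin(theta)**2]])   # metric matrix G of the round sphere
g_inv = g.inv()

def christoffel(k, i, j):
    # Gamma^k_{ij} = 1/2 * sum_l g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})
    return sp.simplify(sp.Rational(1, 2) * sum(
        g_inv[k, l] * (sp.diff(g[j, l], x[i]) + sp.diff(g[i, l], x[j])
                       - sp.diff(g[i, j], x[l])) for l in range(2)))

print(christoffel(0, 1, 1))   # Gamma^theta_{phi phi} (equals -sin(theta)*cos(theta))
print(christoffel(1, 0, 1))   # Gamma^phi_{theta phi} (equals cos(theta)/sin(theta))

# coordinate expression of the gradient: grad f = sum_{i,j} g^{ij} (E_i f) E_j
f = sp.Function('f')(theta, phi)
grad_f = g_inv * sp.Matrix([sp.diff(f, theta), sp.diff(f, phi)])
print(grad_f)                 # components (df/dtheta, (1/sin(theta)^2) df/dphi)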

The following two lemmas provide chain rules on manifolds; they are proved by pulling functions on manifolds back to Euclidean spaces and then applying the Euclidean chain rules.

Lemma B.4

Let \(\mathcal {M}, \mathcal {N}\) be two Riemannian manifolds, \(F: \mathcal {M}\rightarrow \mathcal {N}\) be a smooth map, and \(g: \mathcal {N}\rightarrow \mathbb {R}\) be locally Lipschitz at \(F(p) \in \mathcal {N}\). Then the following chain rule holds at \(p \in \mathcal {M}\):

$$\begin{aligned} \partial (g \circ F)(p) \subseteq \partial g(F(p)) \mathrm {d}F(p), \end{aligned}$$
(B.13)

where the inclusion means that for every \(\xi \in \partial (g \circ F)(p)\), there exists \(\zeta \in \partial g(F(p))\) such that \(\left\langle \xi , v \right\rangle = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \) for every \(v \in T_p\mathcal {M}\).
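
As a toy Euclidean instance of (B.13) (our own illustration, not taken from the paper), let \(\mathcal {M}= \mathcal {N}= \mathbb {R}\), \(F(x) = x^2\) and \(g(y) = |y|\). Then \(g \circ F\) is smooth and

$$\begin{aligned} \partial (g \circ F)(0) = \{0\} \subseteq \partial g(F(0))\, \mathrm {d}F(0) = [-1, 1] \cdot 0 = \{0\}, \end{aligned}$$

whereas for \(F(x) = x\) both sides equal \([-1, 1]\); in general only the inclusion is asserted.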

Proof

Let \((U, \varphi )\) and \((V, \psi )\) be charts near \(p \in \mathcal {M}\) and \(F(p) \in \mathcal {N}\) with \(\varphi (p) = 0\) and \(\psi (F(p)) = 0\), respectively. We denote by \(G_\mathcal {N}, G_\mathcal {M}\) the metric matrices on \(\mathcal {N}, \mathcal {M}\) defined above, respectively. Define \({\hat{g}} := g \circ \psi ^{-1}\) and \({\hat{F}} := \psi \circ F \circ \varphi ^{-1}\). Then, Theorem 2.3.10 in [25] yields that

$$\begin{aligned} \partial ({\hat{g}} \circ {\hat{F}})(0) \subseteq \partial {\hat{g}}({\hat{F}}(0)) \mathrm {d}{\hat{F}}(0). \end{aligned}$$

Fix \(\xi \in \partial (g \circ F)(p)\), then from (2.2) we know \({\hat{\xi }} := G_\mathcal {M}\mathrm {d}\varphi |_p[\xi ] \in \partial ({\hat{g}} \circ {\hat{F}})(0)\). The above display shows that there exists \({\hat{\zeta }} \in \partial {\hat{g}}({\hat{F}}(0))\) such that for every \({\hat{v}} \in \mathbb {R}^n\), it holds that \( \hat{\xi }^\top {\hat{v}} = {\hat{\zeta }}^\top (\mathrm {d}{\hat{F}}(0)[{\hat{v}}]) \). Let \(v := (\mathrm {d}\varphi |_p)^{-1}[{\hat{v}}]\), then \({\hat{\xi }}^\top {\hat{v}} = (\mathrm {d}\varphi |_p[\xi ])^\top G_\mathcal {M}(\mathrm {d}\varphi |_p[v]) = \left\langle \xi , v \right\rangle \). Since \(\zeta := (\mathrm {d}\psi |_{F(p)})^{-1} G_\mathcal {N}^{-1} {\hat{\zeta }} \in \partial g(F(p))\) and \(\mathrm {d}{\hat{F}} = \mathrm {d}\psi \circ \mathrm {d}F \circ (\mathrm {d}\varphi )^{-1}\), we find that \({\hat{\zeta }}^\top (\mathrm {d}{\hat{F}}(0)[{\hat{v}}]) = (\mathrm {d}\psi |_{F(p)}[\zeta ])^\top G_\mathcal {N}( \mathrm {d}\psi |_{F(p)} [\mathrm {d}F|_p[v]]) = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \). Therefore, \(\left\langle \xi , v \right\rangle = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \) for every \(v \in T_p \mathcal {M}\), and hence \(\xi \in \partial g(F(p)) \mathrm {d}F(p)\). \(\square \)

Lemma B.5

Let \(\mathcal {M}\) be a Riemannian manifold, X be a vector field on \(\mathcal {M}\) that is locally Lipschitz at \(p \in \mathcal {M}\), and Y be a smooth vector field on \(\mathcal {M}\). Define \(f := \left\langle X, Y \right\rangle \), then the following chain rule holds.

$$\begin{aligned} \partial f(p) = \left\langle \partial X(p), Y(p) \right\rangle + \left\langle X(p), \nabla Y(p) \right\rangle , \end{aligned}$$
(B.14)

where \(\xi \in \partial f(p)\) is regarded as the operator \(\zeta \mapsto \left\langle \xi , \zeta \right\rangle \) for \(\zeta \in T_p\mathcal {M}\), and the right-hand side is understood as the set of operators \(\zeta \mapsto \left\langle H\zeta , Y(p) \right\rangle + \left\langle X(p), \nabla _\zeta Y(p) \right\rangle \) for \(\zeta \in T_p\mathcal {M}\) and \(H \in \partial X(p)\).

Proof

Let \((U_p, \varphi )\) be a chart such that \(\varphi ^{-1}\) is the normal coordinate. Suppose that \(X = \sum _{i=1}^n X^i E_i\) and \(Y = \sum _{i=1}^n Y^i E_i\); we define \({\hat{f}} := f \circ \varphi ^{-1}\), \({\hat{X}}^j := X^j \circ \varphi ^{-1}\), \({\hat{X}} := ({\hat{X}}^1, \ldots , {\hat{X}}^n)\), and \({\hat{Y}}\) likewise. Thus, \({\hat{f}}(x) = {\hat{X}}(x)^\top G(x) {\hat{Y}}(x)\) for \(x \in {\hat{U}}_p := \varphi (U_p)\), and \({\hat{f}}\) can be regarded as the composition \(\mathcal {A}\circ {\tilde{X}}\), where \(\mathcal {A}(x, v) := v^\top G(x) {\hat{Y}}(x)\) and \({\tilde{X}}(x) := (x, {\hat{X}}(x))\) for \(x \in {\hat{U}}_p\) and \(v \in \mathbb {R}^n\). Since \(\mathcal {A}\) is smooth and \(G(0) = I_n, \nabla G(0) = 0\), we know \(\nabla \mathcal {A}(0, v)[w, \xi ] = v^\top \nabla {\hat{Y}}(0)[w] + \xi ^\top {\hat{Y}}(0)\) for \(w, \xi \in \mathbb {R}^n\). Note that \(\partial {\tilde{X}} = (\mathrm {Id}, \partial {\hat{X}})\), then the chain rule in Euclidean spaces [25, Theorem 2.6.6] implies that

$$\begin{aligned} \partial {\hat{f}}(0) = \nabla \mathcal {A}(0, {\hat{X}}(0))[\mathrm {Id}, \partial {\hat{X}}(0)] = {\hat{X}}(0)^\top \nabla {\hat{Y}}(0) + {\hat{Y}}(0)^\top \partial {\hat{X}}(0). \end{aligned}$$
(B.15)

We define the linear bijection \(J: \mathcal {L}(\mathbb {R}^n) \rightarrow \mathcal {L}(T_p\mathcal {M})\) as

$$\begin{aligned} (J{\hat{H}})v := \sum _{i=1}^n [{\hat{H}} {\hat{v}}]_i E_i(p), \end{aligned}$$

where \({\hat{H}} \in \mathcal {L}(\mathbb {R}^n)\), \(v \in T_p \mathcal {M}\) and \({\hat{v}} := \mathrm {d}\varphi |_p[v]\). Below we show that \(J(\partial {\hat{X}}(0)) = \partial X(p)\).

Suppose \({\hat{H}} \in \partial _B {\hat{X}}(0)\), then there exists \(x_k \rightarrow 0\) such that \({\hat{X}}\) is differentiable at \(x_k\) and \(\nabla {\hat{X}}(x_k) \rightarrow {\hat{H}}\) as \(k\rightarrow \infty \). Let \(H := J{\hat{H}}\), \({\hat{v}} \in \mathbb {R}^n\), \(v := (\mathrm {d}\varphi |_p)^{-1}[{\hat{v}}]\), \(p_k := \varphi ^{-1}(x_k)\) and \(v_k := (\mathrm {d}\varphi |_{p_k})^{-1}[{\hat{v}}]\), then \(\nabla {\hat{X}}^i(x_k)[{\hat{v}}] = \mathrm {d}{\hat{X}}^i|_{x_k}[{\hat{v}}] = \mathrm {d}X^i|_{p_k}[v_k] = v_kX^i(p_k)\), and therefore

$$\begin{aligned} \nabla _{v_k} X(p_k) = \sum _{i=1}^n \nabla {\hat{X}}^i(x_k)[{\hat{v}}] E_i + \sum _{i, j, l=1}^n v_k^i X^j \varGamma _{ij}^l E_l, \end{aligned}$$
where \(v_k = \sum _{i=1}^n v_k^i E_i(p_k)\) and the right-hand side is evaluated at \(p_k\).

Letting \(k \rightarrow \infty \) and noting \(\varGamma _{ij}^l(p) = 0\), we find \(\nabla _{v_k} X(p_k) \rightarrow \sum _{i=1}^n [{\hat{H}} {\hat{v}}]_i E_i(p) = Hv\). Since \({\hat{v}}\) is arbitrary, we conclude that \(\nabla X(p_k) \rightarrow H \in \partial _B X(p)\). By reversing this argument, we can show that \(J^{-1}H \in \partial _B {\hat{X}}(0)\) for every \(H \in \partial _B X(p)\). Then, \(J(\partial _B {\hat{X}}(0)) = \partial _B X(p)\), and by the linearity of J, we know \(J(\partial {\hat{X}}(0)) = \partial X(p)\).

Thus, \({\hat{Y}}(0)^\top ({\hat{H}}{\hat{v}}) = \sum _{i=1}^n Y^i(p) \langle J{\hat{H}}v, E_i(p) \rangle = \langle J{\hat{H}}v, Y(p) \rangle \) for every \({\hat{H}} \in \partial {\hat{X}}(0)\), \(v \in T_p \mathcal {M}\) and \({\hat{v}} := \mathrm {d}\varphi |_p[v]\). Similarly, it holds that \({\hat{X}}(0)^\top \nabla {\hat{Y}}(0)[ {\hat{v}}] = \left\langle X(p), \nabla Y(p)[v] \right\rangle \). Besides, (2.2) yields \(\partial {\hat{f}}(0)[{\hat{v}}] = \partial f(p)[v]\). Therefore, the proof is completed by plugging these equations into (B.15). \(\square \)

Proof of Proposition 4.2

Since Theorem 4.4 shows that X is also locally Lipschitz at p, \(\partial X(p)\) is well-defined. Let \(U_p \subset \mathcal {M}\) be a neighborhood of p, \(\xi \in T_p\mathcal {M}\), and V be a smooth vector field on \({\bar{\mathcal {M}}}\) such that \(V_\top (q) = P_{pq}\xi \) for \(q \in U_p\), and let \(\iota : \mathcal {M}\hookrightarrow \bar{\mathcal {M}}\) be the embedding map. Define \({\bar{f}}_V := \left\langle {\bar{X}}, V \right\rangle \) and \(f_V := {\bar{f}}_V \circ \iota \); then Lemma B.4 shows that \(\partial f_V(p) \subseteq {\bar{\partial }} {\bar{f}}_V(p) \mathrm {d}\iota (p) = \bar{\partial } {\bar{f}}_V(p)|_{T_p \mathcal {M}}\).

Let \(v \in T_p \mathcal {M}\), then Lemma B.5 gives that \({\bar{\partial }} {\bar{f}}_V(p)[v] = \left\langle {\bar{\partial }} \bar{X}(p)[v], V(p) \right\rangle + \left\langle {\bar{X}}(p), {\bar{\nabla }}_v V(p) \right\rangle \). Since \(v, {\bar{X}}(p) \in T_p \mathcal {M}\), Proposition 8.4 in [48] shows that

$$\begin{aligned}\left\langle {\bar{X}}(p), {\bar{\nabla }}_v V_\perp (p) \right\rangle&= \left\langle {\bar{X}}(p), ({\bar{\nabla }}_v V_\perp (p))_\top \right\rangle = \left\langle X(p), \nabla _v V_\perp (p) \right\rangle \\&= -\left\langle \mathrm {II}(v, X(p)), V_\perp (p) \right\rangle . \end{aligned}$$

Note that \(\nabla _v V_\top (p) = \nabla _v (P_{pq}\xi )(p) = 0\), then \(\left\langle {\bar{X}}(p), {\bar{\nabla }}_v V_\top (p) \right\rangle = \left\langle X(p), \nabla _v V_\top (p) \right\rangle = 0\), and thus,

$$\begin{aligned} \left\langle {\bar{X}}(p), {\bar{\nabla }}_vV(p) \right\rangle= & {} \left\langle {\bar{X}}(p), {\bar{\nabla }}_vV_\perp (p) + \bar{\nabla }_vV_\top (p) \right\rangle \\= & {} -\left\langle \mathrm {II}(v,X(p)), V_\perp (p) \right\rangle \\= & {} -\left\langle \mathrm {II}(v,X(p)), V(p) \right\rangle , \end{aligned}$$

where the last equation follows from \(\left\langle \mathrm {II}(v, X(p)), V_\top (p) \right\rangle = 0\).

Combining these arguments, we know \(\partial f_V(p)[v]\! \subseteq \! \langle {\bar{\partial }} {\bar{X}}(p)[v] \!-\! \mathrm {II}(v, X(p)), V(p) \rangle \). Note that \({\bar{X}}(q) \in T_q \mathcal {M}\) for \(q \in U_p\), then \(f_V(q) = \left\langle X(q), V_\top (q) \right\rangle \) for \(q \in U_p\). Since \(\nabla _v V_\top (p) = 0\) and \(\partial X(p)[v] \subset T_p\mathcal {M}\), then Lemma B.5 yields \(\partial f_V(p)[v] = \left\langle \partial X(p)[v], V_\top (p) \right\rangle = \left\langle \partial X(p)[v], V(p) \right\rangle \). Therefore, we know

$$\begin{aligned} \left\langle \partial X(p)[v], V(p) \right\rangle \subseteq \langle {\bar{\partial }} {\bar{X}}(p)[v] - \mathrm {II}(v, X(p)), V(p) \rangle . \end{aligned}$$

Note that \(V(p)_\top = \xi \) and \(V(p)_\perp \) are arbitrary and \(\partial X(p)[v], {\bar{\partial }} {\bar{X}}(p)[v]\) are compact convex sets, the inclusion relationship (4.28) follows from [57, Corollary 13.1.1]. \(\square \)


Cite this article

Zhou, Y., Bao, C., Ding, C. et al. A semismooth Newton based augmented Lagrangian method for nonsmooth optimization on matrix manifolds. Math. Program. 201, 1–61 (2023). https://doi.org/10.1007/s10107-022-01898-1

