
A semismooth Newton based augmented Lagrangian method for nonsmooth optimization on matrix manifolds


Abstract

This paper is devoted to studying an augmented Lagrangian method for solving a class of manifold optimization problems, which have nonsmooth objective functions and nonlinear constraints. Under the constant positive linear dependence condition on manifolds, we show that the proposed method converges to a stationary point of the nonsmooth manifold optimization problem. Moreover, we propose a globalized semismooth Newton method to solve the augmented Lagrangian subproblem on manifolds efficiently. The local superlinear convergence of the manifold semismooth Newton method is also established under some suitable conditions. We also prove that the semismoothness on submanifolds can be inherited from that in the ambient manifold. Finally, numerical experiments on compressed modes and (constrained) sparse principal component analysis illustrate the advantages of the proposed method.


Notes

  1. The subscript p in \(\left\langle \cdot , \cdot \right\rangle _p\) is usually omitted for simplicity.

  2. An affine connection is \(\mathcal {D}(\mathcal {M})\)-linear w.r.t. X, \(\mathbb {R}\)-linear w.r.t. Y, and satisfies the product rule; see Chapter 5 in [2].

  3. We say a function f on a manifold is locally Lipschitz if \(f \circ \varphi ^{-1}\) is locally Lipschitz on \(\varphi (U)\) for every chart \((U, \varphi )\).

  4. This is well-defined in a small neighborhood of q [19, Proposition 3.2.9].

  5. Let \(g: \mathcal {M}\rightarrow \mathbb {R}^m\) and \(f: \mathbb {R}^m \rightarrow \mathbb {R}\) be two smooth maps. We can use Definition 2.4 to compute the gradient of \(f \circ g\). For \(p \in \mathcal {M}\) and \(\xi _p \in T_p\mathcal {M}\), choosing any curve \(\gamma : (-1, 1) \rightarrow \mathcal {M}\) such that \(\gamma (0) = p\) and \({\dot{\gamma }}(0) = \xi _p\), we have \(\left\langle \xi _p, {{\,\mathrm{grad}\,}}\,(f\circ g)(p) \right\rangle = \xi _p (f \circ g) = \left. \frac{\mathrm {d}(f \circ g \circ \gamma )(t)}{\mathrm {d}t}\right| _{t=0} = \sum _{i=1}^m\alpha _i\left. \frac{\mathrm {d}(g_i \circ \gamma )(t)}{\mathrm {d}t}\right| _{t=0} = \sum _{i=1}^m \alpha _i \cdot (\xi _p g_i) = \sum _{i=1}^m \alpha _i \left\langle \xi _p, {{\,\mathrm{grad}\,}}\,[g(p)]_i \right\rangle \), where \(\alpha := \nabla f(g(p))\) and \(g_i\) is the i-th component of g. Thus, \({{\,\mathrm{grad}\,}}\, (f \circ g)(p) = \sum _{i=1}^m \alpha _i {{\,\mathrm{grad}\,}}\, [g(p)]_i\).

  6. We say the map \(\mathcal {K}\) is upper semicontinuous if for every \(\varepsilon > 0\) there exists \( \delta >0\) such that for all \(q \in B_\delta (p)\) we have \(P_{qp}\mathcal {K}(q) \subset \mathcal {K}(p) + {\hat{B}}_\varepsilon (0)\), where \({\hat{B}}_\varepsilon (0) := \{ V \in \mathcal {L}(T_p\mathcal {M}) : \left\| V \right\| < \varepsilon \}\).

  7. Since \(\gamma \) is a geodesic, its velocity field is parallel along \(\gamma \), i.e., \(\nabla _{{\dot{\gamma }}} {\dot{\gamma }} = 0\), so \(\frac{\mathrm {d}}{\mathrm {d}t} \Vert {\dot{\gamma }}(t) \Vert ^2 = 2\left\langle \nabla _{{\dot{\gamma }}}{\dot{\gamma }}, {\dot{\gamma }} \right\rangle \equiv 0\). Therefore, we have \(\Vert {\dot{\gamma }}(t)\Vert = \Vert {\dot{\gamma }}(0) \Vert = \Vert V \Vert \) and \(\ell (\gamma |_{[s,t]}) = \int _s^t \Vert {\dot{\gamma }}(u) \Vert \mathrm {d}u = |t - s| \Vert {\dot{\gamma }}(0)\Vert \).

  8. We only consider our method in this setting since the other algorithms mentioned in Table 2 have difficulty reaching these stopping conditions. These observations remain valid, though less pronounced, under the original termination conditions.

  9. These are the so-called normal coordinates (see, e.g., [19, p. 86] and [48, p. 131]).

References

  1. Absil, P.-A., Hosseini, S.: A collection of nonsmooth Riemannian optimization problems. In: Nonsmooth Optimization and its Applications, Springer, pp. 1–15 (2019)

  2. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)


  3. Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22, 135–158 (2012)


  4. Adler, R.L., Dedieu, J.-P., Margulies, J.Y., Martens, M., Shub, M.: Newton’s method on Riemannian manifolds and a geometric model for the human spine. IMA J. Numer. Anal. 22, 359–390 (2002)


  5. Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18, 1286–1309 (2008)


  6. Andreani, R., Haeser, G., Schuverdt, M.L., Silva, P.J.: A relaxed constant positive linear dependence constraint qualification and applications. Math. Program. 135, 255–273 (2012)


  7. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)


  8. Azagra, D., Ferrera, J., López-Mesas, F.: Nonsmooth analysis and Hamilton-Jacobi equations on Riemannian manifolds. J. Funct. Anal. 220, 304–361 (2005)


  9. Bacák, M., Bergmann, R., Steidl, G., Weinmann, A.: A second order nonsmooth variational model for restoring manifold-valued images. SIAM J. Sci. Comput. 38, A567–A597 (2016)


  10. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (1982)


  11. Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48, 334–334 (1997)


  12. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)


  13. Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer, Berlin (2000)


  14. Borckmans, P.B., Selvan, S.E., Boumal, N., Absil, P.-A.: A Riemannian subgradient algorithm for economic dispatch with valve-point effect. J. Comput. Appl. Math. 255, 848–866 (2014)


  15. Boumal, N., Absil, P.-A.: RTRMC: a Riemannian trust-region method for low-rank matrix completion. In: Advances in Neural Information Processing Systems, vol. 24 (2011)

  16. Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)


  17. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011)


  18. Cai, J.-F., Liu, H., Wang, Y.: Fast rank-one alternating minimization algorithm for phase retrieval. J. Sci. Comput. 79, 128–147 (2019)


  19. Carmo, M.P.: Riemannian Geometry. Birkhäuser, London (1992)


  20. Chen, S., Ma, S., Man-Cho So, A., Zhang, T.: Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 30, 210–239 (2020)


  21. Chen, S., Ma, S., Xue, L., Zou, H.: An alternating manifold proximal gradient method for sparse principal component analysis and sparse canonical correlation analysis. INFORMS J. Optim. 2, 192–208 (2020)


  22. Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _1\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38, B570–B592 (2016)


  23. Chen, X., Guo, L., Lu, Z., Ye, J.J.: An augmented Lagrangian method for non-Lipschitz nonconvex programming. SIAM J. Numer. Anal. 55, 168–193 (2017)


  24. Cho, M., Lee, J.: Riemannian approach to batch normalization. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  25. Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1990)


  26. Curtis, F.E., Jiang, H., Robinson, D.P.: An adaptive augmented Lagrangian method for large-scale constrained optimization. Math. Program. 152, 201–245 (2015)


  27. Daniilidis, A., Deville, R., Durand-Cartagena, E., Rifford, L.: Self-contracted curves in Riemannian manifolds. J. Math. Anal. Appl. 457, 1333–1352 (2018)


  28. de Oliveira, F.R., Ferreira, O.P.: Newton method for finding a singularity of a special class of locally Lipschitz continuous vector fields on Riemannian manifolds. J. Optim. Theory Appl. 185, 522–539 (2020)


  29. de Oliveira, F.R., Oliveira, F.R.: A global Newton method for the nonsmooth vector fields on Riemannian manifolds. J. Optim. Theory Appl. 2, 1–15 (2021)


  30. Deng, K., Peng, Z.: An inexact augmented Lagrangian method for nonsmooth optimization on Riemannian manifold, arXiv preprint arXiv:1911.09900 (2019)

  31. Dirr, G., Helmke, U., Lageman, C.: Nonsmooth Riemannian optimization with applications to sphere packing and grasping. In: Lagrangian and Hamiltonian Methods for Nonlinear Control, Springer, vol. 2007, pp. 29–45 (2006)

  32. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, Berlin (2007)


  33. Ferreira, O., Oliveira, P.: Proximal point algorithm on Riemannian manifolds. Optimization 51, 257–270 (2002)


  34. Gabay, D.: Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl. 37, 177–219 (1982)


  35. Ghahraei, E., Hosseini, S., Pouryayevali, M.R.: Pseudo-Jacobian and characterization of monotone vector fields on Riemannian manifolds. J. Convex Anal. 24, 149–168 (2017)


  36. Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms, vol. 2. Springer, Berlin (1993)


  37. Hiriart-Urruty, J.-B., Strodiot, J.-J., Nguyen, V.H.: Generalized Hessian matrix and second-order optimality conditions for problems with \(C^{1,1}\) data. Appl. Math. Optim. 11, 43–56 (1984)


  38. Hosseini, S., Pouryayevali, M.: Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal. Theory Methods Appl. 74, 3884–3895 (2011)


  39. Hu, J., Liu, X., Wen, Z.-W., Yuan, Y.-X.: A brief introduction to manifold optimization. J. Oper. Res. Soc. China 8, 199–248 (2020)


  40. Hu, J., Milzarek, A., Wen, Z., Yuan, Y.: Adaptive quadratically regularized Newton method for Riemannian optimization. SIAM J. Matrix Anal. Appl. 39, 1181–1207 (2018)


  41. Huang, W., Absil, P.-A., Gallivan, K.A.: A Riemannian symmetric rank-one trust-region method. Math. Program. 150, 179–216 (2015)


  42. Huang, W., Wei, K.: Extending FISTA to Riemannian optimization for sparse PCA, arXiv preprint arXiv:1909.05485 (2019)

  43. Huang, W., Wei, K.: Riemannian proximal gradient methods. Math. Program. 5, 1–43 (2021)


  44. Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: A generic algorithm for non-smooth optimization on manifolds. In: European Conference on Computer Vision, Springer, pp. 680–696 (2016)

  45. Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58, 431–449 (2014)


  46. Lai, R., Wen, Z., Yin, W., Gu, X., Lui, L.M.: Folding-free global conformal mapping for genus-0 surfaces by harmonic energy minimization. J. Sci. Comput. 58, 705–725 (2014)


  47. Lee, J.M.: Introduction to Smooth Manifolds. Springer, Berlin (2012)


  48. Lee, J.M.: Introduction to Riemannian Manifolds. Springer, Berlin (2018)


  49. Li, X., Sun, D., Toh, K.-C.: A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM J. Optim. 28, 433–458 (2018)


  50. Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135, 149–193 (2012)


  51. Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control. Optim. 15, 959–972 (1977)


  52. Montanari, A., Richard, E.: Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Trans. Inf. Theory 62, 1458–1484 (2015)


  53. Ozoliņš, V., Lai, R., Caflisch, R., Osher, S.: Compressed modes for variational problems in mathematics and physics. Proc. Natl. Acad. Sci. U.S.A. 110, 18368–18373 (2013)


  54. Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)


  55. Qi, L., Wei, Z.: On the constant positive linear dependence condition and its application to SQP methods. SIAM J. Optim. 10, 963–981 (2000)


  56. Rampazzo, F., Sussmann, H.J.: Commutators of flow maps of nonsmooth vector fields. J. Differ. Equ. 232, 134–175 (2007)


  57. Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1970)


  58. Spivak, M.D.: A Comprehensive Introduction to Differential Geometry, vol. 2, 3rd edn. Publish or Perish, New York (1999)


  59. Sun, D., Sun, J.: Semismooth matrix-valued functions. Math. Oper. Res. 27, 150–169 (2002)


  60. Sun, D., Sun, J., Zhang, L.: The rate of convergence of the augmented Lagrangian method for nonlinear semidefinite programming. Math. Program. 114, 349–391 (2008)


  61. Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23, 1214–1236 (2013)


  62. Waldspurger, I., d’Aspremont, A., Mallat, S.: Phase recovery, maxcut and complex semidefinite programming. Math. Program. 149, 47–81 (2015)


  63. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)


  64. Xiao, X., Li, Y., Wen, Z., Zhang, L.: A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput. 76, 364–389 (2018)


  65. Yang, L., Sun, D., Toh, K.-C.: SDPNAL\(+\): A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Math. Program. Comput. 7, 331–366 (2015)


  66. Yang, W.H., Zhang, L.-H., Song, R.: Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pac. J. Optim. 10, 415–434 (2014)


  67. Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4592–4600 (2016)

  68. Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory, pp. 1617–1638 (2016)

  69. Zhao, X.-Y., Sun, D., Toh, K.-C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20, 1737–1765 (2010)


  70. Zhu, H., Zhang, X., Chu, D., Liao, L.-Z.: Nonconvex and nonsmooth optimization with generalized orthogonality constraints: an approximate augmented Lagrangian method. J. Sci. Comput. 72, 331–372 (2017)


  71. Zhu, X., Sato, H.: Riemannian conjugate gradient methods with inverse retraction. Comput. Optim. Appl. 77, 779–810 (2020)


  72. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Gr. Stat. 15, 265–286 (2006)



Acknowledgements

The authors would like to thank the two anonymous referees and the associate editor for their valuable comments and constructive suggestions, which have greatly improved the quality and the presentation of this paper.


Corresponding author

Correspondence to Chenglong Bao.


The work was supported in part by the National Natural Science Foundation of China (11901338, 61620106010) and the Tsinghua University Initiative Scientific Research Program. The research of the third author was supported in part by the National Natural Science Foundation of China (12071464) and the Beijing Natural Science Foundation (Z190002).

Appendices

Appendix A: The deferred proofs in Sects. 4.1 and 4.2

A.1 Proof of Lemma 4.2

Proof

From the compactness of U, there exist \(C, K, r_0 > 0\) such that Lemma 4.1 holds for every \(q \in U\). By shrinking the constant \(r_0\) if necessary, we may also assume that for every \(q \in U\), \(R_q v\) is defined for every \(v \in T_q\mathcal {M}\) with \(\Vert v \Vert < r_0\). For a fixed point \(q \in \mathcal {M}\), let \(\{ e_i \}_{i \in [n]}\) be an orthonormal basis of \(T_q \mathcal {M}\). We define the isomorphism \(\psi _q: \mathbb {R}^n \rightarrow T_q \mathcal {M}\) as \(\psi _q(w_1, \ldots , w_n) = \sum _{i=1}^n w_i e_i\), and the chart \(\varphi _q: B_{r_0}(q) \rightarrow \mathbb {R}^n\) as \(\varphi _q\left( \exp _q \psi _q(w) \right) = w\), where \(\Vert w\Vert _{\mathbb {R}^n} < r_0\) (see footnote 9), which implies that \((\psi _q \circ \varphi _q )(p) = \exp _q^{-1} p\) for every \(p \in B_{r_0}(q)\).

For \(\Vert v \Vert < r_0\), we define the curve \(\gamma (t) := R_q (tv)\), then

$$\begin{aligned} d(R_q v, q) \le \ell (\gamma |_{[0, 1]}) = \int _0^1 \Vert {\dot{\gamma }}(t) \Vert \mathrm {d}t = \int _0^1 \Vert \mathrm {d}R_q|_{tv} [v] \Vert \mathrm {d}t \le {\tilde{C}}_q \Vert v \Vert , \end{aligned}$$
(A.1)

where \({\tilde{C}}_q := \sup _{\Vert v\Vert < r_0} \Vert \mathrm {d}R_q|_v \Vert \). Since \(R_q v\) is \(C^2\) with respect to q and v and U is compact, we know \({\tilde{C}} := \sup _{q \in U} {\tilde{C}}_q < \infty \). Thus, \(\exp ^{-1}_q R_q v\) is defined for \(q \in U\) and \(\Vert v \Vert < r := r_0 / {\tilde{C}}\).

Let \({\hat{B}}_{r} := \{ w \in \mathbb {R}^n : \Vert w \Vert _{\mathbb {R}^n} < r \}\), and \({\hat{R}}_q, {\hat{E}}_q: {\hat{B}}_{r} \rightarrow \mathbb {R}^n\) be the maps \(\varphi _q \circ R_q \circ \psi _q|_{{\hat{B}}_r}\), \(\varphi _q \circ \exp _q \circ \psi _q|_{{\hat{B}}_r}\), respectively. Since the differentials of \(\exp _q\) and \(R_q\) are the same at 0, then the differentials of \({\hat{E}}_q\) and \({\hat{R}}_q\) also coincide at 0. Let \({\hat{D}}_q := {\hat{E}}_q - {\hat{R}}_q\), then Taylor’s theorem states that for any \(w \in {\hat{B}}_r\), there exists \(t_w \in (0, 1)\) such that \(\Vert {\hat{D}}_q(w) \Vert _{\mathbb {R}^n} = \frac{1}{2} \Vert \nabla ^2 {\hat{D}}_q(t_w w)[w, w] \Vert _{\mathbb {R}^n} \le {\hat{C}}_q \Vert w \Vert ^2_{\mathbb {R}^n}\), where \({\hat{C}}_q := \frac{1}{2}\sup _{ \Vert w \Vert _{\mathbb {R}^n} < r} \Vert \nabla ^2 {\hat{D}}_q(w) \Vert \).

Since \(\psi _q\) is a linear isomorphism, then \((\psi _q \circ {\hat{D}}_q \circ \psi _q^{-1})(v) = (\psi _q \circ (\varphi _q \circ \exp _q - \varphi _q \circ R_q))(v) = (\psi _q \circ \varphi _q \circ \exp _q)(v) - (\psi _q \circ \varphi _q \circ R_q)(v) = v - \exp ^{-1}_q R_q v\). Therefore, by Lemma 4.1, we have

$$\begin{aligned} d(R_q v, \exp _q v)^2 \le \Vert v - \exp ^{-1}_q R_q v \Vert ^2 + K \Vert \exp ^{-1}_q R_q v \Vert ^2 \Vert v \Vert ^2 \le ({\hat{C}}_q^2+ {\tilde{C}}^2 K ) \Vert v \Vert ^4 \end{aligned}$$

for every \(v \in T_q \mathcal {M}\) with \(\Vert v \Vert < r\), where the last inequality follows from (A.1) and \(\Vert \exp ^{-1}_q R_q v \Vert = d(R_q v, q)\). Finally, since \(R_q v\) and \(\exp _q v\) are both \(C^2\) with respect to q and v, \({\hat{C}}_q\) is uniformly bounded on the compact set U. Thus, the required constant C exists and is finite. \(\square \)
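
As a quick numerical illustration of the second-order agreement established above (our own sketch, not part of the paper), consider the unit sphere \(S^2 \subset \mathbb {R}^3\) with the closed-form exponential map and the projection-type retraction \(R_q(v) = (q + v)/\Vert q + v\Vert \); the ratio \(d(R_q v, \exp _q v)/\Vert v\Vert ^2\) stays bounded as \(v \rightarrow 0\), as the lemma predicts:

import numpy as np

def sphere_exp(q, v):
    # exponential map on the unit sphere: exp_q(v) = cos(||v||) q + sin(||v||) v / ||v||
    nv = np.linalg.norm(v)
    return np.cos(nv) * q + np.sin(nv) * v / nv

def sphere_retr(q, v):
    # projection-type retraction: R_q(v) = (q + v) / ||q + v||
    return (q + v) / np.linalg.norm(q + v)

def sphere_dist(a, b):
    # geodesic distance on the unit sphere
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

q = np.array([0.0, 0.0, 1.0])
w = np.array([0.6, 0.8, 0.0])      # unit tangent vector at q
for s in [0.5, 0.1, 0.02, 0.004]:
    v = s * w                      # ||v|| = s
    ratio = sphere_dist(sphere_retr(q, v), sphere_exp(q, v)) / s**2
    print(s, ratio)                # ratios stay bounded as s -> 0

For this particular retraction the printed ratios even decay, i.e., the agreement happens to be of higher than second order, which is consistent with (but stronger than) the bound in the lemma.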

A.2 Proof of Lemma 4.4

Proof

Define \({\hat{\varphi }} = \varphi \circ \gamma \). We know that \(\hat{\varphi }\) is continuously differentiable with Lipschitz gradient on [0, 1]. By Theorem 2.3 in [37], there exist \(\xi \in (0, 1)\) and \({\hat{M}}_\xi \in \partial ^2 {\hat{\varphi }}(\xi )\) such that \({\hat{\varphi }}(1) - {\hat{\varphi }}(0) = {\hat{\varphi }}^\prime (0) + \frac{{\hat{M}}_\xi }{2}\). Since \({\hat{\varphi }}^\prime (t) = \left\langle {{\,\mathrm{grad}\,}}\varphi (\gamma (t)), {\dot{\gamma }}(t) \right\rangle \) and \(\nabla _{\dot{\gamma }}{\dot{\gamma }} = 0\), we find \({\hat{\varphi }}^{\prime \prime }(t) = \left\langle \nabla _{{\dot{\gamma }}(t)} {{\,\mathrm{grad}\,}}\varphi (\gamma (t)), {\dot{\gamma }}(t) \right\rangle \) whenever \({\hat{\varphi }}^\prime \) is differentiable at t. From the linearity of the parallel transport, it suffices to consider the case \({\hat{M}}_\xi \in \partial ^2_B {\hat{\varphi }}(\xi )\), i.e., there exists a sequence \(\{ t_k \} \rightarrow \xi \) such that \({\hat{\varphi }}^{\prime \prime }\) exists at each \(t_k\) and \({\hat{\varphi }}^{\prime \prime }(t_k) \rightarrow {\hat{M}}_\xi \). By the isometry property of the parallel transport, \(\hat{\varphi }^{\prime \prime }(t_k) = \langle \nabla _{ {\dot{\gamma }}(t_k)} {{\,\mathrm{grad}\,}}\varphi (\gamma (t_k)) , {\dot{\gamma }}(t_k) \rangle = \langle P_{\gamma }^{t_k \rightarrow \xi } \nabla _{P_\gamma ^{\xi \rightarrow t_k} \dot{\gamma }(\xi )} {{\,\mathrm{grad}\,}}\varphi (\gamma (t_k)) , {\dot{\gamma }}(\xi ) \rangle \), so there is \(M_\xi \in \partial {{\,\mathrm{grad}\,}}\varphi (\gamma (\xi ))\) such that \({\hat{M}}_\xi = \left\langle M_\xi {\dot{\gamma }}(\xi ), \dot{\gamma }(\xi ) \right\rangle \). Since \({\dot{\gamma }}(\xi ) = P_\gamma ^{0\rightarrow \xi }{\dot{\gamma }}(0)\), the conclusion holds. \(\square \)

A.3 Proof of Lemma 4.5

Proof

Suppose the conclusion does not hold. Then there exist \(\delta >0\) and a sequence \(v_k \in T_p\mathcal {M}\) with \(v_k \rightarrow 0\) such that

$$\begin{aligned} \Vert P_{\exp _p v_k, p}X(\exp _p v_k) - \nabla X(p; v_k) \Vert \ge \delta \left\| v_k \right\| . \end{aligned}$$

By Lemma 4.1 and the locally Lipschitz condition on X, there exists \(r_0>0\) such that for every \(p_1,p_2\in B_{r_0}(p)\), there exists a unique shortest geodesic joining \(p_1\) and \(p_2\), and X is L-Lipschitz in \(B_{r_0}(p)\). Fix \(0<r<r_0\); by passing to a subsequence of \(\{v_k\}\), we may assume that \({\bar{v}}_k := \frac{rv_k}{\left\| v_k \right\| } \rightarrow v \in T_p\mathcal {M}\) with \(\left\| v \right\| = r\). We define \(t_k = \left\| v_k \right\| /r\), \(q_k = \exp _p v_k\), \(r_k = \exp _p \bar{v}_k\) and \(s_k = \exp _p t_k v\). Then it holds that \(\{q_k,r_k,s_k\}\subset B_{r_0}(p)\), and hence we have

$$\begin{aligned}&\Vert P_{q_k,p}X(q_k) - \nabla X(p;v_k)\Vert \nonumber \\&\quad \le \Vert P_{q_k,p} X(q_k)-P_{s_k,p} X(s_k)\Vert + \Vert P_{s_k,p} X(s_k) - \nabla X(p;v_k)\Vert \nonumber \\&\quad \le \underbrace{\Vert P_{p,s_k}P_{q_k,p} X(q_k) - P_{q_k,s_k} X(q_k)\Vert }_{\mathrm{(A)}} + \underbrace{\Vert P_{q_k,s_k} X(q_k) - X(s_k)\Vert }_{\mathrm{(B)}}\nonumber \\&\qquad + \underbrace{\Vert P_{s_k,p} X(s_k) - t_k \nabla X(p;v)\Vert }_\mathrm{(C)} + \underbrace{\Vert \nabla X(p;v_k) - t_k\nabla X(p;v)\Vert }_\mathrm{(D)}. \end{aligned}$$
(A.2)

From Lemma 10 in [41], there exists \(C>0\) such that

$$\begin{aligned} \mathrm{(A)} \le \Vert P_{s_k,q_k}P_{p,s_k}P_{q_k,p} - {\mathrm{id}}\Vert \Vert X(q_k)\Vert \le C\max (d(p,q_k), d(q_k,s_k)) \Vert X(q_k)\Vert = o(t_k),\nonumber \\ \end{aligned}$$
(A.3)

since \(\Vert X(q_k)\Vert \le L \Vert v_k\Vert \rightarrow 0\) as \(v_k\rightarrow 0\). Moreover, since X is locally Lipschitz and directionally differentiable at p, using Lemma 4.1, as \(k\rightarrow \infty \), we have

$$\begin{aligned} \mathrm{(B)} \le L d(q_k,s_k) = o(t_k), \quad \mathrm{(C)} = \Vert P_{s_k,p} X(s_k) - X(p) - t_k\nabla X(p;v)\Vert = o(t_k),\nonumber \\ \end{aligned}$$
(A.4)

where the first estimate follows from Definition 2.10 and \(\ell (\gamma ) = d(q_k, s_k)\). Let \(q_k^t = \exp _p t{\bar{v}}_k\) and \(q_v^t = \exp _p tv\); from Lemma 10 in [41] and Lemma 4.1, as \(t \rightarrow 0\) we have

$$\begin{aligned} \begin{aligned} \Vert P_{q_k^t,p}X(q_k^t)-P_{q_v^t,p}X(q_v^t)\Vert&\le \Vert P_{p,q_v^t}P_{q_k^t,p}X(q_k^t)- P_{q_k^t,q_v^t}X(q_k^t)\Vert \\&\quad + \Vert P_{q_k^t,q_v^t}X(q_k^t)-X(q_v^t)\Vert \\&\le \Vert P_{q_v^t,q_k^t}P_{p,q_v^t}P_{q_k^t,p}-{\mathrm{id}}\Vert \Vert X(q_k^t)\Vert + L d(q_k^t,q_v^t) \\&= O(t^2) + O(t)\Vert {\bar{v}}_k - v\Vert , \end{aligned} \end{aligned}$$

where the constants in the big-O notation depend on p, v and \(\mathcal {M}\). An intermediate result of the above estimate is

$$\begin{aligned} \Vert \nabla X(p;{\bar{v}}_k) - \nabla X(p;v)\Vert \le O(\Vert {\bar{v}}_k - v\Vert ) = o(1). \end{aligned}$$
(A.5)

Thus, the last term in (A.2) is

$$\begin{aligned} \mathrm{(D)} = t_k\Vert \nabla X(p;{\bar{v}}_k) - \nabla X(p;v)\Vert = o(t_k) \end{aligned}$$
(A.6)

Combining (A.3), (A.4) and (A.6), we find that \(\Vert P_{q_k,p}X(q_k) - \nabla X(p;v_k)\Vert = o(t_k) = o(\Vert v_k\Vert )\), which contradicts our assumption. \(\square \)

A.4 Proof of Lemma 4.6

Proof

Define \(q_{k + 1} := R_{p_k} V_k\). Let \(B_r(p), K\) be the quantities in Lemma 4.1. Without loss of generality, we may assume \(p_k \in B_r(p)\) and \(\Vert V_k\Vert < r\) for every k, and shrink the ball \(B_r(p)\) such that (4.10) holds in \(B_r(p)\). Using Lemma 4.1, we know the following inequality holds for every geodesic triangle in \(B_r(p)\) with edge lengths a, b, c:

$$\begin{aligned} a^2 \le b^2 + c^2 - 2bc \cos A + Kb^2c^2 \le (b + c)^2 + Kb^2c^2, \end{aligned}$$

where A is the angle between b and c. Let \(a = d(p_k, q_{k+1})\), \(b = d(p_k, p)\), \(c = d(q_{k+1}, p)\), then we have

$$\begin{aligned} \limsup _{k \rightarrow \infty } \frac{d(p_k,q_{k+1})^2}{d(p_k,p)^2} \le K \limsup _{k \rightarrow \infty } d(q_{k+1}, p)^2 + \limsup _{k \rightarrow \infty } \left( 1 + \frac{d(q_{k+1},p)}{d(p_k,p)} \right) ^2 = 1.\nonumber \\ \end{aligned}$$
(A.7)

Define \(a = d(p_{k}, p)\), \(b = d(p_k, q_{k+1})\), \(c = d(q_{k + 1}, p)\), note that \(d(q_{k+1},p) = o(d(p_k, p))\), then for any \(\varepsilon > 0\) there exists \(k_\varepsilon > 0\) such that for any \(k \ge k_\varepsilon \), it holds that \(d(q_{k + 1}, p)^2 \le \varepsilon d(p_k, p)^2 \le 2\varepsilon d(q_{k + 1}, p)^2 + 2\varepsilon d(p_{k}, q_{k+1})^2 + \varepsilon K d(q_{k+1}, p)^2 d(p_k, q_{k+1})^2\). From (A.7) and the fact that \(p_k \rightarrow p\) as \(k \rightarrow \infty \), there exists \(K_0 > 0\) such that \(K d(p_k, q_{k+1})^2 \le 2K d(p_k, p)^2 \le 2\) for \(k \ge K_0\). Thus, for all \(\varepsilon \in (0, 1/8)\) and \(k \ge \max \{ k_\varepsilon , K_0 \}\), it holds

$$\begin{aligned} \begin{aligned} d(q_{k + 1}, p)^2 \le \frac{2\varepsilon }{1 - 2\varepsilon - \varepsilon K d(p_k, q_{k+1})^2} d(p_k, q_{k+1})^2 \le 4 \varepsilon d(p_k, q_{k+1})^2, \end{aligned} \end{aligned}$$

i.e., \(d(q_{k+1}, p)^2 = o( d(p_k, q_{k+1})^2)\). From (4.10), we have

$$\begin{aligned}d(p_k, q_{k+1}) \le d(p_k, \exp _{p_k} V_k) + d(\exp _{p_k} V_k, q_{k+1}) \le \Vert V_k \Vert + C \Vert V_k \Vert ^2,\end{aligned}$$

and hence it holds \(d(q_{k+1},p) = o( d(p_k, q_{k+1})) = o( \Vert V_k \Vert )\).

On the other hand,

$$\begin{aligned} \limsup _{k \rightarrow \infty } \frac{d(p_k,p)^2}{d(p_k,q_{k+1})^2} \le K \limsup _{k \rightarrow \infty } d(q_{k+1}, p)^2 + \limsup _{k \rightarrow \infty } \left( 1 + \frac{d(q_{k+1},p)}{d(p_k,q_{k+1})} \right) ^2 = 1. \end{aligned}$$

Combining with (A.7), we have \({\mathop {\lim }\limits _{k \rightarrow \infty }} \frac{d(p_k, p)}{d(p_k, q_{k+1})} = 1\). \(\square \)

Appendix B: The deferred proofs in Sect. 4.3

B.1 Proof of Theorem 4.4

We prove Theorem 4.4 in this section. First, we give a simple result that will be frequently used in our proof: For \(p \in {\bar{\mathcal {M}}}\), \(\xi \in T_p {\bar{\mathcal {M}}}\), and a curve \(\gamma : [0, 1] \rightarrow {\bar{\mathcal {M}}}\) with \(\gamma (0) = p\), the vector field \(Z(t) := {\bar{P}}_\gamma ^{0 \rightarrow t} \xi \) is parallel along \(\gamma \), and hence by the definition of parallel transport we have

$$\begin{aligned} {\bar{\nabla }}_{{\dot{\gamma }}} \bar{P}_\gamma ^{0 \rightarrow t} \xi = 0. \end{aligned}$$
(B.1)

Similarly, for \(p \in \mathcal {M}, \zeta \in T_p \mathcal {M}\) and \(\gamma : [0, 1] \rightarrow \mathcal {M}\), it holds that \(\nabla _{{\dot{\gamma }}} P^{0 \rightarrow t}_\gamma \zeta = 0\).

The following lemma compares the parallel transports of the manifold \(\mathcal {M}\) and its ambient space \({\bar{\mathcal {M}}}\).

Lemma B.1

There exists \(C > 0\) such that for every \(p \in \mathcal {M}\), and every smooth curve \(\gamma : [-1, 1] \rightarrow \mathcal {M}\) with \(\gamma (0) = p\), and every \(v \in T_p \mathcal {M}\), \(t \in [0, 1]\), it holds that

$$\begin{aligned} \big \Vert P_\gamma ^{0 \rightarrow t} v - \bar{P}_{\gamma }^{0 \rightarrow t} v \big \Vert \le C \ell (\gamma |_{[0,t]}) \Vert v \Vert . \end{aligned}$$
(B.2)

Moreover, there exists \(C^\prime > 0\) such that for every \(p \in \mathcal {M}\), \(w \in T_p \mathcal {M}\) with \(\Vert w\Vert \le 2\), and every \(t \in [0, 1]\), the following inequality holds for \(\gamma (t) := \exp _p(tw)\).

$$\begin{aligned} \big \Vert P_\gamma ^{0 \rightarrow t} v - \bar{P}_{\gamma }^{0 \rightarrow t} v - t \mathrm {II}({\dot{\gamma }}(t), P^{0 \rightarrow t}_\gamma v) \big \Vert \le C^\prime t^2. \end{aligned}$$
(B.3)

Proof

Let \(Y: [0, 1] \rightarrow T\mathcal {M}\) be a smooth vector field along \(\gamma \). For any fixed \(t \in [0, 1]\) and \(\xi \in T_{\gamma (t)} {\bar{\mathcal {M}}}\), it holds that

$$\begin{aligned} \begin{aligned} \left\langle \frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}^{s \rightarrow t}_\gamma Y(s), \xi \right\rangle&= \frac{\mathrm {d}}{\mathrm {d}s} \left\langle {\bar{P}}^{s \rightarrow t}_\gamma Y(s), \xi \right\rangle = \frac{\mathrm {d}}{\mathrm {d}s} \left\langle Y(s), {\bar{P}}^{t \rightarrow s}_\gamma \xi \right\rangle \\&= \left\langle {\bar{\nabla }}_{{\dot{\gamma }}} Y(s), {\bar{P}}^{t \rightarrow s}_\gamma \xi \right\rangle + \left\langle Y(s), {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}^{t \rightarrow s}_\gamma \xi \right\rangle \\&\overset{(\mathrm{B.1})}{=} \left\langle {\bar{P}}^{s \rightarrow t}_\gamma {\bar{\nabla }}_{{\dot{\gamma }}} Y(s), \xi \right\rangle , \end{aligned} \end{aligned}$$

Therefore, \(\frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}_\gamma ^{s \rightarrow t}Y(s) = \bar{P}_{\gamma }^{s \rightarrow t} {\bar{\nabla }}_{{\dot{\gamma }}} Y(s)\).

Setting \(Y(t) = P_\gamma ^{0 \rightarrow t} v\) for \(v \in T_p \mathcal {M}\), we know

$$\begin{aligned} {\bar{\nabla }}_{{\dot{\gamma }}} Y \overset{(4.22)}{=} \nabla _{{\dot{\gamma }}} Y + \mathrm {II}({\dot{\gamma }}, Y) \overset{(\mathrm{B.1})}{=} \mathrm {II}({\dot{\gamma }}, Y), \end{aligned}$$

and thus,

$$\begin{aligned} Y(t) - {\bar{P}}_{\gamma }^{0 \rightarrow t}Y(0) = \int _0^t \frac{\mathrm {d}}{\mathrm {d}s} {\bar{P}}_{\gamma }^{s \rightarrow t}Y(s) \mathrm {d}s = \int _0^t {\bar{P}}_{\gamma }^{s \rightarrow t}\mathrm {II}({\dot{\gamma }}(s), Y(s)) \mathrm {d}s. \end{aligned}$$
(B.4)

Since \(\mathrm {II}\) smoothly depends on the Riemannian metric of \({\bar{\mathcal {M}}}\) and \(\mathcal {M}\) is compact, the constant \(C := \sup \{ \Vert \mathrm {II}(u, v) \Vert : p \in \mathcal {M}, u, v \in T_p\mathcal {M}, \Vert u\Vert = \Vert v\Vert = 1 \}\) is finite. As the parallel transport is an isometry, it holds that \(\Vert Y(t) \Vert = \Vert Y(0)\Vert = \Vert v\Vert \) and

$$\begin{aligned} \begin{aligned} \big \Vert Y(t) - {\bar{P}}_{\gamma }^{0 \rightarrow t}Y(0) \big \Vert \le \int _0^t C\Vert v \Vert \Vert {\dot{\gamma }}(s) \Vert \mathrm {d}s = C\Vert v \Vert \ell (\gamma |_{[0,t]}). \end{aligned} \end{aligned}$$

Moreover, we have

$$\begin{aligned} \begin{aligned} t \mathrm {II}({\dot{\gamma }}(t), Y(t)) - [Y(t) - {\bar{P}}_{\gamma }^{0 \rightarrow t} Y(0)]&\overset{(\mathrm{B.4})}{=} \int _0^t \mathrm {II}({\dot{\gamma }}(t), Y(t)) - {\bar{P}}_{\gamma }^{s \rightarrow t}\mathrm {II}({\dot{\gamma }}(s), Y(s)) \mathrm {d}s \\&= \int _0^t \int _s^t \frac{\mathrm {d}}{\mathrm {d}u} \bar{P}_{\gamma }^{u \rightarrow t}\mathrm {II}({\dot{\gamma }}(u), Y(u)) \mathrm {d}u \mathrm {d}s\\&= \int _0^t \int _s^t {\bar{P}}_{\gamma }^{u \rightarrow t} {\bar{\nabla }}_{{\dot{\gamma }}} \mathrm {II}({\dot{\gamma }}(u), Y(u)) \mathrm {d}u \mathrm {d}s. \end{aligned} \end{aligned}$$

When \(\gamma (t) := \exp _p (tw)\) for \(w \in T_p \mathcal {M}\) with \(\Vert w \Vert \le 2\), the above integrand is continuous with respect to v, w, t, u and p, and thus, the constant \(C^\prime := 2\sup \{ \Vert {\bar{\nabla }}_{{\dot{\gamma }}} \mathrm {II}({\dot{\gamma }}, P_\gamma ^{0 \rightarrow t} v) \Vert : p \in \mathcal {M}\text { and } w, v \in T_p\mathcal {M}, \Vert w \Vert \le 2, \Vert v\Vert \le 2, t \in [0, 1] \}\) is finite. Therefore, (B.3) holds. \(\square \)
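
A small sanity check of (B.2), ours and not part of the original argument: take \(\mathcal {M}= S^2\) embedded in \({\bar{\mathcal {M}}} = \mathbb {R}^3\), where the ambient parallel transport along any curve is the identity (under the canonical identification of tangent spaces), so (B.2) reduces to \(\Vert P_\gamma ^{0 \rightarrow t} v - v \Vert \le C \ell (\gamma |_{[0,t]}) \Vert v \Vert \). With the closed-form transport along a great circle:

import numpy as np

def sphere_transport(p, u, v, t):
    # parallel transport of v in T_p S^2 along gamma(s) = cos(s) p + sin(s) u
    # (u a unit tangent at p), from s = 0 to s = t
    a = np.dot(v, u)                            # component along the geodesic direction
    gamma_dot = -np.sin(t) * p + np.cos(t) * u  # gamma'(t)
    return a * gamma_dot + (v - a * u)

p = np.array([0.0, 0.0, 1.0])
u = np.array([1.0, 0.0, 0.0])                   # unit tangent at p
v = np.array([0.3, 0.4, 0.0])                   # another tangent vector at p
for t in [0.5, 0.1, 0.02]:
    err = np.linalg.norm(sphere_transport(p, u, v, t) - v)
    print(t, err / (t * np.linalg.norm(v)))     # bounded ratio, consistent with (B.2)

Here \(\ell (\gamma |_{[0,t]}) = t\) since the great circle is traversed at unit speed.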

The following lemma controls the error of parallelly transporting a vector along the shortest geodesic on \(\mathcal {M}\) and then moving back along the shortest geodesic on \({\bar{\mathcal {M}}}\).

Lemma B.2

Fix \(p \in \mathcal {M}\), then there exists \(C > 0\) such that for every \(v \in T_p \mathcal {M}\) with \(\Vert v \Vert \le 2\), \(t \in [0, 1]\) and \(w, \xi \in T_p {\bar{\mathcal {M}}}\), the following inequality holds

$$\begin{aligned} |\left\langle {\bar{P}}_{\gamma (t),p} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle - \left\langle w, \xi \right\rangle | \le C t^2 \Vert w \Vert \Vert \xi \Vert , \end{aligned}$$
(B.5)

where \(\gamma (t) := \exp _p (tv)\). Consequently,

$$\begin{aligned} \big \Vert {\bar{P}}_{p,\gamma (t)} - {\bar{P}}^{0 \rightarrow t}_\gamma \big \Vert \le Ct^2. \end{aligned}$$
(B.6)

Proof

Let \(w, \xi \in T_p {\bar{\mathcal {M}}}\) and \(v \in T_p \mathcal {M}\) with \(\Vert v\Vert \le 2\) and \(\gamma (t) := \exp _p(tv)\), then

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} \left\langle {\bar{P}}_{\gamma (t),p} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle&= \frac{\mathrm {d}}{\mathrm {d}t} \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&= \left\langle {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\quad + \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\overset{(\mathrm{B.1})}{=} \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle , \end{aligned}$$
(B.7)
$$\begin{aligned} \frac{\mathrm {d}^2}{\mathrm {d}t^2} \left\langle {\bar{P}}_{\gamma (t),p} \bar{P}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle&= \left\langle {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\quad + \left\langle {\bar{P}}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{\nabla }}_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle \nonumber \\&\quad \overset{(\mathrm{B.1})}{=} \left\langle \bar{P}_{\gamma }^{0 \rightarrow t} w , {\bar{\nabla }}_{{\dot{\gamma }}} \bar{\nabla }_{{\dot{\gamma }}}{\bar{P}}_{p, \gamma (t)} \xi \right\rangle . \end{aligned}$$
(B.8)

Let \(Y(q) := {\bar{P}}_{pq} \xi \) for \(q \in {\bar{\mathcal {M}}}\); then \(Y(\gamma (t)) = {\bar{P}}_{p, \gamma (t)} \xi \). From [19, Proposition 2.2(c)], we know \({\bar{\nabla }}_{{\dot{\gamma }}} Y = {\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \). By the definition of Y, \({\bar{\nabla }}_v Y(p) = 0\) for every \(v \in T_p {\bar{\mathcal {M}}}\). Therefore, the right-hand side of (B.7) vanishes at \(t = 0\). Since \(\bar{\nabla }_{{\dot{\gamma }}}{\bar{\nabla }}_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \) is continuous with respect to t and v and linear with respect to \(\xi \), the constant

$$\begin{aligned} C := \sup \left\{ \Vert {\bar{\nabla }}_{{\dot{\gamma }}}\bar{\nabla }_{{\dot{\gamma }}} {\bar{P}}_{p, \gamma (t)} \xi \Vert : t \in [0, 1], \Vert v \Vert \le 2, \Vert \xi \Vert \le 1 \right\} < \infty . \end{aligned}$$

Applying Taylor’s theorem at \(t = 0\), we know for every \(t \in [0, 1]\) and \(w, \xi \in T_p {\bar{\mathcal {M}}}\),

$$\begin{aligned} |\left\langle {\bar{P}}_{\gamma (t),p} {\bar{P}}_{\gamma }^{0 \rightarrow t} w , \xi \right\rangle - \left\langle w, \xi \right\rangle | \le C t^2 \Vert w \Vert \Vert \xi \Vert . \end{aligned}$$

Since C is independent of v, the above holds for every \(\Vert v \Vert \le 2\). Finally, (B.6) follows from

$$\begin{aligned} \big \Vert {\bar{P}}_{p,\gamma (t)} - {\bar{P}}^{0 \rightarrow t}_\gamma \big \Vert = \big \Vert \bar{P}_{\gamma (t),p} {\bar{P}}_\gamma ^{0 \rightarrow t} - \mathrm {Id}\big \Vert \le Ct^2. \end{aligned}$$

This completes the proof. \(\square \)

The lemma below bounds the difference between the inverse exponential maps of \(\mathcal {M}\) and \({\bar{\mathcal {M}}}\); it is proved by extending \(\exp _p\) to a retraction on \({\bar{\mathcal {M}}}\) using Fermi coordinates [48, p. 135] and applying [71, Lemma 3] to bound the difference of two inverse retractions.

Lemma B.3

Fix \(p_0 \in \mathcal {M}\). Then there exist \(C, r > 0\) such that the following inequality holds for every \(p, q \in B_{r}(p_0)\).

$$\begin{aligned} \big \Vert \exp ^{-1}_p q - \overline{\exp }^{-1}_p q \big \Vert \le C d_{{\bar{\mathcal {M}}}}(p, q)^2. \end{aligned}$$
(B.9)

Proof

From Lemma 4.1, there exists \(r > 0\) such that \(\exp _p\) is a diffeomorphism from \(\{ v \in T_p \mathcal {M}: \Vert v \Vert < 4r \}\) to \(B_{4r}(p)\) for every \(p \in B_{4r}(p_0)\). According to Theorem 5.25 in [48], there exists \(r^\prime > 0\) such that \({\overline{\exp }}\) is a diffeomorphism from \(V := \{ (p, v) \in N\mathcal {M}: p \in B_{2r}(p_0), \Vert v \Vert < r^\prime \}\) onto its image, where \(N\mathcal {M}:= \bigcup _{p \in \mathcal {M}} (T_p \mathcal {M})^{\perp }\) is the normal bundle. Let \(E_1, \ldots , E_{d}\) be vector fields on \(B_{4r}(p_0)\) such that \(\{ E_1(p), \ldots , E_{n}(p) \}\) and \(\{ E_{n+1}(p), \ldots , E_{d}(p) \}\) are orthonormal bases of \(T_p\mathcal {M}\) and \((T_p\mathcal {M})^\perp \), respectively. For \(p \in B_r(p_0)\) and \(v := \sum _{i=1}^d w_i E_i(p) \in T_p {\bar{\mathcal {M}}}\) with \(\Vert v_\top \Vert < r\) and \(\Vert v_\perp \Vert < r^\prime \), we define \(q(p, v) := \exp _p (v_\top )\) and

$$\begin{aligned} E(p, v) := {\overline{\exp }}_{q(p, v)} \left( \sum _{i=n+1}^d w_i E_i(q(p, v)) \right) . \end{aligned}$$

The above discussion shows that E is well-defined, and \(E(p, v) = \exp _p v\) if \(v \in T_p \mathcal {M}\) and \(\Vert v\Vert < r\), and \(E(p, v) = {\overline{\exp }}_p v\) if \(v \in (T_p \mathcal {M})^\perp \) and \(\Vert v\Vert < r^\prime \). Thus, E can be regarded as a retraction on \({\bar{\mathcal {M}}}\) defined near \(p_0\), and (B.9) follows from [71, Lemma 3]. \(\square \)
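
A numerical sanity check of (B.9), again ours and not part of the proof: on the unit sphere in \(\mathbb {R}^3\) the ambient inverse exponential is simply \({\overline{\exp }}^{-1}_p q = q - p\), and the ratio \(\Vert \exp ^{-1}_p q - (q - p)\Vert / d_{{\bar{\mathcal {M}}}}(p, q)^2\) stays bounded:

import numpy as np

def sphere_log(p, q):
    # inverse exponential map on the unit sphere: exp_p^{-1}(q) in T_p S^2
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    u = q - np.dot(p, q) * p
    return theta * u / np.linalg.norm(u)

p = np.array([0.0, 0.0, 1.0])
u = np.array([1.0, 0.0, 0.0])
for theta in [0.5, 0.1, 0.02]:
    q = np.cos(theta) * p + np.sin(theta) * u
    diff = np.linalg.norm(sphere_log(p, q) - (q - p))   # Euclidean log is q - p
    print(theta, diff / np.linalg.norm(q - p) ** 2)     # ratio stays bounded (about 1/2)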

Proof of Theorem 4.4

(i) First, we verify the local Lipschitz property of X. Since \({\bar{X}}\) is locally Lipschitz at \(p \in \mathcal {M}\), there exist \(L_p > 0\) and a neighborhood \({\bar{U}}_p \subset {\bar{\mathcal {M}}}\) of p such that \({\bar{X}}\) is \(L_p\)-Lipschitz in \({\bar{U}}_p\). By Lemma B.1, we know for every \(p, q \in \mathcal {M}\cap {\bar{U}}_p\) and every geodesic \(\gamma \) joining p and q, it holds that

$$\begin{aligned} \begin{aligned} \big \Vert P_{\gamma }^{0 \rightarrow 1} X(p) - X(q) \big \Vert&\le \big \Vert P_{\gamma }^{0 \rightarrow 1} X(p) - {\bar{P}}_{\gamma }^{0 \rightarrow 1}X(p) \big \Vert + \big \Vert {\bar{P}}_{\gamma }^{0 \rightarrow 1}X(p) - X(q) \big \Vert \\&\le \left( C \sup _{p \in \mathcal {M}}\Vert X(p) \Vert + L_p \right) \ell (\gamma ), \end{aligned} \end{aligned}$$

where C is the constant in Lemma B.1. Since \(\mathcal {M}\) is compact and X is continuous, \(M := \sup _{p \in \mathcal {M}} \Vert X(p) \Vert < \infty \), and hence X is Lipschitz on \(\mathcal {M}\cap {\bar{U}}_p\), i.e., X is locally Lipschitz at p.

(ii) Next, we show that X is directionally differentiable at p in the Hadamard sense. Fix \(v \in T_p \mathcal {M}\) with \(\Vert v \Vert = 1\) and let \(\gamma (t) := \exp _p(tv^\prime )\) for \(t \in (-1, 1)\) and \(v^\prime \in T_p \mathcal {M}\) with \(\Vert v^\prime \Vert \le 2\), and decompose \(P_\gamma ^{t \rightarrow 0} X(\exp _p(tv^\prime )) - X(p) \) into

$$\begin{aligned} \begin{aligned}&\underbrace{ (P_\gamma ^{t \rightarrow 0} - {\bar{P}}_\gamma ^{t \rightarrow 0})X(\gamma (t)) - t S(t) }_{\text {(A)}} + \underbrace{ (\bar{P}_\gamma ^{t \rightarrow 0} - {\bar{P}}_{\gamma (t),p}) X(\gamma (t)) }_{\text {(B)}} \\&\qquad + \underbrace{ {\bar{P}}_{\gamma (t),p} X(\gamma (t)) - X(p) + tS(t) }_{\text {(C)}}, \end{aligned} \end{aligned}$$
(B.10)

where \(S(t) := \mathrm {II}(-{\dot{\gamma }}(0), P_\gamma ^{t \rightarrow 0} X(\gamma (t)))\).

From Lemmas B.1 and B.2, there exists \(C > 0\) (independent of \(v^\prime \)) such that

$$\begin{aligned} \begin{aligned}&| (\mathrm A) | \overset{(\mathrm{B.3})}{\le } Ct^2 \quad \text { and } \quad |(\mathrm B)| \le \Vert \bar{P}_\gamma ^{t \rightarrow 0} - {\bar{P}}_{\gamma (t), p} \Vert \Vert X(\gamma (t)) \Vert \overset{(\mathrm{B.6})}{\le } CM t^2. \end{aligned} \end{aligned}$$

When \(t \rightarrow 0^{+}\) and \(v^\prime \rightarrow v\), we know \(|(\mathrm A)| + |(\mathrm B)| = O(t^2)\) and \(v^\prime _t := t^{-1}\overline{\exp }^{-1}_p \gamma (t) \rightarrow v\), and hence the continuity of S(t) and the Hadamard differentiability (4.24) imply that

$$\begin{aligned} \begin{aligned}&\lim _{\begin{array}{c} t \downarrow 0 \\ v^\prime \rightarrow v \end{array}} \frac{1}{t} \left[ P_{\exp _p(tv^\prime ),p} X(\exp _p(tv^\prime )) - X(p) \right] = \lim _{\begin{array}{c} t \downarrow 0 \\ v^\prime \rightarrow v \end{array}} \frac{1}{t}(\mathrm C) \\&\quad = \lim _{t \downarrow 0} \frac{1}{t} \left[ {\bar{P}}_{\overline{\exp }_p(tv_t^\prime ),p} X({\overline{\exp }}_p(tv_t^\prime )) - X(p) \right] \\&\qquad + \lim _{\begin{array}{c} t \downarrow 0 \\ v^\prime \rightarrow v \end{array}} S(t) \overset{(4.24)}{=} \bar{\nabla } X(p; v) - \mathrm {II}(v, X(p)). \end{aligned} \end{aligned}$$

Note that \(\mathrm {II}(v, X(p)) \in (T_p\mathcal {M})^\perp \), we find \(\nabla X(p; v) = {\bar{\nabla }} X(p; v) - \mathrm {II}(v, X(p)) = ({\bar{\nabla }} X(p; v))_\top \) for every \(\Vert v\Vert = 1\). Since \(\nabla X(p; tv) = t\nabla X(p; v)\) for all \(t > 0\), we know X is directionally differentiable at p in the Hadamard sense.

(iii.b) Finally, we need to verify the inequality (4.27). Suppose that the assumption (iii.b) holds. Since the parallel transport is an isometry, from (4.26) there exist \(C > 0, \delta \in (0, 1)\) such that for every \(q \in \mathcal {M}\) with \(d_{{\bar{\mathcal {M}}}}(p, q) < \delta \) and \({\bar{H}}_q \in {\bar{\mathcal {K}}}(q)\), it holds that

$$\begin{aligned} \Vert {\bar{P}}_{pq} X(p) - X(q) - {\bar{H}}_q {\overline{\exp }}_q^{-1} p \Vert \le C d_{{\bar{\mathcal {M}}}}(p, q)^{1 + \mu } \le C d_\mathcal {M}(p, q)^{1 + \mu }.\nonumber \\ \end{aligned}$$
(B.11)

Without loss of generality, we may assume that \(B_\delta (p) \subset V_p\), where \(V_p\) is the neighborhood of p in the assumption (iii.b). For every \(q \in \mathcal {M}\) and \(H_q \in \mathcal {K}(q)\) such that \(d_\mathcal {M}(q, p) < \delta \), the assumption (iii.b) states that there exists \({\bar{H}}_q \in {\bar{\mathcal {K}}}(q)\) satisfying \(({\bar{H}}_q \exp ^{-1}_qp)_\perp = \mathrm {II}(\exp ^{-1}_qp, X(q))\) and \(({\bar{H}}_q \exp ^{-1}_qp)_\top = H_q \exp ^{-1}_qp\). Let \(\alpha = d_\mathcal {M}(p, q)\), \(\xi := \alpha ^{-1} \exp ^{-1}_p q\) and \(\gamma (t) := \exp _p( t \xi )\); then it holds that

$$\begin{aligned} \begin{aligned}&\Vert P_{pq} X(p) - X(q) - H_q \exp ^{-1}_q p \Vert \\&\quad \le \underbrace{ \Vert P_{pq} X(p) - {\bar{P}}_{\gamma }^{0 \rightarrow \alpha } X(p) + \mathrm {II}(\exp ^{-1}_q p, X(q)) \Vert }_{\text {(A)}} \\&\qquad + \underbrace{\Vert ( {\bar{P}}_{\gamma }^{0 \rightarrow \alpha } - {\bar{P}}_{pq} )X(p) \Vert }_{\text {(B)}} + \underbrace{\Vert \bar{P}_{pq}X(p) - X(q) - {\bar{H}}_q {\overline{\exp }}^{-1}_qp \Vert }_{\text {(C)}} \\&\qquad + \underbrace{\Vert {\bar{H}}_q ({\overline{\exp }}^{-1}_qp - \exp ^{-1}_q p) \Vert }_{\text {(D)}}. \end{aligned} \end{aligned}$$

Let \(r, {\tilde{C}} > 0\) be constants such that Lemmas B.1, B.2 and B.3 hold with \(p_0 = p\), and let \({\tilde{L}}_p > 0\) be the Lipschitz constant of X at p; by the compactness of \(\mathcal {M}\) and the continuity of \(\mathrm {II}(v, w)\), we further assume that \(\sup \{ \Vert \mathrm {II}(v, w) \Vert : p \in \mathcal {M}, v, w \in T_p\mathcal {M}, \Vert v\Vert =\Vert w\Vert =1 \}< {\tilde{C}} < \infty \). Since \({\dot{\gamma }}(\alpha ) = -\alpha ^{-1} \exp ^{-1}_q p\) and \(\mathrm {II}(\exp ^{-1}_qp, X(q)) = -\alpha \mathrm {II}({\dot{\gamma }}(\alpha ), X(q))\), we have

$$\begin{aligned} \begin{aligned} |(\mathrm A)|&\le \Vert P_{pq} X(p) - {\bar{P}}_{\gamma }^{0 \rightarrow \alpha } X(p) - \alpha \mathrm {II}({\dot{\gamma }}(\alpha ), P_\gamma ^{0 \rightarrow \alpha }X(p)) \Vert \\&\qquad + \alpha \Vert \mathrm {II}({\dot{\gamma }}(\alpha ), P_\gamma ^{0 \rightarrow \alpha }X(p) - X(q)) \Vert \\&\quad \overset{(\mathrm{B.3})}{\le }{\tilde{C}} \alpha ^2 + {\tilde{C}} \alpha \Vert {\dot{\gamma }}(\alpha ) \Vert \Vert P_\gamma ^{0 \rightarrow \alpha }X(p) - X(q) \Vert \le {\tilde{C}}(1 + {\tilde{L}}_p) \alpha ^2, \end{aligned} \end{aligned}$$

where the last inequality follows from the Lipschitzness of X at p. From (B.6) we find that \(|(\mathrm B)| \le {\tilde{C}}M\alpha ^2\). Since \(\mathcal {K}\) is locally bounded, there exists \({\tilde{M}} > 0\) such that all elements in \(\bigcup _{d(p, q) \le \delta } {\bar{\mathcal {K}}}(q)\) are bounded by \(\tilde{M}\), and thus (B.9) yields \(|(\mathrm D)| \le {\tilde{C}} {\tilde{M}} \alpha ^2\). Combining with (B.11) and letting \(C_0 := {\tilde{C}}(M + {\tilde{M}} + {\tilde{L}}_p + 1)\), it holds that

$$\begin{aligned} \begin{aligned} \Vert P_{pq} X(p) - X(q) - H_q \exp ^{-1}_q p \Vert \le C d_\mathcal {M}(p, q)^{1 + \mu } + C_0 d_\mathcal {M}(p, q)^2. \end{aligned} \end{aligned}$$

Thus, the inequality (4.27) holds for \(q \in \mathcal {M}\) with \(d_\mathcal {M}(p, q) < {\tilde{\delta }} := \min \{ 1, r, \delta \}\) and \({\hat{C}} := C + C_0\). Moreover, when \(\mu \in [0, 1)\), the same inequality holds for \(q \in \mathcal {M}\) with \(d_\mathcal {M}(p, q) < {\tilde{\delta }} := \min \{ r, \delta , (C/C_0)^{\frac{1}{1 - \mu }} \}\) and \({\hat{C}} := 2C\).

(iii.a) When \(X(p) = 0\), we know

$$\begin{aligned} \begin{aligned} \Vert X(q) + H_q \exp ^{-1}_q p \Vert&= \Vert ({\bar{X}}(q) + {\bar{H}}_q \exp ^{-1}_q p)_\top \Vert \le \Vert {\bar{X}}(q) + {\bar{H}}_q \exp ^{-1}_q p\Vert \\&\le \Vert {\bar{X}}(q) + {\bar{H}}_q {\overline{\exp }}^{-1}_q p \Vert + \Vert \bar{H}_q ({\overline{\exp }}^{-1}_qp - \exp ^{-1}_q p) \Vert . \end{aligned} \end{aligned}$$

The above two terms can be controlled by (4.26) and (B.9), respectively, and therefore the conclusion follows from a similar discussion as in (iii.b).

(iv) When \(\mu = 0\), for every \(C > 0\), if (4.26) holds, then (4.27) holds with \({\hat{C}} = 2C\), which implies the semismoothness in Definition 4.1. The conclusion for the semismoothness with order \(\mu \in (0, 1]\) directly follows from (i), (ii) and (iii). \(\square \)
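
To make the identity \(\nabla X(p; v) = ({\bar{\nabla }} X(p; v))_\top \) from part (ii) concrete, here is a small numerical sketch of ours (not part of the paper): on \(S^2 \subset \mathbb {R}^3\) we take the smooth tangent field \({\bar{X}}(x) = Ax - \left\langle Ax, x \right\rangle x\) for a fixed symmetric matrix A of our choosing, approximate the intrinsic directional derivative by a finite difference along a geodesic with parallel transport back to p, and compare it with the tangential projection of the ambient directional derivative:

import numpy as np

def transport_back(p, u, t, w):
    # parallel transport of w in T_{gamma(t)} S^2 back to T_p S^2 along
    # gamma(s) = cos(s) p + sin(s) u, with u a unit tangent at p
    gdot = -np.sin(t) * p + np.cos(t) * u      # unit tangent gamma'(t)
    a = np.dot(w, gdot)
    return a * u + (w - a * gdot)

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 1.0]])

def X_bar(x):
    # a smooth vector field on R^3 that is tangent to the unit sphere
    return A @ x - np.dot(A @ x, x) * x

p = np.array([0.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 0.0])                  # unit tangent direction at p
t = 1e-6

# tangential part of the ambient (Euclidean) directional derivative
d_bar = (X_bar(p + t * v) - X_bar(p)) / t
d_bar_top = d_bar - np.dot(d_bar, p) * p

# intrinsic directional derivative along the geodesic exp_p(t v)
gamma_t = np.cos(t) * p + np.sin(t) * v
d_intr = (transport_back(p, v, t, X_bar(gamma_t)) - X_bar(p)) / t

print(np.linalg.norm(d_intr - d_bar_top))      # small (of the order of the step size t)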

B.2 Proof of Proposition 4.2

First, we briefly review how to represent quantities on manifolds in local coordinates. Let \((U, \varphi )\) be a chart near \(p \in \mathcal {M}\), and \(\{ e_i \}\) be the standard basis of \(\mathbb {R}^n\), i.e., the j-th component of \(e_i\) is \(\delta _{ij}\), and let \(E_i := (\mathrm {d}\varphi )^{-1}[e_i]\); then \(E_i\) is a smooth vector field near p and \(\{ E_i(q) \}\) is a basis of \(T_q\mathcal {M}\) for every \(q \in U\). Let \(g_{ij} := \left\langle E_i, E_j \right\rangle \), \(G := (g_{ij})_{i,j\in [n]}\) and \(\varGamma _{ij}^k\) be the Christoffel symbols such that \(\nabla _{E_i}{E_j} = \sum _{k=1}^n \varGamma _{ij}^k E_k\), and \(g^{ij}\) be such that \(( g^{ij} )_{i,j\in [n]} = G^{-1}\); we call G the metric matrix. The gradient of a smooth function \(f:\mathcal {M}\rightarrow \mathbb {R}\) is \({{\,\mathrm{grad}\,}}f = \sum _{i, j=1}^n (g^{ij} E_i f)E_j\) (see [48, p. 27]). Two smooth vector fields X, Y defined near p can be represented as \(X = \sum _{i=1}^n X^i E_i\) and \(Y = \sum _{i=1}^n Y^i E_i\), and their inner product is \(\left\langle X, Y \right\rangle = \sum _{i,j=1}^n g_{ij} X^i Y^j\). The covariant derivative \(\nabla _XY\) can be written as (see [48, p. 92])

$$\begin{aligned} \nabla _{X}Y = \sum _{i=1}^n (X Y^i) E_i + \sum _{i, j, k=1}^n X^iY^j \varGamma _{ij}^k E_k. \end{aligned}$$
(B.12)

When \((U, \varphi )\) is the chart constructed in the proof of Lemma 4.2, \(\varphi ^{-1}\) is the normal coordinate at \(p \in \mathcal {M}\) (see [48, p. 132]). Proposition 5.24 in [48] shows that \(g_{ij}(p) = \delta _{ij}\), \(\varGamma _{ij}^k(p) = 0\), and all partial derivatives of \(g_{ij}\) vanish at p. Thus, \({{\,\mathrm{grad}\,}}f(p) = \sum _{i=1}^n (E_i f)(p) E_i(p)\), \(\nabla _XY(p) = \sum _{i=1}^n (X Y^i)(p) E_i(p)\) and \(\left\langle X(p), Y(p) \right\rangle = \sum _{i=1}^n X^i(p) Y^i(p)\).
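
For concreteness (our own illustration, not part of the paper), the coordinate formulas above can be checked symbolically for the round metric on \(S^2\) in spherical coordinates \((\theta , \phi )\), where \(G = \mathrm {diag}(1, \sin ^2\theta )\); the snippet below computes the Christoffel symbols via the standard formula \(\varGamma _{ij}^k = \frac{1}{2}\sum _l g^{kl}(\partial _i g_{jl} + \partial _j g_{il} - \partial _l g_{ij})\) and the coordinate expression \({{\,\mathrm{grad}\,}}f = \sum _{i, j} (g^{ij} E_i f)E_j\):

import sympy as sp

theta, phi = sp.symbols('theta phi', positive=True)
x = [theta, phi]
g = sp.Matrix([[1, 0], [0, sp.sin(theta)**2]])   # metric matrix G of the round sphere
g_inv = g.inv()

def christoffel(k, i, j):
    # Gamma^k_{ij} = 1/2 * sum_l g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})
    return sp.simplify(sp.Rational(1, 2) * sum(
        g_inv[k, l] * (sp.diff(g[j, l], x[i]) + sp.diff(g[i, l], x[j])
                       - sp.diff(g[i, j], x[l])) for l in range(2)))

print(christoffel(0, 1, 1))   # Gamma^theta_{phi phi} (equals -sin(theta)*cos(theta))
print(christoffel(1, 0, 1))   # Gamma^phi_{theta phi} (equals cos(theta)/sin(theta))

# coordinate expression of the gradient: grad f = sum_{i,j} g^{ij} (E_i f) E_j
f = sp.Function('f')(theta, phi)
grad_f = g_inv * sp.Matrix([sp.diff(f, theta), sp.diff(f, phi)])
print(grad_f)                 # components (df/dtheta, (1/sin(theta)^2) df/dphi)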

The following two lemmas provide chain rules on manifolds; they are proved by pulling functions on manifolds back to Euclidean spaces and then applying the Euclidean chain rules.

Lemma B.4

Let \(\mathcal {M}, \mathcal {N}\) be two Riemannian manifolds, \(F: \mathcal {M}\rightarrow \mathcal {N}\) be a smooth map, and \(g: \mathcal {N}\rightarrow \mathbb {R}\) be locally Lipschitz at \(F(p) \in \mathcal {N}\). Then the following chain rule holds at \(p \in \mathcal {M}\):

$$\begin{aligned} \partial (g \circ F)(p) \subseteq \partial g(F(p)) \mathrm {d}F(p), \end{aligned}$$
(B.13)

where the inclusion means that for every \(\xi \in \partial (g \circ F)(p)\), there exists \(\zeta \in \partial g(F(p))\) such that \(\left\langle \xi , v \right\rangle = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \) for every \(v \in T_p\mathcal {M}\).
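
As a toy Euclidean instance of (B.13) (our own illustration, not taken from the paper), let \(\mathcal {M}= \mathcal {N}= \mathbb {R}\), \(F(x) = x^2\) and \(g(y) = |y|\). Then \(g \circ F\) is smooth and

$$\begin{aligned} \partial (g \circ F)(0) = \{0\} \subseteq \partial g(F(0))\, \mathrm {d}F(0) = [-1, 1] \cdot 0 = \{0\}, \end{aligned}$$

whereas for \(F(x) = x\) both sides equal \([-1, 1]\); in general only the inclusion is asserted.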

Proof

Let \((U, \varphi )\) and \((V, \psi )\) be charts near \(p \in \mathcal {M}\) and \(F(p) \in \mathcal {N}\) with \(\varphi (p) = 0\) and \(\psi (F(p)) = 0\), respectively. We denote by \(G_\mathcal {N}, G_\mathcal {M}\) the metric matrices on \(\mathcal {N}, \mathcal {M}\) defined above, respectively. Define \({\hat{g}} := g \circ \psi ^{-1}\) and \({\hat{F}} := \psi \circ F \circ \varphi ^{-1}\). Then, Theorem 2.3.10 in [25] yields that

$$\begin{aligned} \partial ({\hat{g}} \circ {\hat{F}})(0) \subseteq \partial {\hat{g}}({\hat{F}}(0)) \mathrm {d}{\hat{F}}(0). \end{aligned}$$

Fix \(\xi \in \partial (g \circ F)(p)\), then from (2.2) we know \({\hat{\xi }} := G_\mathcal {M}\mathrm {d}\varphi |_p[\xi ] \in \partial ({\hat{g}} \circ {\hat{F}})(0)\). The above display shows that there exists \({\hat{\zeta }} \in \partial {\hat{g}}({\hat{F}}(0))\) such that for every \({\hat{v}} \in \mathbb {R}^n\), it holds that \( \hat{\xi }^\top {\hat{v}} = {\hat{\zeta }}^\top (\mathrm {d}{\hat{F}}(0)[{\hat{v}}]) \). Let \(v := (\mathrm {d}\varphi |_p)^{-1}[{\hat{v}}]\), then \({\hat{\xi }}^\top {\hat{v}} = (\mathrm {d}\varphi |_p[\xi ])^\top G_\mathcal {M}(\mathrm {d}\varphi |_p[v]) = \left\langle \xi , v \right\rangle \). Since \(\zeta := (\mathrm {d}\psi |_{F(p)})^{-1} G_\mathcal {N}^{-1} {\hat{\zeta }} \in \partial g(F(p))\) and \(\mathrm {d}{\hat{F}} = \mathrm {d}\psi \circ \mathrm {d}F \circ (\mathrm {d}\varphi )^{-1}\), we find that \({\hat{\zeta }}^\top (\mathrm {d}{\hat{F}}(0)[{\hat{v}}]) = (\mathrm {d}\psi |_{F(p)}[\zeta ])^\top G_\mathcal {N}( \mathrm {d}\psi |_{F(p)} [\mathrm {d}F|_p[v]]) = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \). Therefore, \(\left\langle \xi , v \right\rangle = \left\langle \zeta , \mathrm {d}F|_p[v] \right\rangle \) for every \(v \in T_p \mathcal {M}\), and hence \(\xi \in \partial g(F(p)) \mathrm {d}F(p)\). \(\square \)

Lemma B.5

Let \(\mathcal {M}\) be a Riemannian manifold, X be a vector field on \(\mathcal {M}\) that is locally Lipschitz at \(p \in \mathcal {M}\), and Y be a smooth vector field on \(\mathcal {M}\). Define \(f := \left\langle X, Y \right\rangle \), then the following chain rule holds.

$$\begin{aligned} \partial f(p) = \left\langle \partial X(p), Y(p) \right\rangle + \left\langle X(p), \nabla Y(p) \right\rangle , \end{aligned}$$
(B.14)

where \(\xi \in \partial f(p)\) is regarded as the operator \(\zeta \mapsto \left\langle \xi , \zeta \right\rangle \) for \(\zeta \in T_p\mathcal {M}\), and the right-hand side is understood as the set of operators \(\zeta \mapsto \left\langle H\zeta , Y(p) \right\rangle + \left\langle X(p), \nabla _\zeta Y(p) \right\rangle \) for \(\zeta \in T_p\mathcal {M}\) and \(H \in \partial X(p)\).

Proof

Let \((U_p, \varphi )\) be a chart such that \(\varphi ^{-1}\) is the normal coordinate. Suppose that \(X = \sum _{i=1}^n X^i E_i\) and \(Y = \sum _{i=1}^n Y^i E_i\); we define \({\hat{f}} := f \circ \varphi ^{-1}\), \({\hat{X}}^j := X^j \circ \varphi ^{-1}\), \({\hat{X}} := ({\hat{X}}^1, \ldots , {\hat{X}}^n)\), and \({\hat{Y}}\) likewise. Thus, \({\hat{f}}(x) = {\hat{X}}(x)^\top G(x) {\hat{Y}}(x)\) for \(x \in {\hat{U}}_p := \varphi (U_p)\), and \({\hat{f}}\) can be regarded as the composition \(\mathcal {A}\circ {\tilde{X}}\), where \(\mathcal {A}(x, v) := v^\top G(x) {\hat{Y}}(x)\) and \({\tilde{X}}(x) := (x, {\hat{X}}(x))\) for \(x \in {\hat{U}}_p\) and \(v \in \mathbb {R}^n\). Since \(\mathcal {A}\) is smooth and \(G(0) = I_n, \nabla G(0) = 0\), we know \(\nabla \mathcal {A}(0, v)[w, \xi ] = v^\top \nabla {\hat{Y}}(0)[w] + \xi ^\top {\hat{Y}}(0)\) for \(w, \xi \in \mathbb {R}^n\). Note that \(\partial {\tilde{X}} = (\mathrm {Id}, \partial {\hat{X}})\), then the chain rule in Euclidean spaces [25, Theorem 2.6.6] implies that

$$\begin{aligned} \partial {\hat{f}}(0) = \nabla \mathcal {A}(0, {\hat{X}}(0))[\mathrm {Id}, \partial {\hat{X}}(0)] = {\hat{X}}(0)^\top \nabla {\hat{Y}}(0) + {\hat{Y}}(0)^\top \partial {\hat{X}}(0). \end{aligned}$$
(B.15)

We define the linear bijection \(J: \mathcal {L}(\mathbb {R}^n) \rightarrow \mathcal {L}(T_p\mathcal {M})\) as

$$\begin{aligned} (J{\hat{H}})v := \sum _{i=1}^n [{\hat{H}} {\hat{v}}]_i E_i(p), \end{aligned}$$

where \({\hat{H}} \in \mathcal {L}(\mathbb {R}^n)\), \(v \in T_p \mathcal {M}\) and \({\hat{v}} := \mathrm {d}\varphi |_p[v]\). Below we show that \(J(\partial {\hat{X}}(0)) = \partial X(p)\).

Suppose \({\hat{H}} \in \partial _B {\hat{X}}(0)\), then there exists \(x_k \rightarrow 0\) such that \({\hat{X}}\) is differentiable at \(x_k\) and \(\nabla {\hat{X}}(x_k) \rightarrow {\hat{H}}\) as \(k\rightarrow \infty \). Let \(H := J{\hat{H}}\), \({\hat{v}} \in \mathbb {R}^n\), \(v := (\mathrm {d}\varphi |_p)^{-1}[{\hat{v}}]\), \(p_k := \varphi ^{-1}(x_k)\) and \(v_k := (\mathrm {d}\varphi |_{p_k})^{-1}[{\hat{v}}]\), then \(\nabla {\hat{X}}^i(x_k)[{\hat{v}}] = \mathrm {d}{\hat{X}}^i|_{x_k}[{\hat{v}}] = \mathrm {d}X^i|_{p_k}[v_k] = v_kX^i(p_k)\), and therefore

$$\begin{aligned} \nabla _{v_k} X(p_k) = \sum _{i=1}^n \nabla {\hat{X}}^i(x_k)[{\hat{v}}] E_i + \sum _{i, j, l=1}^n v_k^i X^j \varGamma _{ij}^l E_l, \end{aligned}$$
where \(v_k = \sum _{i=1}^n v_k^i E_i(p_k)\) and the right-hand side is evaluated at \(p_k\).

Letting \(k \rightarrow \infty \) and noting \(\varGamma _{ij}^l(p) = 0\), we find \(\nabla _{v_k} X(p_k) \rightarrow \sum _{i=1}^n [{\hat{H}} {\hat{v}}]_i E_i(p) = Hv\). Since \({\hat{v}}\) is arbitrary, we conclude that \(\nabla X(p_k) \rightarrow H \in \partial _B X(p)\). By reversing this argument, we can show that \(J^{-1}H \in \partial _B {\hat{X}}(0)\) for every \(H \in \partial _B X(p)\). Then, \(J(\partial _B {\hat{X}}(0)) = \partial _B X(p)\), and by the linearity of J, we know \(J(\partial {\hat{X}}(0)) = \partial X(p)\).

Thus, \({\hat{Y}}(0)^\top ({\hat{H}}{\hat{v}}) = \sum _{i=1}^n Y^i(p) \langle J{\hat{H}}v, E_i(p) \rangle = \langle J{\hat{H}}v, Y(p) \rangle \) for every \({\hat{H}} \in \partial {\hat{X}}(0)\), \(v \in T_p \mathcal {M}\) and \({\hat{v}} := \mathrm {d}\varphi |_p[v]\). Similarly, it holds that \({\hat{X}}(0)^\top \nabla {\hat{Y}}(0)[ {\hat{v}}] = \left\langle X(p), \nabla Y(p)[v] \right\rangle \). Besides, (2.2) yields \(\partial {\hat{f}}(0)[{\hat{v}}] = \partial f(p)[v]\). Therefore, the proof is completed by plugging these equations into (B.15). \(\square \)

Proof of Proposition 4.2

Since Theorem 4.4 shows that X is also locally Lipschitz at p, \(\partial X(p)\) is well-defined. Let \(U_p \subset \mathcal {M}\) be a neighborhood of p, \(\xi \in T_p\mathcal {M}\), and V be a smooth vector field on \({\bar{\mathcal {M}}}\) such that \(V_\top (q) = P_{pq}\xi \) for \(q \in U_p\), and let \(\iota : \mathcal {M}\hookrightarrow \bar{\mathcal {M}}\) be the embedding map. Define \({\bar{f}}_V := \left\langle {\bar{X}}, V \right\rangle \) and \(f_V := {\bar{f}}_V \circ \iota \); then Lemma B.4 shows that \(\partial f_V(p) \subseteq {\bar{\partial }} {\bar{f}}_V(p) \mathrm {d}\iota (p) = \bar{\partial } {\bar{f}}_V(p)|_{T_p \mathcal {M}}\).

Let \(v \in T_p \mathcal {M}\), then Lemma B.5 gives that \({\bar{\partial }} {\bar{f}}_V(p)[v] = \left\langle {\bar{\partial }} \bar{X}(p)[v], V(p) \right\rangle + \left\langle {\bar{X}}(p), {\bar{\nabla }}_v V(p) \right\rangle \). Since \(v, {\bar{X}}(p) \in T_p \mathcal {M}\), Proposition 8.4 in [48] shows that

$$\begin{aligned}\left\langle {\bar{X}}(p), {\bar{\nabla }}_v V_\perp (p) \right\rangle&= \left\langle {\bar{X}}(p), ({\bar{\nabla }}_v V_\perp (p))_\top \right\rangle = \left\langle X(p), \nabla _v V_\perp (p) \right\rangle \\&= -\left\langle \mathrm {II}(v, X(p)), V_\perp (p) \right\rangle . \end{aligned}$$

Note that \(\nabla _v V_\top (p) = \nabla _v (P_{pq}\xi )(p) = 0\), then \(\left\langle {\bar{X}}(p), {\bar{\nabla }}_v V_\top (p) \right\rangle = \left\langle X(p), \nabla _v V_\top (p) \right\rangle = 0\), and thus,

$$\begin{aligned} \left\langle {\bar{X}}(p), {\bar{\nabla }}_vV(p) \right\rangle= & {} \left\langle {\bar{X}}(p), {\bar{\nabla }}_vV_\perp (p) + \bar{\nabla }_vV_\top (p) \right\rangle \\= & {} -\left\langle \mathrm {II}(v,X(p)), V_\perp (p) \right\rangle \\= & {} -\left\langle \mathrm {II}(v,X(p)), V(p) \right\rangle , \end{aligned}$$

where the last equation follows from \(\left\langle \mathrm {II}(v, X(p)), V_\top (p) \right\rangle = 0\).

Combining these arguments, we know \(\partial f_V(p)[v]\! \subseteq \! \langle {\bar{\partial }} {\bar{X}}(p)[v] \!-\! \mathrm {II}(v, X(p)), V(p) \rangle \). Note that \({\bar{X}}(q) \in T_q \mathcal {M}\) for \(q \in U_p\), then \(f_V(q) = \left\langle X(q), V_\top (q) \right\rangle \) for \(q \in U_p\). Since \(\nabla _v V_\top (p) = 0\) and \(\partial X(p)[v] \subset T_p\mathcal {M}\), then Lemma B.5 yields \(\partial f_V(p)[v] = \left\langle \partial X(p)[v], V_\top (p) \right\rangle = \left\langle \partial X(p)[v], V(p) \right\rangle \). Therefore, we know

$$\begin{aligned} \left\langle \partial X(p)[v], V(p) \right\rangle \subseteq \langle {\bar{\partial }} {\bar{X}}(p)[v] - \mathrm {II}(v, X(p)), V(p) \rangle . \end{aligned}$$

Note that \(V(p)_\top = \xi \) and \(V(p)_\perp \) are arbitrary and \(\partial X(p)[v], {\bar{\partial }} {\bar{X}}(p)[v]\) are compact convex sets, the inclusion relationship (4.28) follows from [57, Corollary 13.1.1]. \(\square \)


Cite this article

Zhou, Y., Bao, C., Ding, C. et al. A semismooth Newton based augmented Lagrangian method for nonsmooth optimization on matrix manifolds. Math. Program. 201, 1–61 (2023). https://doi.org/10.1007/s10107-022-01898-1

