Abstract
We present examples of divergence for the BFGS and Gauss–Newton methods. These examples have objective functions with bounded level sets and share other properties with the examples published recently in this journal, like unit steps and convexity along the search lines. As in those examples, the iterates, function values and gradients in the new examples fit the general formulation in our previous work, Mascarenhas (Comput Appl Math 26(1), 2007), which also presents an example of divergence for Newton's method.
References
Absil, P., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic functions. SIAM J. Optim. 16(2), 531–547 (2005)
Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
Cotton, É.: Sur les solutions asymptotiques des équations différentielles. Ann. Sci. de l’École Normale Supérieure, Sér. 3(28), 473–521 (1911)
Dai, Y.-H.: A perfect example for the BFGS method. Published online, March 2012
Dai, Y.-H.: Convergence properties of the BFGS algorithm. SIAM J. Optim. 13(3), 693–701 (2002)
Fefferman, C.: A generalized sharp Whitney theorem for jets. Revista Matemática Iberoamericana 21(2) (2005)
Fefferman, C.: Extension of \(C^{m,\omega }\)-smooth functions by linear operators. Manuscript available at: www.math.princeton.edu/facultypapers/Fefferman. To appear in Revista Matemática Iberoamericana
Hadamard, J.: Sur l’itération et les solutions asymptotiques des équations différentielles. Bull. Soc. Math. France 29, 224–228 (1901)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48, 769–783 (1998)
Lojasiewicz, S.: Ensembles Semi-Analytiques. IHES Lecture Notes (1965)
Mascarenhas, W.F.: On the divergence of line search methods. Comput. Appl. Math. 26(1), 129–169 (2007)
Mascarenhas, W.F.: The BFGS algorithm with exact line searches fails for nonconvex functions. Math. Program. 99(1), 49–61 (2004)
Mascarenhas, W.F.: Newton’s iterates can converge to non-stationary points. Math. Program. 112(2), 327–334 (2008)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research. Springer, New York (1999)
Ortega, J.M.: The Newton–Kantorovich theorem. Am. Math. Mon. 75(6), 658–660 (1968)
Perron, O.: Über Stabilität und asymptotisches Verhalten der Lösungen eines Systems endlicher Differenzengleichungen. J. Reine Angew. Math. 161, 41–64 (1929)
Poincaré, H.: Les méthodes nouvelles de la mécanique céleste. Gauthier-Villars et fils (1892)
Powell, M.J.D.: Nonconvex minimization calculations and the conjugate gradient method. Lect. Notes Math. 1066, 122–141 (1984)
Rall, L.B.: A comparison of the existence theorems of Kantorovich and Moore. SIAM J. Numer. Anal. 17(1) (1980)
Whitney, H.: Analytic extensions of differentiable functions defined in closed sets. Trans. Am. Math. Soc. 36, 63–89 (1934)
Technicalities
This appendix begins with an explanation of why the form of the iterates, function values and gradients in reference [4] was already described in [11]. We then prove Lemma 1 and, finally, the theorems.
Let us then see how several equations in [4] and [11] correspond to (4). In [4] and in the example for Newton's method in [11] the parameter \(d_n\) is equal to \(1\); in the example for the BFGS method in [11] it is equal to \(3\). Equation (4) corresponds to Eqs. (10)–(12) in [11] and is generalized in Eqs. (34)–(37) of [11]. Dai [4] adds a constant \(f^*\) to \(f\), but this is irrelevant. He denotes our \(\lambda \) by \(t\) and works in terms of the steps \(\delta _k = x_{k+1} - x_k\), so that \(\delta _k = Q^k {D}\!\left( \lambda \right) ^k \overline{\delta }\) in [4]'s equation (2.1). This is just an affine change of coordinates in (4) and a different notation for the corresponding \(s_k\)'s used in [11]. The matrix \(M\) in [4]'s equation (2.2) equals the matrix \(Q {D}\!\left( \lambda \right) \), whose \(k\)th power multiplies \(\overline{x}_k\) in Eq. (4). The matrix \(P\) in [4]'s equation (2.5) equals the matrix \(\lambda Q {D}\!\left( \lambda \right) ^{-1}\), whose \(k\)th power multiplies \(\overline{g}_k\) in Eq. (4). Therefore, the iterates, function values and gradients are described in essentially the same way in the examples in [4, 11, 12]: in terms of orthogonal matrices \(Q\) scaled by powers of \(\lambda \) (or \(t\)) via the matrices \({D}\!\left( \lambda \right) \) or their inverses.
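To make this common structure concrete, here is a minimal numerical sketch (the matrices below are illustrative, not those of [4] or [11]). When \({D}\!\left( \lambda \right) \) is a scalar multiple of the identity on each diagonal block of a block-diagonal orthogonal \(Q\), the two matrices commute, so the powers of \(M = Q {D}\!\left( \lambda \right) \) are exactly rotations scaled by powers of \(\lambda \):

```python
# Minimal sketch (illustrative matrices, not the paper's): when D(lam) is a
# scalar multiple of the identity on each diagonal block of the orthogonal
# matrix Q, the two commute, so M = Q D(lam) satisfies M^k = Q^k D(lam)^k:
# powers of M are rotations scaled by powers of lam.
import numpy as np

lam = 0.5
theta = 2 * np.pi / 7

# Q: block-diagonal orthogonal matrix (a 2x2 rotation block and a 1x1 block).
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.block([[R,                np.zeros((2, 1))],
              [np.zeros((1, 2)), np.eye(1)]])

# D(lam): lam^0 on the 2x2 block, lam^1 on the 1x1 block; constant on each
# block of Q, hence commuting with Q (in the paper's examples the block
# exponents are 0, 1 and d_n, cf. (11)).
D = np.diag([1.0, 1.0, lam])

M = Q @ D
k = 6
Mk = np.linalg.matrix_power(M, k)
QkDk = np.linalg.matrix_power(Q, k) @ np.linalg.matrix_power(D, k)
print(np.allclose(Mk, QkDk))  # True: M^k = Q^k D(lam)^k
```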
Proof of Lemma 1
Let us define \(\delta _0 = 0\) and \(x_0 = \overline{x}\), and consider \(\delta _k\) and \(x_k\) defined inductively by \(\delta _k = A {f}\!\left( x_{k-1}\right) \) and \(x_k = x_{k-1} - \delta _k\). The lemma will be proved if we show that
because these bounds imply that \(x_k\) converges to \(x_\infty \) with \(\left\| x_\infty - \overline{x} \right\| _\infty \le b / (1-a)\) and \({f}\!\left( x_\infty \right) = 0\). Let us then prove (66) by induction. Equation (66) certainly holds for \(k = 0\). Assuming that (66) holds for \(k\), we conclude from (7) and the definition of \(\delta _k\) that
and the first bound in (66) holds for \(k + 1\). The second bound in (66) follows from the analogous bound for \(\left\| x_k - \overline{x} \right\| _\infty \) and the bound on \(\delta _{k+1}\) above. This shows that \(x_{k+1} \in D\) and that the segment \(S\) connecting \(x_k\) to \(x_{k+1}\) is contained in \(D\). As a result, the Mean Value Theorem applied to the function \(f_i: D \rightarrow \mathbb R{}\) implies that, for some \(\xi \in S\),
Thus, \(\left\| {f}\!\left( x_{k+1}\right) \right\| _\infty \le a \left\| {f}\!\left( x_k\right) \right\| _\infty \le a^{k+1} \left\| {f}\!\left( \overline{x}\right) \right\| _\infty \) and we are done. \(\square \)
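The mechanism of the lemma is easy to observe numerically. The following sketch (with an illustrative \(f\) and \(A\), not taken from the paper) runs the iteration \(x_k = x_{k-1} - A {f}\!\left( x_{k-1}\right) \) with a fixed matrix \(A\), here the inverse Jacobian at the starting point, and prints \(\left\| {f}\!\left( x_k\right) \right\| _\infty \), which decays geometrically as in (66):

```python
# Toy illustration of the iteration in Lemma 1 (f and A are illustrative,
# not the paper's): x_k = x_{k-1} - A f(x_{k-1}) with a fixed matrix A.
# When A approximates the inverse Jacobian well enough on D, ||f(x_k)||_inf
# decays geometrically and x_k converges to a zero of f.
import numpy as np

def f(x):
    return np.array([x[0]**2 + x[1] - 1.0, x[0] - x[1]**2])

def jacobian(x):
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, -2.0 * x[1]]])

x = np.array([1.2, 0.8])          # the starting point, playing the role of x-bar
A = np.linalg.inv(jacobian(x))    # fixed throughout the iteration

for k in range(8):
    print(k, np.linalg.norm(f(x), ord=np.inf))  # geometric decay
    x = x - A @ f(x)
```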
Proof of Theorem 1
We start by rewriting (1) and (2) as
for \(s_k := x_{k+1} - x_k\) and \(g_k := {\nabla \! f}\!\left( x_k\right) \). Equation (67) shows that \(s_k^t g_k = - \left\| M_k^t s_k \right\| ^2 / \alpha _k \le 0\) and the first Wolfe condition (3) yields
This implies that
Since \(f\) has bounded level sets, the set \(K\) is compact and the matrices \(M_k\) are bounded. Since \(f\) has continuous first derivatives, there exists a constant \(\kappa \) such that \(\left\| g_k \right\| \le \kappa \), \(\left\| M_k \right\| \le \kappa \), \(\left\| x_k \right\| \le \kappa \) and \(\left\| s_k \right\| \le \kappa \).
To prove that \(\lim _{k\rightarrow \infty } g_k = 0\) we use the well-known result that if a bounded sequence \({\left\{ {u}_{k}, \ {k} \in \mathbb N\right\} } \subset \mathbb R{}\) is such that all its convergent subsequences converge to zero, then \(u_k\) itself converges to \(0\). Let us then consider a subsequence \(g_{n_k}\) such that \(\lim _{k \rightarrow \infty } \left\| g_{n_k} \right\| = L\) and show that \(L = 0\). Equation (67) leads to \(\left\| g_k \right\| \le \kappa ^3 / \alpha _k\). Thus, if some subsequence \(\alpha _{n_{k_j}}\) of the \(\alpha _{n_k}\) converges to \(+\infty \) then \(L = \lim _{j \rightarrow \infty } \left\| g_{n_{k_j}} \right\| = 0\) and we are done. We can therefore assume that there exists \(A\) such that \(\alpha _{n_k} \le A\) for all \(k\). Equations (3), (67) and (68) yield
and
Therefore, \(\lim _{k \rightarrow \infty } M_{n_k}^t s_{n_k} = 0\). It follows that \(g_{n_k} = - M_{n_k} M_{n_k}^t s_{n_k} / \alpha _{n_k}\) converges to \(0\), because the \(M_{n_k}\) are bounded and \(\alpha _{n_k} \ge \overline{\alpha }\). \(\square \)
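The conclusion of Theorem 1 is easy to observe in practice. The sketch below is illustrative, not the paper's setting: it takes \(M_k = I\), so the search directions are steepest descent, enforces the (strong) Wolfe conditions via SciPy's line_search, and minimizes a simple function with bounded level sets; the printed gradient norms tend to zero, as the theorem asserts:

```python
# Illustrative sketch of Theorem 1's conclusion: a descent method whose steps
# satisfy the Wolfe conditions on a function with bounded level sets drives
# the gradient norms to zero. Here M_k = I, so s_k = -alpha_k * g_k.
import numpy as np
from scipy.optimize import line_search

def f(x):
    return (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2   # bounded level sets

def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

x = np.array([5.0, 3.0])
for k in range(15):
    g = grad(x)
    p = -g                                   # M_k = I: steepest-descent direction
    alpha = line_search(f, grad, x, p, c1=1e-4, c2=0.9)[0]  # Wolfe step length
    if alpha is None:                        # line search failed (not expected here)
        break
    x = x + alpha * p                        # the step s_k
    print(k, np.linalg.norm(grad(x)))        # ||g_k|| -> 0
```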
Proof of Theorem 3
This theorem is a specialization of Theorem 4 on page 145 of [11]. The reader can prove Theorem 3 by applying Theorem 4 in [11], after realizing that the hypothesis of Theorem 3 corresponds to a simplified version of the more general concept of seed, described in Definition 7 on page 142 of [11]. (There is a typo at the end of item 3 in this definition. It should read "...\({D}\!\left( 0\right) \overline{s}_r\) and \({D}\!\left( 0\right) Q \overline{s}_{r+1}.\)", with a \(Q\) between \({D}\!\left( 0\right) \) and \(\overline{s}_{r+1}\).) The reader will note that our choice of the diagonal exponents in (11) (\(0\) in the first block, \(1\) in the second and \(d_n > 2\) in the third) makes subitems (c) and (d) of the fourth item in the definition of seed vacuous. As a result, we do not need to worry about these subitems. Moreover, we only need to worry about item (b) in the case \(i = j = n\). This is why we have hypothesis (17) in Theorem 3. The reader can now follow the many details involved in the proof of Theorem 4 in [11]. This proof is very technical and there is no point in repeating it here. \(\square \)