The divergence of the BFGS and Gauss Newton methods | Mathematical Programming Skip to main content
Log in

The divergence of the BFGS and Gauss Newton methods

  • Full Length Paper
  • Series A
  • Published:
Mathematical Programming Submit manuscript

Abstract

We present examples of divergence for the BFGS and Gauss Newton methods. These examples have objective functions with bounded level sets and other properties concerning the examples published recently in this journal, like unit steps and convexity along the search lines. As these other examples, the iterates, function values and gradients in the new examples fit into the general formulation in our previous work Mascarenhas (Comput Appl Math 26(1), 2007), which also presents an example of divergence for Newton’s method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

Notes

  1. For a proof of this theorem, look at Eq. (67) in the appendix.

  2. Theorem 2 is Lemma 6 in page 150 of [11].

  3. This theorem is proved in the last paragraph of the appendix.

References

  1. Absil, P., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic functions. SIAM J. Optim. 16(2), 531–547 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  2. Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)

    MATH  Google Scholar 

  3. Cotton, É.: Sur les solutions asymptotiques des équations différentielles. Ann. Sci. de l’École Normale Supérieure, Sér. 3(28), 473–521 (1911)

    MathSciNet  Google Scholar 

  4. Dai, Y.-H.: A perfect example for the BFGS method, Published online March (2012)

  5. Dai, Y.-H.: Convergence properties of the BFGS algorithm. SIAM J. Optim. 13(3), 693–701 (2002)

    Article  MathSciNet  MATH  Google Scholar 

  6. Fefferman, C.: A generalized sharp Whitney theorem for jets. Revista Matematica Iberoamericana 21(2) (2005)

  7. Fefferman, C.: Extension of \(C^{m,\omega }\)-Smooth Functions by Linear Operators, Manuscript available at: www.math.princeton.edu/facultypapers/Fefferman. To appear in Revista Matematica Iberoamericana

  8. Hadamard, J.: Sur l’itération et les solutions asymptotiques des équations différentielles. Bull. Soc. Math. France 29, 224–228 (1901)

    MATH  Google Scholar 

  9. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48, 769–783 (1998)

    Article  MathSciNet  MATH  Google Scholar 

  10. Lojasiewicz, S.: Ensemble Semi-Analytiques. IHES Lecture notes (1965)

  11. Mascarenhas, W.F.: On the divergence of line search methods. Comput. Appl. Math. 26(1), 129–169 (2007)

    Google Scholar 

  12. Mascarenhas, W.F.: The BFGS algorithm with exact line searches fails for nonconvex functions. Math. Program. 99(1), 49–61 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  13. Mascarenhas, W.F.: Newton’s iterates can converge to non-stationary points. Math. Program. 112(2), 327–334 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  14. Nocedal, J., Wright, S.J.: Numerical Optimization, Springer Series in Operations Research. Springer, Berlin (1999)

    Google Scholar 

  15. Ortega, J.M.: The Newton–Kantorovich theorem. Am. Math. Mon. 75(6), 658–660 (1968)

    Article  MATH  Google Scholar 

  16. Perron, O.: Ueber staebilitat und Asymptotisches Verhalten der Loesung eines Systeme endlicher Differetialgleichungen. J. Reine. Angew. Math. 161, 41–64 (1929)

    MATH  Google Scholar 

  17. Poincaré, H.: Les méthodes nouvelles de la mécanique céleste. Gauthier-Villars et fils (1892)

  18. Powell, M.J.D.: Nonconvex minimization calculations and the conjugate gradient method. Lect. Notes Math. 1066, 122–141 (1984)

    Article  Google Scholar 

  19. Rally, L.B.: A comparison of the existence theorems of Kantorovich and Moore. SIAM J. Numer. Anal. 17(1), Society for Industrial and, Applied Mathematics Feb. (1980)

  20. Whitney, H.: Analytic extensions of differentiable functions defined in closed sets. Trans. Am. Math. Soc. 36, 63–89 (1934)

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Walter F. Mascarenhas.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (txt 29 KB)

Technicalities

Technicalities

This appendix begins with an explanation as to why the form of the iterates, function values and gradients in reference [4] was already described in [11]. We then prove Lemma 1 and, finally, the theorems.

Let us then see how several equations in [4] and [11] correspond to (4). In [4] and the example for Newton’s method in [11] the parameter \(d_n\) is equal to \(1\). In the example for the BFGS method in [11] it is equal to \(3\). Equation (4) corresponds to Eqs. (10)–(12) in [11]. It is generalized in Eqs. (34)–(37) of [11]. Dai [4] adds a constant \(f^*\) to \(f\), but this is irrelevant. It calls \(\lambda \) by \(t\) and thinks in terms of the steps \(\delta _k = x_{k+1} - x_k\), so that \(\delta _k = Q^k {D}\!\left( \lambda \right) ^k \overline{\delta }\) in [4]’s equation (2.1). This is just an affine change of coordinates of (4) and only a different notation for the corresponding \(s_k\)’s used in [11]. The matrix \(M\) in [4]’s equation (2.2) is equal to the matrix \(Q {D}\!\left( \lambda \right) \), whose \(k\)th power multiplies \(\overline{x}_k\) in Eq. (4). The matrix \(P\) in [4]’s equation (2.5) is equal to the matrix \(\lambda Q {D}\!\left( \lambda \right) ^{-1}\) whose \(k\)th power multiplies \(\overline{g}_k\) in Eq. (4). Therefore, the iterates, the function values and the gradients are described in essentially the same way in the examples in [4, 11, 12], in terms of orthogonal matrices \(Q\) scaled by powers of \(\lambda \) (or \(t\)) via the matrices \({D}\!\left( \lambda \right) \) or their inverses.

Proof of Lemma 1

Let us define \(\delta _0 = 0\hbox { and }x_0 = \overline{x}\) and consider \(\delta _k\hbox { and }x_k\) defined inductively by \(\delta _k = A {f}\!\left( x_{k-1}\right) \hbox { and }x_k = x_{k-1} - \delta _k\). The lemma will be proved if we show that

$$\begin{aligned} \left\| \delta _{k} \right\| _\infty \le a^{k-1} b, \quad \left\| x_{k+1} - \overline{x} \right\| \le \frac{1 - a^k}{1 - a} b \quad \mathrm {and} \quad \left\| {f}\!\left( x_k\right) \right\| _\infty \le a^k \left\| {f}\!\left( \overline{x}\right) \right\| _\infty ,\nonumber \\ \end{aligned}$$
(66)

because these bounds imply that \(x_k\) converges to \(x_\infty \) with \(\left\| x_\infty - \overline{x} \right\| _\infty \le b / (1-a)\hbox { and }{f}\!\left( x_\infty \right) = 0\). Let us then prove (66) by induction. Equation (66) certainly holds for \(k = 0\). Assuming that (66) holds for \(k\), we conclude from (7) and definition of \(\delta _k\) that

$$\begin{aligned} \left\| \delta _{k+1} \right\| _\infty \le \sup _{1 \le i \le n} \left\| A^t e_i \right\| _1 \left\| {f}\!\left( x_k\right) \right\| _\infty \le a^{k} \left\| A^t e_i \right\| _1 \left\| {f}\!\left( \overline{x}\right) \right\| _\infty \le a^k b \end{aligned}$$

and the first bound in (66) holds for \(k + 1\). The second bound on (66) follows from the analogous bound for \(\left\| x_k - \overline{x} \right\| _\infty \) and the bound in \(\delta _{k+1}\) above. It shows that \(x_{k+1} \in D\) and the segment \(S\) connecting \(x_k\) to \(x_{k+1}\) is contained in \(D\). As a result, the Mean Value Theorem for the function \(f_i: D \mapsto \mathbb R{}\) implies that, for some \(\xi \in S\),

$$\begin{aligned} \left| {f_i}\!\left( x_{k+1}\right) \right|&= \left| {f_i}\!\left( x_k - A {f}\!\left( x_k\right) \right) \right| = \left| {f_i}\!\left( x_k\right) - {\nabla \! f_i}\!\left( \xi \right) ^t A {f}\!\left( x_k\right) \right| \\&= \left| \left( A^t {\nabla \! f_i}\!\left( \xi \right) \!-\! e_i \right) ^t {f}\!\left( x_k\right) \right| \!\le \! \left| A^t {\nabla \! f_i}\!\left( \xi \right) \!-\! e_i\right| _1 \left\| {f}\!\left( x_k\right) \right\| _\infty \le a \left\| {f}\!\left( x_k\right) \right\| _{\infty }\!. \end{aligned}$$

Thus, \(\left| {f}\!\left( x_{k+1}\right) \right| _\infty \le a \left\| {f}\!\left( x_k\right) \right\| _\infty \le a^{k+1} \left\| {f}\!\left( \overline{x}\right) \right\| _\infty \) and we are done. \(\square \)

Proof of Theorem 1

We start by rewriting (1) and (2) as

$$\begin{aligned} M_k M_{k}^t s_k = - \alpha _k g_k, \end{aligned}$$
(67)

for \(s_k := x_{k+1} - x_k\hbox { and }g_k := {\nabla \! f}\!\left( x_k\right) \). Equation (67) shows that \(s_k^t g_k = - \left\| M_k^t s_k \right\| ^2\!{/} \alpha _k \le 0\) and the first Wolfe condition (3) yields

$$\begin{aligned} f_k := {f}\!\left( x_k\right) \ge f_{k+1}. \end{aligned}$$
(68)

This implies that

$$\begin{aligned} x_k \in K := {\left\{ x \in {\mathbb R}^{n} \ \mathrm {with} \ {f}\!\left( x\right) \le {f}\!\left( x_0\right) \right\} }. \end{aligned}$$

Since \(f\) has bounded level sets, the set \(K\) is compact and the matrices \(M_k\) are bounded. Since \(f\) has continuous first derivatives there exists a constant \(\kappa \) such that \(\left\| g \right\| _k \le \kappa ,\,\left\| M_k \right\| \le \kappa ,\,\left\| x_k \right\| \le \kappa \hbox { and }\left\| s_k \right\| \le \kappa \).

To prove that \(\lim _{k\rightarrow \infty } g_k = 0\) we use the well known result that if a bounded sequence \({\left\{ {u}_{k}, \ {k} \in \mathbb N\right\} } \subset \mathbb R{}\) is such that all its converging subsequences \(u_{n_k}\) converge to zero then \(u_k\) itself converges to \(0\). Let us then consider a subsequence \(g_{n_k}\) such that \(\lim _{k \rightarrow \infty } \left\| g_{n_k} \right\| = L\) and show that \(L = 0\). Equation (67) leads to \(\left\| g_k \right\| \le \kappa ^3 / \alpha _k\). Thus, if some sub subsequence of \(\alpha _{n_{k_j}}\) converges to \(+\infty \) then \(L = \lim _{j \rightarrow \infty } \left\| g_{n_{k_j}} \right\| = 0\) and we are done. We can then assume that there exists \(A\) such that \(\alpha _{n_k} \le A\) for all \(k\). Equations (3), (67) and (68) yield

$$\begin{aligned} \left\| M_{n_k}^t s_{n_k} \right\| ^2&\le - \alpha _{n_k} s_{n_k}^t g_{n_k} \le \alpha _{n_k} \left( {f}\!\left( x_{n_k}\right) - {f}\!\left( x_{n_k+1}\right) \right) / \sigma \\&\le \alpha _{n_k} \left( {f}\!\left( x_{n_k}\right) - {f}\!\left( x_{n_{k+1}}\right) \right) / \sigma \le A \left( {f}\!\left( x_{n_k}\right) - {f}\!\left( x_{n_{k+1}}\right) \right) / \sigma \end{aligned}$$

and

$$\begin{aligned} \sum _{k = 0}^\infty \left\| M_{n_k}^t s_{n_k} \right\| ^2 \le \frac{A}{\sigma } \left( {f}\!\left( x_{n_0}\right) - \inf _{x \in K} {f}\!\left( x\right) \right) . \end{aligned}$$

Therefore, \(\lim _{k \rightarrow \infty } M_{n_k}^t s_{n_k} = 0\). If follows that \(g_{n_k} = - M_{n_k} M_{n_k}^t s_{n_k} / \alpha _{n_k}\) converges to \(0\), because the \(M_{n_k}\) are bounded and \(\alpha _{n_k} \ge \overline{\alpha }\). \(\square \)

Proof of Theorem 3

This theorem is an specialization of Theorem 4 in page 145 of [11]. The reader can prove Theorem 3 by using Theorem 4 in [11] by realizing that the hypothesis of Theorem 3 corresponds to a simplified version of the more general concept of seed, which is described in definition 7 in page 142 of [11] (There is a typo in the end of the item 3 in this definition. It should read like “...\({D}\!\left( 0\right) \overline{s}_r\) and \({D}\!\left( 0\right) Q \overline{s}_{r+1}.\)”, with a \(Q\) between \({D}\!\left( 0\right) \hbox { and }\overline{s}_{r+1}\).) The reader will note that our choice for the diagonal exponents in (11) (\(0\) in the first block, \(1\) in the second and \(d_n > 2\)) leads to vacuous sub itens (c) and (d) in the fouth item in the definition of seed. As a result, we do not need to worry about these items. Morevoer, we only need to worry about item (b) in the case \(i = j = n\). This is why we have hypothesis (17) in Theorem 3. The reader can now read the many details and involved proving Theorem [11]. This proof is very technical and there is no point repeating it here. \(\square \)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mascarenhas, W.F. The divergence of the BFGS and Gauss Newton methods. Math. Program. 147, 253–276 (2014). https://doi.org/10.1007/s10107-013-0720-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10107-013-0720-6

Mathematics Subject Classification

Navigation