Abstract
We present examples of divergence for the BFGS and Gauss–Newton methods. These examples have objective functions with bounded level sets and share other properties with the examples published recently in this journal, like unit steps and convexity along the search lines. As in those examples, the iterates, function values and gradients in the new examples fit the general formulation in our previous work, Mascarenhas (Comput Appl Math 26(1), 2007), which also presents an example of divergence for Newton's method.
References
Absil, P., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic functions. SIAM J. Optim. 16(2), 531–547 (2005)
Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
Cotton, É.: Sur les solutions asymptotiques des équations différentielles. Ann. Sci. de l’École Normale Supérieure, Sér. 3(28), 473–521 (1911)
Dai, Y.-H.: A perfect example for the BFGS method. Published online, March 2012
Dai, Y.-H.: Convergence properties of the BFGS algorithm. SIAM J. Optim. 13(3), 693–701 (2002)
Fefferman, C.: A generalized sharp Whitney theorem for jets. Revista Matemática Iberoamericana 21(2) (2005)
Fefferman, C.: Extension of \(C^{m,\omega }\)-smooth functions by linear operators. Manuscript available at: www.math.princeton.edu/facultypapers/Fefferman. To appear in Revista Matemática Iberoamericana
Hadamard, J.: Sur l’itération et les solutions asymptotiques des équations différentielles. Bull. Soc. Math. France 29, 224–228 (1901)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48, 769–783 (1998)
Lojasiewicz, S.: Ensembles Semi-Analytiques. IHES Lecture Notes (1965)
Mascarenhas, W.F.: On the divergence of line search methods. Comput. Appl. Math. 26(1), 129–169 (2007)
Mascarenhas, W.F.: The BFGS algorithm with exact line searches fails for nonconvex functions. Math. Program. 99(1), 49–61 (2004)
Mascarenhas, W.F.: Newton’s iterates can converge to non-stationary points. Math. Program. 112(2), 327–334 (2008)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research. Springer, New York (1999)
Ortega, J.M.: The Newton–Kantorovich theorem. Am. Math. Mon. 75(6), 658–660 (1968)
Perron, O.: Über Stabilität und asymptotisches Verhalten der Lösungen eines Systems endlicher Differenzengleichungen. J. Reine Angew. Math. 161, 41–64 (1929)
Poincaré, H.: Les méthodes nouvelles de la mécanique céleste. Gauthier-Villars et fils (1892)
Powell, M.J.D.: Nonconvex minimization calculations and the conjugate gradient method. Lect. Notes Math. 1066, 122–141 (1984)
Rall, L.B.: A comparison of the existence theorems of Kantorovich and Moore. SIAM J. Numer. Anal. 17(1) (1980)
Whitney, H.: Analytic extensions of differentiable functions defined in closed sets. Trans. Am. Math. Soc. 36, 63–89 (1934)
Technicalities
This appendix begins with an explanation of why the form of the iterates, function values and gradients in reference [4] was already described in [11]. We then prove Lemma 1 and, finally, the theorems.
Let us then see how several equations in [4] and [11] correspond to (4). In [4] and in the example for Newton's method in [11] the parameter \(d_n\) is equal to \(1\); in the example for the BFGS method in [11] it is equal to \(3\). Equation (4) corresponds to Eqs. (10)–(12) in [11] and is generalized in Eqs. (34)–(37) of [11]. Dai [4] adds a constant \(f^*\) to \(f\), but this is irrelevant. He denotes our \(\lambda \) by \(t\) and works in terms of the steps \(\delta _k = x_{k+1} - x_k\), so that \(\delta _k = Q^k {D}\!\left( \lambda \right) ^k \overline{\delta }\) in [4]'s equation (2.1). This is just an affine change of coordinates in (4) and a different notation for the corresponding \(s_k\)'s used in [11]. The matrix \(M\) in [4]'s equation (2.2) equals the matrix \(Q {D}\!\left( \lambda \right) \), whose \(k\)th power multiplies \(\overline{x}_k\) in Eq. (4). The matrix \(P\) in [4]'s equation (2.5) equals the matrix \(\lambda Q {D}\!\left( \lambda \right) ^{-1}\), whose \(k\)th power multiplies \(\overline{g}_k\) in Eq. (4). Therefore, the iterates, function values and gradients are described in essentially the same way in the examples in [4, 11, 12]: in terms of orthogonal matrices \(Q\) scaled by powers of \(\lambda \) (or \(t\)) via the matrices \({D}\!\left( \lambda \right) \) or their inverses.
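To make this common structure concrete, here is a minimal numerical sketch (the matrices below are illustrative, not those of [4] or [11]). When \({D}\!\left( \lambda \right) \) is a scalar multiple of the identity on each diagonal block of a block-diagonal orthogonal \(Q\), the two matrices commute, so the powers of \(M = Q {D}\!\left( \lambda \right) \) are exactly rotations scaled by powers of \(\lambda \):

```python
# Minimal sketch (illustrative matrices, not the paper's): when D(lam) is a
# scalar multiple of the identity on each diagonal block of the orthogonal
# matrix Q, the two commute, so M = Q D(lam) satisfies M^k = Q^k D(lam)^k:
# powers of M are rotations scaled by powers of lam.
import numpy as np

lam = 0.5
theta = 2 * np.pi / 7

# Q: block-diagonal orthogonal matrix (a 2x2 rotation block and a 1x1 block).
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.block([[R,                np.zeros((2, 1))],
              [np.zeros((1, 2)), np.eye(1)]])

# D(lam): lam^0 on the 2x2 block, lam^1 on the 1x1 block; constant on each
# block of Q, hence commuting with Q (in the paper's examples the block
# exponents are 0, 1 and d_n, cf. (11)).
D = np.diag([1.0, 1.0, lam])

M = Q @ D
k = 6
Mk = np.linalg.matrix_power(M, k)
QkDk = np.linalg.matrix_power(Q, k) @ np.linalg.matrix_power(D, k)
print(np.allclose(Mk, QkDk))  # True: M^k = Q^k D(lam)^k
```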
Proof of Lemma 1
Let us define \(\delta _0 = 0\) and \(x_0 = \overline{x}\), and consider \(\delta _k\) and \(x_k\) defined inductively by \(\delta _k = A {f}\!\left( x_{k-1}\right) \) and \(x_k = x_{k-1} - \delta _k\). The lemma will be proved if we show that
because these bounds imply that \(x_k\) converges to \(x_\infty \) with \(\left\| x_\infty - \overline{x} \right\| _\infty \le b / (1-a)\) and \({f}\!\left( x_\infty \right) = 0\). Let us then prove (66) by induction. Equation (66) certainly holds for \(k = 0\). Assuming that (66) holds for \(k\), we conclude from (7) and the definition of \(\delta _k\) that
and the first bound in (66) holds for \(k + 1\). The second bound in (66) follows from the analogous bound for \(\left\| x_k - \overline{x} \right\| _\infty \) and the bound on \(\delta _{k+1}\) above. This shows that \(x_{k+1} \in D\) and that the segment \(S\) connecting \(x_k\) to \(x_{k+1}\) is contained in \(D\). As a result, the Mean Value Theorem applied to the function \(f_i: D \rightarrow \mathbb R{}\) implies that, for some \(\xi \in S\),
Thus, \(\left\| {f}\!\left( x_{k+1}\right) \right\| _\infty \le a \left\| {f}\!\left( x_k\right) \right\| _\infty \le a^{k+1} \left\| {f}\!\left( \overline{x}\right) \right\| _\infty \) and we are done. \(\square \)
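The mechanism of the lemma is easy to observe numerically. The following sketch (with an illustrative \(f\) and \(A\), not taken from the paper) runs the iteration \(x_k = x_{k-1} - A {f}\!\left( x_{k-1}\right) \) with a fixed matrix \(A\), here the inverse Jacobian at the starting point, and prints \(\left\| {f}\!\left( x_k\right) \right\| _\infty \), which decays geometrically as in (66):

```python
# Toy illustration of the iteration in Lemma 1 (f and A are illustrative,
# not the paper's): x_k = x_{k-1} - A f(x_{k-1}) with a fixed matrix A.
# When A approximates the inverse Jacobian well enough on D, ||f(x_k)||_inf
# decays geometrically and x_k converges to a zero of f.
import numpy as np

def f(x):
    return np.array([x[0]**2 + x[1] - 1.0, x[0] - x[1]**2])

def jacobian(x):
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, -2.0 * x[1]]])

x = np.array([1.2, 0.8])          # the starting point, playing the role of x-bar
A = np.linalg.inv(jacobian(x))    # fixed throughout the iteration

for k in range(8):
    print(k, np.linalg.norm(f(x), ord=np.inf))  # geometric decay
    x = x - A @ f(x)
```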
Proof of Theorem 1
We start by rewriting (1) and (2) as
for \(s_k := x_{k+1} - x_k\) and \(g_k := {\nabla \! f}\!\left( x_k\right) \). Equation (67) shows that \(s_k^t g_k = - \left\| M_k^t s_k \right\| ^2 / \alpha _k \le 0\) and the first Wolfe condition (3) yields
This implies that
Since \(f\) has bounded level sets, the set \(K\) is compact and the matrices \(M_k\) are bounded. Since \(f\) has continuous first derivatives, there exists a constant \(\kappa \) such that \(\left\| g_k \right\| \le \kappa \), \(\left\| M_k \right\| \le \kappa \), \(\left\| x_k \right\| \le \kappa \) and \(\left\| s_k \right\| \le \kappa \).
To prove that \(\lim _{k\rightarrow \infty } g_k = 0\) we use the well-known result that if a bounded sequence \({\left\{ {u}_{k}, \ {k} \in \mathbb N\right\} } \subset \mathbb R{}\) is such that all its convergent subsequences converge to zero, then \(u_k\) itself converges to \(0\). Let us then consider a subsequence \(g_{n_k}\) such that \(\lim _{k \rightarrow \infty } \left\| g_{n_k} \right\| = L\) and show that \(L = 0\). Equation (67) leads to \(\left\| g_k \right\| \le \kappa ^3 / \alpha _k\). Thus, if some subsequence \(\alpha _{n_{k_j}}\) of the \(\alpha _{n_k}\) converges to \(+\infty \) then \(L = \lim _{j \rightarrow \infty } \left\| g_{n_{k_j}} \right\| = 0\) and we are done. We can therefore assume that there exists \(A\) such that \(\alpha _{n_k} \le A\) for all \(k\). Equations (3), (67) and (68) yield
and
Therefore, \(\lim _{k \rightarrow \infty } M_{n_k}^t s_{n_k} = 0\). It follows that \(g_{n_k} = - M_{n_k} M_{n_k}^t s_{n_k} / \alpha _{n_k}\) converges to \(0\), because the \(M_{n_k}\) are bounded and \(\alpha _{n_k} \ge \overline{\alpha }\). \(\square \)
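The conclusion of Theorem 1 is easy to observe in practice. The sketch below is illustrative, not the paper's setting: it takes \(M_k = I\), so the search directions are steepest descent, enforces the (strong) Wolfe conditions via SciPy's line_search, and minimizes a simple function with bounded level sets; the printed gradient norms tend to zero, as the theorem asserts:

```python
# Illustrative sketch of Theorem 1's conclusion: a descent method whose steps
# satisfy the Wolfe conditions on a function with bounded level sets drives
# the gradient norms to zero. Here M_k = I, so s_k = -alpha_k * g_k.
import numpy as np
from scipy.optimize import line_search

def f(x):
    return (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2   # bounded level sets

def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

x = np.array([5.0, 3.0])
for k in range(15):
    g = grad(x)
    p = -g                                   # M_k = I: steepest-descent direction
    alpha = line_search(f, grad, x, p, c1=1e-4, c2=0.9)[0]  # Wolfe step length
    if alpha is None:                        # line search failed (not expected here)
        break
    x = x + alpha * p                        # the step s_k
    print(k, np.linalg.norm(grad(x)))        # ||g_k|| -> 0
```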
Proof of Theorem 3
This theorem is a specialization of Theorem 4 on page 145 of [11]. The reader can prove Theorem 3 by applying Theorem 4 in [11], after realizing that the hypothesis of Theorem 3 corresponds to a simplified version of the more general concept of seed, described in Definition 7 on page 142 of [11]. (There is a typo at the end of item 3 in this definition. It should read "...\({D}\!\left( 0\right) \overline{s}_r\) and \({D}\!\left( 0\right) Q \overline{s}_{r+1}.\)", with a \(Q\) between \({D}\!\left( 0\right) \) and \(\overline{s}_{r+1}\).) The reader will note that our choice of the diagonal exponents in (11) (\(0\) in the first block, \(1\) in the second and \(d_n > 2\) in the third) makes subitems (c) and (d) of the fourth item in the definition of seed vacuous. As a result, we do not need to worry about these subitems. Moreover, we only need to worry about item (b) in the case \(i = j = n\). This is why we have hypothesis (17) in Theorem 3. The reader can now follow the many details involved in the proof of Theorem 4 in [11]. This proof is very technical and there is no point in repeating it here. \(\square \)