Abstract
The solution of nonlinear inverse problems is a challenging task in numerical analysis. In most cases, problems of this kind are solved by iterative procedures that, at each iteration, linearize the problem in a neighborhood of the currently available approximation of the solution. The linearized problem is then solved by a direct or iterative method. Among this class of solution methods, the Gauss–Newton method is one of the most popular. We propose an efficient implementation of this method for large-scale problems. Our implementation is based on projecting the nonlinear problem into a sequence of nested subspaces, referred to as Generalized Krylov Subspaces, whose dimension increases with the number of iterations, except when restarts are carried out. When the computation of the Jacobian matrix is expensive, we combine our iterative method with secant (Broyden) updates to further reduce the computational cost. We show convergence of the proposed solution methods and provide a few numerical examples that illustrate their performance.
1 Introduction
This paper is concerned with the solution of nonlinear least-squares problems of the form
where \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is a differentiable nonlinear function, \(\textbf{x}\in \mathbb {R}^n\) denotes the unknown vector that we would like to determine, \(\textbf{y}\in \mathbb {R}^m\) represents measured data, and \(\left\| \cdot \right\| \) stands for the Euclidean vector norm. Problems of this kind arise in several fields of science and engineering, such as image reconstruction and geophysics. In some applications, the solution of the problem (1.1) is very sensitive to perturbations, e.g., to errors in the data vector \(\textbf{y}\); these errors may, for instance, be caused by measurement inaccuracies. In this paper, we only consider problems that are not overly sensitive to perturbations; if the given problem is very sensitive to perturbations in \(\textbf{y}\), then we assume that the problem has been regularized to make the solution less sensitive. For an example of a well-posed problem of the form (1.1), we refer the reader to the so-called Bratu problem; see, e.g., [29] and Sect. 4. If the problem is ill-posed, a simple approach to regularize it is to use Tikhonov regularization (see, e.g., [21]), i.e., to substitute (1.1) with
for a suitable \(\mu >0\). In the following, we will assume that this substitution, if required, has been performed.
We consider the Gauss–Newton method [20, 34] for the solution of (1.1). This is an iterative method that at each step solves a linear least-squares problem obtained from a first-order Taylor expansion of the nonlinear function \(r(\textbf{x})=\textbf{y}-f(\textbf{x})\). Convergence is ensured by the use of the Armijo-Goldstein condition; see below. When the problem (1.1) is underdetermined, i.e., when n is larger than m, the solution is not unique. Pes and Rodriguez [32, 33] describe a modification of the Gauss–Newton method that can be applied in this situation to determine the minimal-norm solution; they also describe some regularized variants of the Gauss–Newton method.
Nevertheless, we would like to briefly discuss the situation when the Jacobian matrix is severely ill-conditioned or even rank-deficient. Since we project a linearization of the (large) problem (1.1) into solution subspaces of fairly small dimension, the projected problems usually are better conditioned than the original large problem. As will be illustrated by numerical examples in Sect. 4, we are able to solve ill-conditioned problems quite accurately, when the data \(\textbf{y}\) are not corrupted by noise, by employing projection into a small linear subspace as the only type of regularization. This assumption corresponds to committing an inverse crime. Nevertheless, we illustrate that our solution method is less sensitive to ill-conditioning than the standard Gauss–Newton method, and that, when the problem (1.1) is severely ill-conditioned, the standard Gauss–Newton method may fail to determine an accurate solution even under the scenario of an inverse crime.
The main computational effort of the Gauss–Newton method is the solution of linear least-squares problems of possibly large dimensions at each iteration. We would like to address this issue by using Generalized Krylov Subspaces (GKS). The subspaces we consider were first applied by Lampe et al. [30] to the solution of large linear discrete ill-posed problems by Tikhonov regularization. Subsequently they were used for the solution of nonconvex minimization problems in imaging and statistics; see, e.g., [9, 10, 27, 31]. The approximate solution \(\textbf{x}^{(k)}\in \mathbb {R}^n\) computed at iteration k lives in a generalized Krylov subspace \(\mathcal {V}_{k-1}\), and a projected residual error is included in the next generalized Krylov subspace \(\mathcal {V}_k\). The solution subspaces generated in this manner are nested and typically contain accurate approximations of the desired solution of (1.1) already when they are of fairly small dimension; in particular, the use of generalized Krylov subspaces performs better than using standard Krylov subspaces; see [7, 31] for illustrations.
Another possible bottleneck of the Gauss–Newton method for large-scale problems is the computation of the Jacobian matrix at each iteration. If this is very time-consuming, then instead of forming the Jacobian by evaluating all of its entries at each iteration, we approximate the Jacobian by using a secant (Broyden) update; see [6, 20]. When the iterates \(\textbf{x}^{(k)}\) and \(\textbf{x}^{(k+1)}\) are close, the secant update allows us to approximate the Jacobian of f at \(\textbf{x}^{(k+1)}\) by the Jacobian (or an approximation thereof) at \(\textbf{x}^{(k)}\) plus a low-rank correction. Therefore, we update the Jacobian for \(\tilde{k}\) iterations, and compute the Jacobian “from scratch” only when \(k\equiv 0\;\text{ mod }\; \tilde{k}\).
Finally, we observe that, if many iterations of the algorithms are performed, the dimension of the space \(\mathcal {V}_k\) may increase significantly and this, in turn, can slow down the computations. To avoid this situation, we propose a restarted variant based on the algorithms proposed in [11]. Every \(k_{\textrm{rest}}\) iterations the space \(\mathcal {V}_k\) is restarted, i.e., it is replaced by a one-dimensional space; see below. This strategy ensures that the dimension of the space \(\mathcal {V}_k\) is bounded regardless of the number of iterations. We will illustrate that this reduces the computing time.
It is the purpose of this paper to show that projection into generalized Krylov subspaces may be beneficial when solving large nonlinear least-squares problems (1.1). This approach is compared to a standard implementation of the Gauss–Newton method. We remark that many approaches to Newton-Krylov methods have been discussed in the literature; among the first papers on this topic are those written by Brown and Saad [4, 5], who consider the application of Arnoldi methods, which requires \(m=n\). Darvishi and Shin [13] and Kan et al. [28] provide more recent contributions. However, these methods do not focus on nonlinear least-squares problems and, moreover, consider the application of standard Krylov subspaces.
This paper is organized as follows. Section 2 briefly recalls the classical Gauss–Newton method. Our new solution approaches are described in Sect. 3, where we also show some properties of these methods. Section 4 presents a few numerical examples that illustrate the performance of our methods, and concluding remarks can be found in Sect. 5.
2 The Gauss–Newton Method
This section reviews the classical Gauss–Newton method applied to the solution of nonlinear least-squares problems (1.1). Let \(r(\textbf{x})=\textbf{y}-f(\textbf{x})\) be the residual function, where \(\textbf{x}\in \mathbb {R}^n\) represents an approximate solution of the problem (1.1), the function \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is nonlinear and Fréchet differentiable, and the vector \(\textbf{y}\in \mathbb {R}^m\) represents available data. We disregard for the moment that in some applications, the problem (1.1) might be severely ill-conditioned. The Gauss–Newton method computes the solution \(\textbf{x}^*\) of (1.1) by minimizing the Euclidean norm of the residual \(r(\textbf{x})\), i.e., it solves
where, for simplicity, we assume that the solution \(\textbf{x}^*\) is unique. The Gauss–Newton method is an iterative procedure that, at each iteration, linearizes the problem (1.1) and minimizes the norm of the residual of the linearized problem. Since we assumed that f is Fréchet differentiable, it follows that \(r(\textbf{x})\) satisfies this property as well. Therefore, we use the approximation
where \(\textbf{x}^{(k)}\) is the current approximation of \(\textbf{x}^*\). The vector \(\textbf{q}^{(k)}\) is referred to as the step, and \(J^{(k)} = J(\textbf{x}^{(k)}) \in \mathbb {R}^{m\times n}\) represents the Jacobian matrix of \(r(\textbf{x}) = [r_1(\textbf{x}),\ldots ,r_m(\textbf{x})]^T\) at \(\textbf{x}=\textbf{x}^{(k)}\). It is defined by
Let \(\textbf{x}^{(0)}\in \mathbb {R}^n\) be an initial approximation of the solution \(\textbf{x}^*\) of the nonlinear least-squares problem (2.1). We determine, for \(k=0,1,2,\ldots ~\), the step \(\textbf{q}^{(k)}\) by solving the linear least-squares problem
and we compute the next approximation \(\textbf{x}^{(k+1)}\) of \(\textbf{x}^*\) as
where the coefficient \(\alpha ^{(k)}\) is determined according to the Armijo-Goldstein principle, see below, to ensure convergence to a stationary point of
see, e.g., [20, 25] for discussions. Since the functional \(\mathcal {J}\) might not be convex, it is in general not possible to show that the iterates \(\textbf{x}^{(k)}\) converge to a minimizer of \(\mathcal {J}\). We remark that if one knows that the solution \(\textbf{x}^*\) is nonnegative, then it is possible to determine \(\alpha ^{(k)}\) such that \(\textbf{x}^{(k)}\ge 0\) for all k. We will not dwell on this situation here, but we note that it is straightforward to extend our solution method to compute nonnegative solutions.
We say that \(\alpha ^{(k)}\) satisfies the Armijo-Goldstein condition if
To determine a suitable \(\alpha ^{(k)}\), we apply a line search. Let \(\alpha _0^{(k)}>0\) be large enough. If \(\alpha _0^{(k)}\) satisfies the condition (2.2), then we set \(\alpha ^{(k)}=\alpha _0^{(k)}\), otherwise we let
We iterate in this manner until we find an \(\alpha _j^{(k)}\) that satisfies the condition (2.2) and then set \(\alpha ^{(k)}=\alpha _j^{(k)}\). The Gauss–Newton method with line search based on the Armijo-Goldstein condition yields a converging sequence of iterates \(\textbf{x}^{(k)}\), \(k=0,1,2,\ldots ~\), for any differentiable function f; see [25] for a proof. Algorithm 1 summarizes the computations.
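For illustration, the following is a minimal sketch of such a damped Gauss–Newton iteration with backtracking. It is not the paper's Algorithm 1: a standard Armijo-type sufficient-decrease test is used in place of condition (2.2), and the stopping rule is a simple relative-change criterion; the names `f`, `jac_f`, and the tolerances are illustrative assumptions.

```python
import numpy as np

def gauss_newton(f, jac_f, y, x0, max_iter=100, tol=1e-5, c=1e-4):
    """Minimal damped Gauss-Newton sketch (illustrative, not the paper's code).

    f, jac_f : callables returning f(x) and the Jacobian J_f(x);
    y        : data vector; x0 : initial approximation x^(0).
    A standard Armijo sufficient-decrease test stands in for condition (2.2).
    """
    x = x0.astype(float).copy()
    for k in range(max_iter):
        r = y - f(x)                       # residual r(x) = y - f(x)
        J = -jac_f(x)                      # Jacobian of r(x) is -J_f(x)
        q, *_ = np.linalg.lstsq(J, -r, rcond=None)   # step: min_q ||r + J q||
        phi = r @ r                        # objective ||y - f(x)||^2
        slope = 2.0 * (J.T @ r) @ q        # directional derivative along q
        alpha = 1.0
        while True:                        # backtracking line search
            r_new = y - f(x + alpha * q)
            if r_new @ r_new <= phi + c * alpha * slope or alpha < 1e-12:
                break
            alpha *= 0.5                   # alpha_{j+1} = alpha_j / 2
        x_new = x + alpha * q
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1.0):
            return x_new
        x = x_new
    return x
```

Here `f` and `jac_f` stand for a user-supplied residual model and its Jacobian; for ill-conditioned or large-scale problems one would instead use the regularized or projected variants discussed in the following sections.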
Theorem 1
Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\) be a Fréchet differentiable function and let \(\textbf{x}^{(k)}\), \(k=1,2,\ldots ~\), denote the iterates generated by Algorithm 1. There exists \(\textbf{x}^*\) such that
and \(\textbf{x}^*\) is a stationary point of \(\mathcal {J}(\textbf{x})=\left\| \textbf{y}-f(\textbf{x})\right\| ^2\). If \(\mathcal {J}\) is convex, then \(\textbf{x}^*\) is a global minimizer of \(\mathcal {J}\). Moreover, if \(\mathcal {J}\) is strictly convex, then there is a unique global minimizer that coincides with \(\textbf{x}^*\).
Proof
3 The Gauss–Newton Method in Generalized Krylov Subspaces
We would like to reduce the computational cost of the iterations with Algorithm 1 when applied to the solution of large-scale problems (1.1). We achieve this by determining approximate solutions of the problem (2.1) in a sequence of fairly low-dimensional solution subspaces \(\mathcal {V}_k\), \(k=1,2,\ldots ~\), whose dimensions increase with k. At iteration k, we first compute the approximation \(\textbf{x}^{(k+1)}\in \mathcal {V}_k\) of the solution \(\textbf{x}^*\) of (2.1), and then expand the solution subspace by including an appropriately chosen vector in \(\mathcal {V}_k\). This defines the next solution subspace \(\mathcal {V}_{k+1}\). Since \(d_k=\textrm{dim}(\mathcal {V}_k)\ll n\), the computation of \(\textbf{x}^{(k+1)}\) is less expensive than computing a new iterate with the standard Gauss–Newton method, which seeks to determine a solution in \(\mathbb {R}^n\).
Assume that at the k-th step an approximation \(\textbf{x}^{(k)}\in \mathcal {V}_{k-1}\subset \mathcal {V}_k\) of the solution \(\textbf{x}^*\) is known. Let the columns of the matrix \(V_k\in \mathbb {R}^{n\times d_k}\) form an orthonormal basis for the subspace \(\mathcal {V}_k\). Then
where \(I_{d_k}\) denotes the identity matrix of order \(d_k\), \(\textbf{z}^{(k)}\in \mathbb {R}^{d_k}\), and the last entry of \(\textbf{z}^{(k)}\) vanishes since \(\textbf{x}^{(k)}\in \mathcal {V}_{k-1}\).
The residual at the k-th step is given by
We determine a new approximate solution \(\textbf{x}^{(k+1)}\in \mathcal {V}_k\) as
where the vector \(\textbf{z}^{(k+1)}\) is computed by applying the Gauss–Newton method to (3.1), i.e.,
and the step \(\textbf{q}^{(k)}\) is obtained by solving the least-squares problem
With a slight abuse of notation, we denote the step by \(\textbf{q}\), as in the previous section, even though the vectors \(\textbf{q}\) in this section have a different dimension, which depends on k. The matrix \(J^{(k)}\in \mathbb {R}^{m\times d_k}\) represents the Jacobian of r in (3.1) evaluated at the point \(\textbf{z}^{(k)}\). We have
where \(J_f\) denotes the Jacobian matrix of the nonlinear function f. To ensure convergence, the parameter \(\alpha ^{(k)}\) in (3.2) is determined by applying the Armijo-Goldstein principle, i.e., \(\alpha ^{(k)}\) is the largest member of the sequence \(\alpha ^{(0)}2^{-i}\), \(i=0,1,\ldots ~\), for which
In our numerical experiments we set \(\alpha ^{(0)}=1\).
Once the new iterate \(\textbf{x}^{(k+1)}\) has been evaluated, we enlarge the solution subspace \(\mathcal {V}_k\) by determining a vector \(\textbf{v}^{(k+1)}\) that is orthogonal to \(\mathcal {V}_{k}\). To this end, let us first compute the Jacobian \(J_f\left( \textbf{x}^{(k+1)}\right) \) of f at \(\textbf{x}^{(k+1)}\). This computation also is required for determining \(\textbf{x}^{(k+2)}\) and, therefore, is carried out only once. Let
The intuition is that \(\textbf{g}^{(k+1)}\) is the residual of the normal equations associated with the linearized problem. However, in general, this vector is not orthogonal to the subspace \(\mathcal {V}_k\). We therefore explicitly orthogonalize it against this subspace, i.e., we compute
and then normalize the vector so obtained,
We store an orthonormal basis for \(\mathcal {V}_{k+1}\) as columns of the matrix \(V_{k+1}=\left[ V_k,\textbf{v}^{(k+1)}\right] \in \mathbb {R}^{n\times (d_k+1)}\).
Although this is an extremely rare event when solving problems that arise from real applications, it is possible that \(\tilde{\textbf{g}}^{(k+1)}=\textbf{0}\), i.e., that \(\textbf{g}^{(k+1)}\in \mathcal {V}_k\). In this case, we simply do not enlarge the solution subspace in this iteration and proceed with the computations. Unlike for Krylov methods for the solution of linear systems of equations, such as GMRES, this is not a case of a “lucky breakdown” and one generally cannot terminate the iterations.
The last point to address is how to determine \(V_0\), i.e., the basis for the initial subspace. In our computed examples, we will choose an initial approximate solution \(\textbf{x}^{(0)}\ne \textbf{0}\) and set
The computations are summarized by Algorithm 2.
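To make the structure of Algorithm 2 concrete, the following sketch carries out one projected Gauss–Newton iteration in a generalized Krylov subspace. It is an illustration based on the description above rather than the authors' implementation; in particular, the sign convention for \(J^{(k)}\), the Armijo-type acceptance test, and the choice \(\textbf{g}^{(k+1)}=J_f^T r\) for the expansion vector are assumptions consistent with the surrounding text.

```python
import numpy as np

def gks_gauss_newton_step(f, jac_f, y, V, z, c=1e-4):
    """One illustrative iteration of a projected (GKS) Gauss-Newton method.

    V : n-by-d matrix with orthonormal columns spanning the current subspace;
    z : current coordinates, so that x = V @ z.
    This sketches the ideas behind Algorithm 2; it is not the authors' code.
    """
    x = V @ z
    r = y - f(x)                                 # residual at x^(k)
    Jf = jac_f(x)                                # full Jacobian of f at x^(k)
    J = -Jf @ V                                  # Jacobian of z -> y - f(V z)
    q, *_ = np.linalg.lstsq(J, -r, rcond=None)   # small d-dimensional LS problem
    # Armijo-type backtracking (stand-in for the Armijo-Goldstein test)
    phi, slope, alpha = r @ r, 2.0 * (J.T @ r) @ q, 1.0
    while True:
        r_new = y - f(V @ (z + alpha * q))
        if r_new @ r_new <= phi + c * alpha * slope or alpha < 1e-12:
            break
        alpha *= 0.5
    z_new = z + alpha * q
    # expand the subspace: normal-equations residual, orthogonalized (assumed form)
    x_new = V @ z_new
    g = jac_f(x_new).T @ (y - f(x_new))          # candidate expansion vector g^(k+1)
    g = g - V @ (V.T @ g)                        # orthogonalize against V_k
    if np.linalg.norm(g) > 1e-14:                # skip expansion if g lies in V_k
        v_new = g / np.linalg.norm(g)
        V = np.column_stack([V, v_new])
        z_new = np.append(z_new, 0.0)            # last coordinate of new iterate is 0
    return V, z_new
```

With the initial space described above, one would start from a single-column basis `V = x0 / norm(x0)` and coordinates `z = [norm(x0)]`, and call this step repeatedly until the stopping criterion is met.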
We are now in a position to show some theoretical results.
Lemma 1
Let f be a Fréchet differentiable function. With the notation of Algorithm 2, it holds that for every k, there is a finite positive integer j such that
Proof
This result follows from [25, Proposition 4.1]. \(\square \)
Theorem 2
Let f be Fréchet differentiable and let \(\textbf{z}^{(k)}\), \(k=1,2,\ldots ~\), denote the iterates generated by Algorithm 2. Define
Then there is a vector \(\textbf{x}^*\) such that
and \(\textbf{x}^*\) is a stationary point of \(\mathcal {J}(\textbf{x})=\left\| \textbf{y}-f(\textbf{x})\right\| ^2\). If \(\mathcal {J}\) is convex, then \(\textbf{x}^*\) is a global minimizer of \(\mathcal {J}\). Moreover, if \(\mathcal {J}\) is strictly convex, then \(\textbf{x}^*\) is the unique global minimizer.
Proof
We first discuss the case when there is an index \(k_0\) such that \(\textbf{g}^{(k)}\in \mathcal {V}_{k_0}\) for all \(k>k_0\) and \(\textrm{dim}\left( \mathcal {V}_{k_0}\right) <n\). Then, after iteration \(k_0\), all subsequent iterates belong to \(\mathcal {V}_{k_0}\). However, since the Armijo-Goldstein principle is satisfied, this means that Algorithm 2 reduces to Algorithm 1 after \(k_0\) iterations and, therefore, the iterates converge to a stationary point of \(\mathcal {J}\) by Theorem 1.
On the other hand, if there is no index \(k_0\) such that \(\textbf{g}^{(k)}\in \mathcal {V}_{k_0}\) for all \(k>k_0\) and \(\textrm{dim}\left( \mathcal {V}_{k_0}\right) <n\), then there is an index \(k_1\) such that \(\textrm{dim}\left( \mathcal {V}_{k_1}\right) =n\) and, therefore, \(\mathcal {V}_{k}=\mathbb {R}^n\) for all \(k>k_1\). Convergence to a stationary point now follows from Theorem 1. \(\square \)
Remark 1
The proof of Theorem 2 would appear to suggest that one may need n iterations with Algorithm 2 to obtain an accurate approximation of the limit point. However, in problems that stem from real applications, typically only a few iterations are required to satisfy the stopping criterion. This is illustrated by numerical experiments in Sect. 4.
3.1 Secant Acceleration
In some applications the evaluation of the Jacobian \(J_f\) of f may be computationally expensive. We therefore discuss the updating strategy proposed by Broyden [6], which makes it possible to avoid computing \(J_f\) from scratch at every iteration. A nice discussion of this updating strategy is provided by Dennis and Schnabel [20, Chapter 8]. Broyden [6] proposed the following secant update of the Jacobian matrix
where \(H^{(k)}\) generally is a rank-two matrix defined by
with \(\Delta \textbf{x}^{(k)}=\textbf{x}^{(k+1)}-\textbf{x}^{(k)}\) and \(\Delta \textbf{r}^{(k)}=\textbf{r}^{(k+1)}-\textbf{r}^{(k)}\). We then can compute an approximation of \(J_f(V_{k+1}\textbf{z}^{(k+1)})V_{k+1}\) as follows
These updating formulas can be applied repeatedly, for \(k=0,1,\ldots ~\), with a suitable choice of \(H^{(0)}\). Then \(H^{(k)}\) is a low-rank matrix when k is fairly small; see [20]. The matrix \(H^{(k)}\) typically is not explicitly formed; instead, the vectors that make up \(H^{(k)}\) are stored and used when evaluating matrix–vector products with \(H^{(k)}\). When the number of updating steps increases, the quality of the approximation of \(J_f\) by \(H^{(k)}\) at the desired points \(\textbf{x}\) may deteriorate, but not by very much. Dennis and Schnabel [20, p. 176] refer to this behavior as “bounded deterioration”; see [20, Lemma 8.2.1] for bounds on the deterioration. When the approximation \(H^{(0)}\) of the Jacobian at the initial point \(\textbf{x}_{\textrm{init}}\) is sufficiently close to the Jacobian at the desired solution, \(J_f(\textbf{x}^*)\), this bounded deterioration is mild enough to ensure convergence to the solution \(\textbf{x}^*\) as k increases when the updating formula is applied repeatedly; see [20, Theorem 8.2.2] for details. Nevertheless, since it is difficult to assess whether the conditions of this theorem are satisfied, we recompute the Jacobian from scratch every \(\tilde{k}\) iterations. Algorithm 3 summarizes the computations.
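As a point of reference, the classical Broyden rank-one update has the generic form sketched below; this does not reproduce the paper's GKS-specific update of \(J_f(V_{k+1}\textbf{z}^{(k+1)})V_{k+1}\), which also accounts for the enlarged basis.

```python
import numpy as np

def broyden_update(B, dx, dr):
    """Classical Broyden rank-one secant update (generic form, illustrative).

    B  : current approximation of the Jacobian of the residual r;
    dx : x^(k+1) - x^(k);  dr : r^(k+1) - r^(k).
    Returns an updated matrix satisfying the secant equation B_new @ dx = dr.
    """
    denom = dx @ dx
    if denom == 0.0:
        return B                                  # no information in a zero step
    return B + np.outer(dr - B @ dx, dx) / denom  # rank-one correction
```

In practice one would not form the correction explicitly for large problems but store the update vectors and apply them when matrix–vector products are needed, as described in the text.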
3.2 Restarting Strategy
The main computational cost of the outlined solution method, if we ignore the computation of the Jacobian of f, is the solution of a linear system of equations of fairly small size at every iteration. However, if many iterations are carried out, these systems may become large and require a non-negligible computational effort to solve. To avoid this issue, we employ the restart strategy proposed in [11] for the Majorization–Minimization algorithm. Specifically, the solution method described in [11] restarts the generalized Krylov subspace when a fixed user-specified number of iterations has been carried out. We denote by \(k_{\textrm{rest}}>1\) the number of iterations after which we restart the GKS. Since we set \(V_0\in \mathbb {R}^{n\times 1}\), we have that, if no restarting is used, \(\textrm{dim}(\mathcal {V}_k)=k+1\). To avoid that the dimension of \(\mathcal {V}_k\) and, therefore, the size of \(J^{(k)}\), become too large, we set, if \(k\equiv 0 \,\text{ mod }\, k_{\textrm{rest}}\),
This ensures that the number of columns of \(V_k\) never exceeds \(k_{\textrm{rest}}\) and that the computational effort per iteration does not increase too much with k. For large-scale problems, this has the added benefit of reducing the memory requirements of the method. This may be important if only a small amount of fast memory is available or if the problem to be solved is very large; see [11] for a discussion.
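A minimal sketch of such a restart is given below, under the assumption (consistent with the choice of \(V_0\)) that the restarted one-dimensional space is spanned by the normalized current iterate; the paper's exact restart formula is not reproduced here.

```python
import numpy as np

def restart_subspace(V, z):
    """Illustrative restart of the GKS basis after k_rest iterations.

    Assumes, as for the initial space V_0, that the restarted one-dimensional
    space is spanned by the normalized current iterate x^(k) = V @ z.
    """
    x = V @ z                               # current iterate x^(k)
    nrm = np.linalg.norm(x)
    V_new = (x / nrm).reshape(-1, 1)        # new basis: a single column
    z_new = np.array([nrm])                 # coordinates so that V_new @ z_new = x
    return V_new, z_new
```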
Since the spaces \(\mathcal {V}_k\) are not nested when restarts are used, the proof of Theorem 2 does not carry over. We can only state that, thanks to the Armijo-Goldstein rule, it holds that
Algorithm 4 summarizes the computations.
4 Numerical Examples
In this section, we apply the algorithms described in the previous sections to three nonlinear least-squares problems: a real-world problem from applied geophysics, a partial differential equation that appears in a number of applications, such as fuel ignition models of thermal combustion theory, and a simple problem for which the Jacobian is a very sparse matrix. For each problem we compare the algorithms in terms of the accuracy of the computed solution, the number of iterations required to satisfy the stopping criterion, and the CPU time. We measure the accuracy by means of the Relative Reconstruction Error (RRE), defined as \(\textrm{RRE}(\textbf{x})=\left\| \textbf{x}-\textbf{x}_{\textrm{true}}\right\| /\left\| \textbf{x}_{\textrm{true}}\right\| \),
where \(\textbf{x}_{\textrm{true}}\) denotes the desired solution of the problem. For each algorithm we set the maximum number of iterations to \(K=100\) and the tolerance for the stopping criterion to \(\tau =10^{-5}\). In Algorithm 3 we set \(\tilde{k}=10\), i.e., we compute the Jacobian from scratch every \(\tilde{k}=10\) iterations. The updating formulas perform best if the updated Jacobian is close to the Jacobian at \(\textbf{x}_{\textrm{true}}\). We therefore do not use the updating formulas in Algorithm 3 for \(k=1,2,\ldots ,\tilde{k}\). Finally, we set in Algorithm 4 the number of iterations after which the space \(\mathcal {V}_k\) is restarted to \(k_{\textrm{rest}}=20\). All computations were carried out in MATLAB 2021b running on a laptop computer with an AMD Ryzen 7 5800HS CPU and 16GB of RAM.
A geophysics problem. We consider a nonlinear model from geophysics with the aim of reconstructing the distribution of the electrical conductivity and the magnetic permeability of the subsoil from measured data recorded with a ground conductivity meter (GCM). This device is made up of two coils, a transmitter and a receiver. The transmitter sends electromagnetic waves into the subsoil, while the receiver measures the induced electromagnetic field; see, e.g., [8, 14, 17,18,19] and references therein for more details on the mathematical model and [15, 16] for numerical results obtained.
We briefly recall how the model is defined. It is based on Maxwell’s equations. The subsoil is assumed to have a layered structure with n layers of thickness \(d_\ell \), \(\ell =1,2,\ldots ,n\). Layer \(\ell \) has electric conductivity \(\sigma _\ell \) and magnetic permeability \(\mu _\ell \). The last layer is assumed to be infinite. We define the so-called propagation constant as
where \(\lambda \) is the integration variable ranging from zero to infinity. It represents the ratio of the depth below the surface, measured in meters, to the inter-coil distance \(\rho \) (the distance between the transmitter and the receiver). The parameter \(\omega \) stands for the angular frequency, that is, \(2\pi \) times the frequency in Hertz.
The model, which describes the interaction between the GCM and the subsoil for the vertical and horizontal orientation of the device coils, is given by
where \(\varvec{\sigma }=[\sigma _1,\ldots ,\sigma _n]^T\), \(\varvec{\mu }=[\mu _1,\ldots ,\mu _n]^T\), h represents the height of the device while measuring above the ground, \(J_s\) denotes the Bessel function of the first kind of order s, and \(R_{\omega ,0}(\lambda )\) is the reflection factor defined as
Here \(N_0(\lambda )=\lambda /({\textrm{i}}\mu _0\omega )\), \(\mu _0=4\pi \cdot 10^{-7}\)H/m is the magnetic permeability of free space and \(Y_1(\lambda )\) is computed by the recursive formula
for \(\ell =n-1,n-2,\ldots ,1\), where \(Y_\ell (\lambda )\) is the surface admittance at the top of each layer. The coefficient \(N_\ell (\lambda )= u_\ell (\lambda )/({\textrm{i}}\mu _\ell \omega )\) represents the characteristic admittance of the \(\ell \)-th layer and yields the initialization of the recursion formula (4.1); that is, when \(\ell =n\), we set \(Y_n(\lambda )=N_n(\lambda )\).
For simplicity, we focus on the reconstruction of the distribution of the electrical conductivity and assume the magnetic permeability to be known. To this end, we construct three test cases with three different profiles for the electrical conductivity: a Gaussian function, a continuous but not everywhere differentiable function (which we refer to as “triangular”), and a step function. It is known that in this application the model requires the solution to be differentiable. Therefore, the last two cases are very challenging to solve. When we construct our test cases, we assume that there is no noise except for round-off errors. Thus, we commit an inverse crime. However, we remark that the Jacobian of f is so ill-conditioned that perturbations introduced by round-off errors produce significant fluctuations in the computed solutions.
We choose the initial approximation \(\textbf{x}^{(0)}\) to be a constant vector. Since the problem is non-convex and underdetermined, this choice is important. For the Gaussian and triangular profiles, we set \(\textbf{x}^{(0)}=\left[ 0.5,\,0.5,\ldots ,0.5\right] ^T\), while for the step function we set \(\textbf{x}^{(0)}=\left[ 1.5,\,1.5,\ldots ,1.5\right] ^T\). Moreover, we calibrate the instrument so that measurements are taken at 10 different heights, i.e., \(m=10\), and the subsoil is assumed to be composed of 100 layers, i.e., \(n=100\).
Table 1 reports results obtained with Algorithms 1, 2, 3, and 4. Since the problems considered are extremely ill-conditioned, the Gauss–Newton algorithm produces poor reconstructions. Therefore, similarly to the Levenberg-Marquardt scheme (see, e.g., [26] and references therein) we regularize the inversion of the Jacobian by applying Tikhonov regularization; see, e.g., [21] for a discussion and analysis of Tikhonov regularization. In particular, we solve
and refer to the algorithm so defined as “Algorithm 1 Regularized”. Here \(10^{-4}\) is the regularization parameter and \(\left\| \textbf{q}\right\| ^2\) is the regularization term.
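For clarity, the step of “Algorithm 1 Regularized” can be sketched as follows, assuming the linearized residual \(r+Jq\) of Sect. 2 and solving the Tikhonov-regularized problem through the equivalent augmented least-squares formulation; this is an illustration, not the authors' code.

```python
import numpy as np

def regularized_gn_step(J, r, mu=1e-4):
    """Tikhonov-regularized Gauss-Newton step (sketch of "Algorithm 1 Regularized").

    Solves min_q ||J q + r||^2 + mu ||q||^2 via the equivalent augmented
    least-squares problem [J; sqrt(mu) I] q = [-r; 0].
    """
    m, n = J.shape
    A = np.vstack([J, np.sqrt(mu) * np.eye(n)])   # stacked coefficient matrix
    b = np.concatenate([-r, np.zeros(n)])
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    return q
```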
We can observe that, even when regularized, the Gauss–Newton algorithm fails to provide an accurate approximation of the desired solution, and it is significantly slower than Algorithms 2, 3, and 4, both in terms of the number of iterations required and CPU time. The latter three algorithms provide satisfactory reconstructions for all three problems and are computationally inexpensive. Since very few iterations are performed, Algorithms 2 and 4 produce essentially indistinguishable results. Moreover, the use of the Broyden update reduces the computational cost with little to no impact on the accuracy of the computed solution. We point out that Algorithms 2, 3, and 4 do not require fine-tuning of any parameters to perform well, and they do not require Tikhonov regularization. The number of restart iterations \(k_{\textrm{rest}}\) can be fixed to 20 in all cases and, in our experience, the method is quite robust with respect to the choice of this parameter.
Figure 1 displays the reconstructions determined by the regularized Gauss–Newton method and Algorithms 2 and 3. We do not show the graphs determined by Algorithm 1, as these computed solutions are not meaningful approximations of the desired solutions. Moreover, we do not display the approximations computed by Algorithm 4, as they are nearly identical to the ones obtained with Algorithm 2.
Note that we do not propose to combine the secant update with the restarting of the solution subspace. The reason is twofold: first, we do not want to complicate the algorithm further by introducing multiple parameters that need to be tuned; second, in our computed examples few iterations are performed, so the possible computational advantage of this combination would be negligible. Moreover, the resulting algorithm would be highly heuristic and might exhibit erratic behavior. Therefore, at this time we do not consider combining these two acceleration techniques.
Partial differential equation. This example describes a nonlinear partial differential equation that can be solved with the Gauss–Newton method. Specifically, we consider the Bratu problem
where \(\Delta x(s,t)\) stands for the Laplacian of x, \(x_s\) denotes the partial derivative of x along the s-direction, and \(\alpha \) is a constant. We are interested in investigating how the solution changes when the parameter \(\lambda \) is varied; see, e.g., [12, 24, 29] for discussions and illustrations of these kinds of path-following problems.
We let \(\Omega = [-3,3]^2\) and discretize the problem with standard second-order finite differences on an \(n\times n\) equispaced grid. The discretized problem can be written as
where the entries of \(\textbf{x}\) represent the discretized solution arranged in lexicographical order, \(L=L_1\otimes I+I\otimes L_1\), \(D=D_1\otimes I\), where \(\otimes \) denotes the Kronecker product, and the exponential is meant element-wise. We let
For simplicity, once we compute the sampling of the function x and denote it by \(\textbf{x}_{\textrm{true}}\), we construct \(\textbf{y}\) by
i.e., we disregard approximation errors in the PDE as well as the scaling.
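The following sketch shows how such a discretization might be assembled with Kronecker products, assuming the discrete operator has the form \(f(\textbf{x})=L\textbf{x}+\alpha D\textbf{x}+\lambda \textrm{e}^{\textbf{x}}\) suggested by the description above; grid scaling, signs, and boundary handling are simplified, so this is an illustration rather than the discretization used in the experiments.

```python
import numpy as np
from scipy.sparse import identity, kron, diags

def bratu_operator(n, alpha, lam):
    """Sketch of a discretized Bratu-type operator on an n-by-n grid.

    Assumes f(x) = L x + alpha D x + lam * exp(x) with zero Dirichlet boundary
    conditions and disregards grid scaling, as the text does.
    """
    e = np.ones(n)
    L1 = diags([e[:-1], -2 * e, e[:-1]], [-1, 0, 1])   # 1D second difference
    D1 = diags([-e[:-1], e[:-1]], [-1, 1]) * 0.5       # 1D centered first difference
    I = identity(n)
    L = kron(L1, I) + kron(I, L1)                      # 2D Laplacian (BTTB)
    D = kron(D1, I)                                    # derivative along s
    def f(x):
        return L @ x + alpha * (D @ x) + lam * np.exp(x)
    def jac_f(x):
        return L + alpha * D + lam * diags(np.exp(x))  # sparse Jacobian
    return f, jac_f
```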
In our test, we let \(x(s,t)=\textrm{e}^{-10(s^2+t^2)}\) and sample x on a \(100\times 100\) equispaced grid on the square \([-3,3]^2\). Therefore, in (1.1), \(f:\mathbb {R}^{10^4}\rightarrow \mathbb {R}^{10^4}\). The Jacobian of f at the point \(\textbf{x}\) is given by
Since we impose zero Dirichlet boundary conditions, L and D are Block Toeplitz with Toeplitz Blocks (BTTB) matrices. Therefore, the matrix \(J_f\) is a Generalized Locally Toeplitz (GLT) matrix; see [1, 2, 22, 23]. The GLT theory furnishes a tool for studying the behavior of the singular values of \(J_f\). Providing a complete and precise analysis of the behavior of the singular values of \(J_f\) is outside the scope of this paper. Here we just note that the singular values of \(J_f\) can be approximated quite accurately, if \(n^2\) is large enough, by a uniform sampling over \([0,\pi ]^2\) of the modulus of the GLT symbol of \(J_f\). Let \(x:[0,1]^2\rightarrow \mathbb {R}\) denote the function of which \(\textbf{x}\) is a sampling. Then, the GLT symbol of \(J_f\) is given by
where \({\textrm{i}}=\sqrt{-1}\), \((\theta ,\phi )\in [-\pi ,\pi ]^2\), and \((s,t)\in [0,1]^2\). It is straightforward to see that, if \(\alpha \) is large and \(\lambda \) is small, then the singular values of \(J_f\) decay rapidly to 0 with increasing index and, therefore, \(J_f\) is ill-conditioned. On the other hand, if \(\lambda \) is large and \(\alpha \) is small, then the singular values decay slowly with increasing index and, therefore, \(J_f\) is well-conditioned. This is confirmed by Fig. 2, which displays the approximated singular values of \(J_f\) for two choices of \(\alpha \) and \(\lambda \), namely \((\alpha ,\lambda )=(1,10)\) and \((\alpha ,\lambda )=(10,1)\), with \(\textbf{x}=\textbf{x}_{\textrm{true}}\). We observe that in the first case the matrix \(J_f\) is very well-conditioned, while in the latter case the matrix is poorly conditioned. We can estimate the condition number of \(J_f\) in these two cases with the MATLAB function condest. In the first case, i.e., the well-conditioned one, the computed condition number is \(\kappa _2\approx 3.3\), while in the latter case, i.e., the ill-conditioned one, we obtain \(\kappa _2\approx 1.7\cdot 10^{30}\). This illustrates that the conditioning for large values of \(\alpha \) is far worse than what the GLT theory predicts.
We illustrate the performance of Algorithms 1, 2, and 4 for several choices of \(\alpha \) and \(\lambda \). In particular, we let \((\alpha ,\lambda )\in \{1,\;2,\ldots ,\;10\}^2\). Since we observed that the standard Gauss–Newton method stops after a single iteration for some choices of \(\alpha \) and \(\lambda \), we forced this method to carry out at least 5 iterations.
Table 2 reports the means, standard deviations, maximum values, and minimum values of the RRE, the number of iterations, and the CPU times required by Algorithms 1, 2, and 4. Since the computation of \(J_f\) is very cheap, we do not consider Algorithm 3 in this example. We note that, on average, Algorithms 1 and 4 yield less accurate computed solutions than Algorithm 2. However, for some parameter pairs \((\alpha ,\lambda )\) the accuracy obtained with Algorithms 1 and 4 is much higher than with Algorithm 2. Nevertheless, the approximate solutions computed by the latter algorithms are always fairly accurate, even in the worst case. On the other hand, the Gauss–Newton method determines very poor reconstructions for several choices of \(\alpha \) and \(\lambda \). Figure 3a, d, g show \(\log _{10}(\textrm{RRE})\) for each choice of the parameters. We can observe that, for a fairly large number of choices of \(\alpha \) and \(\lambda \), the Gauss–Newton method performs quite poorly in terms of accuracy, while Algorithms 2 and 4 always provide accurate reconstructions. This is due to the fact that the projection into the generalized Krylov subspace regularizes the problem without reducing the accuracy of the computed solutions.
The CPU times reported in Table 2 and in Fig. 3c, f, i illustrate that Algorithms 2 and 4 are faster in terms of CPU time than Algorithm 1, even though they usually require more iterations to converge. We can also observe that the speed-up obtained by using the restarting strategy is significant, as it reduces the average computational cost by a factor of almost 10.
We now discuss how the CPU time changes when the dimension of the problem increases. We fix \(\alpha =5\) and \(\lambda =10\) and let \(n\in \{100, 150,\ldots ,1000\}\). Therefore, when \(n=1000\), we have \(f:\mathbb {R}^{10^6}\rightarrow \mathbb {R}^{10^6}\). We run the considered algorithms and plot the CPU times in Fig. 4. We can observe that, obviously, the CPU time required to solve the problem increases with n. However, the rate of increase is substantially lower when the projection into the generalized Krylov subspace is employed. In particular, using Algorithm 4, we are able to solve a nonlinear problem with \(10^6\) unknowns in less than 10 seconds on a laptop.
An extremely sparse problem. In this example, we would like to show a very particular situation in which Algorithms 2 and 3 are not faster than Algorithm 1. However, we hasten to point out that this situation arises very seldom. Consider the nonlinear differentiable function \(f:\mathbb {R}^n\rightarrow \mathbb {R}^{n-1}\) given by
where the \(\sin \) operation is meant element-wise. We construct the problem by choosing as the exact solution \(\textbf{x}\in \mathbb {R}^n\) of the problem (1.1) a sampling of the function \(x(t)=\frac{1}{2}\sin (t)\), with \(t\in (-\pi ,\pi )\), on an equispaced grid with \(n=10^{3}\) points. The Jacobian \(J_f(\textbf{x})\) of f is bidiagonal and given by
Least-squares problems with such a matrix can be solved very inexpensively.
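To illustrate why each step is so cheap in this example, the sketch below solves a least-squares problem with a sparse bidiagonal matrix of the stated size; the diagonal entries are placeholders rather than the actual entries of \(J_f(\textbf{x})\).

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lsqr

# Illustrative only: a least-squares solve with a bidiagonal (n-1)-by-n matrix,
# showing that each Gauss-Newton step is inexpensive for this example.
n = 1000
main = np.random.rand(n - 1) + 1.0          # placeholder main diagonal
upper = np.random.rand(n - 1)               # placeholder superdiagonal
J = diags([main, upper], [0, 1], shape=(n - 1, n))
b = np.random.rand(n - 1)
q = lsqr(J, b, atol=1e-12, btol=1e-12)[0]   # sparse iterative least-squares solve
```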
Table 3 reports results obtained with Algorithms 1 and 2. We note that the latter algorithm determines a more accurate approximate solution than the former. However, since the solution of least-squares problems with the sparse matrix \(J_f\) is very inexpensive to compute and Algorithm 1 carries out fewer iterations than Algorithm 2, the overall cost of the former algorithm is slightly smaller than that of the latter. Nevertheless, Algorithm 2 reduces the cost per iteration significantly when compared with Algorithm 1, and provides more accurate reconstructions. Therefore, one may still want to use Algorithm 2 even if it is slightly more expensive.
Note that, since Algorithm 2 carried out fewer than \(k_{\textrm{rest}}\) iterations, we do not report the results for Algorithm 4 as they are identical to the ones obtained with Algorithm 2.
5 Conclusion and Extensions
This paper presents new implementations of the Gauss–Newton algorithm based on the use of generalized Krylov subspaces. The approach described can easily be extended to other nonlinear optimization methods, such as the Levenberg-Marquardt method. We have shown that Algorithm 2 determines approximate solutions that converge to a stationary point of the minimized functional. Several numerical examples show that Algorithms 2, 3, and 4 outperform the standard Gauss–Newton method. Extensions to regularized problems as well as to other iterative algorithms are presently being developed.
Data Availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
References
Barbarino, G., Garoni, C., Serra-Capizzano, S.: Block generalized locally Toeplitz sequences: theory and applications in the multidimensional case. Electron. Trans. Numer. Anal. 53, 113–216 (2020)
Barbarino, G., Garoni, C., Serra-Capizzano, S.: Block generalized locally Toeplitz sequences: theory and applications in the unidimensional case. Electron. Trans. Numer. Anal. 53, 28–112 (2020)
Björck, Å.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
Brown, P.N., Saad, Y.: Hybrid Krylov methods for nonlinear systems of equations. SIAM J. Sci. Stat. Comput. 11(3), 450–481 (1990)
Brown, P.N., Saad, Y.: Convergence theory of nonlinear Newton–Krylov algorithms. SIAM J. Optim. 4(2), 297–330 (1994)
Broyden, C.G.: A class of methods for solving nonlinear simultaneous equations. Math. Comput. 19(92), 577–593 (1965)
Buccini, A.: Fast alternating direction multipliers method by generalized Krylov subspaces. J. Sci. Comput. 90(1), 60 (2022)
Buccini, A., Díaz de Alba, P.: A variational non-linear constrained model for the inversion of FDEM data. Inverse Probl. 38, 014001 (2022)
Buccini, A., De la Cruz Cabrera, O., Koukouvinos, C., Mitrouli, M., Reichel, L.: Variable selection in saturated and supersaturated designs via minimization. Commun. Stat. Simul. Comput. (2021) (in press)
Buccini, A., Pragliola, M., Reichel, L., Sgallari, F.: A comparison of parameter choice rules for \(\ell ^p\)-\(\ell ^q\) minimization. Annali dell'Università di Ferrara 68(2), 441–463 (2022)
Buccini, A., Reichel, L.: Limited memory restarted \(\ell ^p-\ell ^q\) minimization methods using generalized Krylov subspaces. Adv. Comput. Math. 49(2), 26 (2023)
Calvetti, D., Reichel, L.: Iterative methods for large continuation problems. J. Comput. Appl. Math. 123, 217–240 (2000)
Darvishi, M.T., Shin, B.C.: High-order Newton–Krylov methods to solve systems of nonlinear equations. J. KSIAM 15, 19–30 (2011)
Deidda, G.P., Díaz de Alba, P., Rodriguez, G.: Identifying the magnetic permeability in multi-frequency EM data inversion. Electron. Trans. Numer. Anal. 47, 1–17 (2017)
Deidda, G.P., Díaz de Alba, P., Fenu, C., Lovicu, G., Rodriguez, G.: FDEMtools: a MATLAB package for FDEM data inversion. Numer. Algorithms 84(4), 1313–1327 (2020)
Deidda, G.P., Díaz de Alba, P., Pes, F., Rodriguez, G.: Forward electromagnetic induction modelling in a multilayered half-space: an open-source software tool. Remote Sens. 15(7), 1772 (2023)
Deidda, G.P., Díaz de Alba, P., Rodriguez, G., Vignoli, G.: Smooth and sparse inversion of EMI data from multi-configuration measurements. In: 2018 IEEE 4th International Forum on Research and Technology for Society and Industry (RTSI) (RTSI 2018), pp. 213–218. Palermo, Italy (2018)
Deidda, G.P., Díaz de Alba, P., Rodriguez, G., Vignoli, G.: Inversion of multiconfiguration complex EMI data with minimum gradient support regularization: a case study. Math. Geosci. 52(7), 945–970 (2020)
Deidda, G.P., Fenu, C., Rodriguez, G.: Regularized solution of a nonlinear problem in electromagnetic sounding. Inverse Probl. 30, 125014 (2014)
Dennis, J.E., Jr., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia (1996)
Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems. Kluwer, Dordrecht (1996)
Garoni, C., Serra-Capizzano, S.: Generalized Locally Toeplitz Sequences: Theory and Applications, vol. 1. Springer, Berlin (2017)
Garoni, C., Serra-Capizzano, S.: Generalized Locally Toeplitz Sequences: Theory and Applications, vol. 2. Springer, Berlin (2018)
Glowinski, R., Keller, H.B., Reinhart, L.: Continuation-conjugate gradient methods for the least squares solution of nonlinear boundary value problems. SIAM J. Sci. Stat. Comput. 6(4), 793–832 (1985)
Grippo, L., Sciandrone, M.: Globally convergent block-coordinate techniques for unconstrained optimization. Optim. Methods Softw. 10(4), 587–637 (1999)
Hanke, M.: A regularizing Levenberg–Marquardt scheme, with applications to inverse groundwater filtration problems. Inverse Probl. 13(1), 79 (1997)
Huang, G., Lanza, A., Morigi, S., Reichel, L., Sgallari, F.: Majorization-minimization generalized Krylov subspace methods for \(\ell _p\)-\(\ell _q\) optimization applied to image restoration. BIT Numer. Math. 57, 351–378 (2017)
Kan, Z., Song, N., Peng, H., Chen, B.: Extension of complex step finite difference method to Jacobian-free Newton–Krylov. J. Comput. Appl. Math. 399, Art. 113732 (2022)
Keller, H.B.: Lectures on Numerical Methods in Bifurcation Problems. Springer, Berlin (1987)
Lampe, J., Reichel, L., Voss, H.: Large-scale Tikhonov regularization via reduction by orthogonal projection. Linear Algebra Appl. 436, 2845–2865 (2012)
Lanza, A., Morigi, S., Reichel, L., Sgallari, F.: A generalized Krylov subspace method for \(\ell _p\)-\(\ell _q\) minimization. SIAM J. Sci. Comput. 37, S30–S50 (2015)
Pes, F., Rodriguez, G.: The minimal-norm Gauss–Newton method and some of its regularized variants. Electron. Trans. Numer. Anal. 53, 459–480 (2020)
Pes, F., Rodriguez, G.: A doubly relaxed minimal-norm Gauss–Newton method for underdetermined nonlinear least-squares problems. Appl. Numer. Math. 171, 233–248 (2022)
Ruhe, A.: Accelerated Gauss–Newton algorithms for nonlinear least squares problems. BIT Numer. Math. 19(3), 356–367 (1979)
Acknowledgements
The authors would like to thank Fabio Durastante for insightful discussions and the anonymous referees for their comments that helped us to greatly improve this work.
Funding
Open access funding provided by Università degli Studi di Cagliari within the CRUI-CARE Agreement. AB, PDdeA, and FP are members of the GNCS group of INdAM and are partially supported by the INdAM-GNCS 2022 Project “Metodi e modelli di regolarizzazione per problemi mal posti di grandi dimensioni” (CUP_E55F22000270001). AB and FP are partially supported by Fondazione di Sardegna, Progetto biennale bando 2021, “Computational Methods and Networks in Civil Engineering (COMANCHE)” and by INdAM-GNCS 2023 Project “Tecniche numeriche per lo studio dei problemi inversi e l’analisi delle reti complesse” (CUP_E53C22001930001). PDdeA is partially supported by the INdAM-GNCS 2023 Project “Metodi numerici per modelli descritti mediante operatori differenziali e integrali non locali” (CUP_E53C22001930001) and gratefully acknowledges Fondo Sociale Europeo REACT EU - Programma Operativo Nazionale Ricerca e Innovazione 2014-2020 and Ministero dell’Università e della Ricerca for the financial support.
Contributions
All authors contributed equally.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.