1 Introduction

Magnetic resonance imaging (MRI) is an important diagnostic and research tool in many clinical scenarios. However, its inherently slow data acquisition process is often problematic. To accelerate scanning, parallel MRI (p-MRI) and compressed sensing MRI (CS-MRI) are often employed. These methods are designed to enable fast reconstruction of high-quality, artifact-free images from minimal k-space data. Recently, deep learning approaches for MRI reconstruction [1,2,3,4,5,6,7,8,9,10] have demonstrated great promise for further accelerating MRI acquisition. However, not all of these techniques [1, 2, 5, 7, 8] are able to exploit parallel image acquisition, which is common in clinical practice.

In this paper, we investigate accelerated p-MRI reconstruction using deep learning. We propose a novel, end-to-end trainable approach for this task, which we refer to as a variable splitting network (VS-Net). VS-Net builds on a general parallel CS-MRI formulation, posed as a multi-variable energy minimization process embedded in a deep learning framework. It has three computational blocks: a denoiser block, a data consistency block and a weighted average block. The first is a denoising convolutional neural network (CNN), while the latter two have point-wise closed-form solutions. As such, VS-Net is computationally efficient yet simple to implement. VS-Net accepts complex-valued multi-channel MRI data and learns all parameters automatically during offline training. In a series of experiments, we monitor reconstruction accuracy as the number of stages in VS-Net varies, study different parameterizations of the weight parameters in VS-Net and analyze their numerical behaviour, and evaluate VS-Net on a multi-coil knee image dataset for different acceleration factors using Cartesian undersampling patterns, showing improved image quality and preservation of tissue textures.

We now point out the differences between our method and the related works in [1,2,3,4,8], and highlight our novel contributions to this area. First, the data consistency (DC) layer introduced in [1] was designed for single-coil images. MoDL [4] extended the cascade idea of [1] to a multi-coil setting; however, its DC layer was implemented by iteratively solving a linear system with the conjugate gradient method inside the network, which can be very complicated. In contrast, the DC layer in VS-Net naturally applies to multi-coil images and has a point-wise, analytical solution, making VS-Net both computationally efficient and numerically accurate. The variational network (VN) [3] and [8] are applicable to multi-coil images; however, they are based on gradient-descent optimization and proximal methods respectively, which do not impose exact DC. Comparing ADMM-Net [2] to VS-Net, the former was also applied only to single-coil images. Moreover, ADMM-Net was derived from the augmented Lagrangian method (ALM), while VS-Net uses a penalty function method, which results in a simpler network architecture. ALM introduces Lagrange multipliers to weaken the dependence on penalty weight selection, but since these weights can be learned automatically during network training, the need for the more complicated ALM-based network is not clear. In ADMM-Net and VN, the regularization term was defined via a set of explicit learnable linear filter kernels. In contrast, VS-Net treats regularization implicitly through a CNN-based denoising process. Consequently, VS-Net has the flexibility to use more advanced denoising CNN architectures while avoiding the expensive dense matrix inversion encountered in ADMM-Net. A final distinction is that the effect of different weight parameterizations is studied in VS-Net, which was not investigated previously.

2 VS-Net for Accelerated p-MRI Reconstruction

General CS p-MRI Model: Let \(m \in {\mathbb {C}^N}\) be a complex-valued MR image stacked as a column vector and \(y_i \in {\mathbb {C}^M}\) (\(M < N\)) be the under-sampled k-space data measured from the ith MR receiver coil. Recovering m from \(y_i\) is an ill-posed inverse problem. According to compressed sensing (CS) theory, one can estimate the reconstructed image m by minimizing:

$$\begin{aligned} \mathop {\min }\limits _m \left\{ {\frac{\lambda }{2}\sum \limits _{i = 1}^{{n_c}} {\Vert {\mathcal{D} \mathcal{F}{S_i}m - {y_i}} \Vert _2^2} + \mathcal{R}\left( m \right) } \right\} , \end{aligned}$$
(1)

In the data fidelity term (first term), \(n_c\) denotes the number of receiver coils, \(\mathcal{D} \in {\mathbb {R}^{M \times N}}\) is the sampling matrix that zeros out entries that have not been acquired, \(\mathcal{F} \in {\mathbb {C}^{N \times N}}\) is the Fourier transform matrix, \({S_i } \in {\mathbb {C}^{N \times N}}\) is the ith coil sensitivity, and \(\lambda \) is a model weight that balances the two terms. Note that the coil sensitivity \(S_i\) is a diagonal matrix, which can be pre-computed from the fully sampled k-space center using the ESPIRiT algorithm [11]. The second term is a general sparse regularization term, e.g. (nonlocal) total variation [12, 13], total generalized variation [12, 14] or the \(\ell_1\) penalty on the discrete wavelet transform of m [15].
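For concreteness, the multi-coil forward operator \(\mathcal{D}\mathcal{F}S_i\) and the data fidelity term in (1) can be sketched in a few lines of Python. The function names, array shapes and the orthonormal FFT convention below are illustrative assumptions, not the paper's implementation.

```python
# Minimal NumPy sketch of the forward operator and data fidelity in Eq. (1),
# assuming a 2D image, a binary Cartesian mask and pre-computed sensitivities.
import numpy as np

def forward_op(m, S, mask):
    """A_i m = D F S_i m for all coils.
    m:    (H, W) complex image
    S:    (n_c, H, W) complex coil sensitivities
    mask: (H, W) binary sampling mask (the diagonal of D^T D)
    """
    coil_images = S * m[None, ...]                    # S_i m
    kspace = np.fft.fft2(coil_images, norm="ortho")   # F S_i m
    return mask[None, ...] * kspace                   # D F S_i m (zero unsampled lines)

def data_fidelity(m, S, mask, y, lam):
    """(lambda/2) * sum_i ||D F S_i m - y_i||_2^2."""
    residual = forward_op(m, S, mask) - y
    return 0.5 * lam * np.sum(np.abs(residual) ** 2)
```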

Variable Splitting: In order to optimize (1) efficiently, we develop a variable splitting method. Specifically, we introduce the auxiliary splitting variables \(u \in {\mathbb {C}^N} \) and \(\{x_i \in {\mathbb {C}^N}\}_{i=1}^{n_c}\), converting (1) into the following equivalent form

$$\begin{aligned} \mathop {\min }\limits _{m,u,x_i} {\frac{\lambda }{2}\sum \limits _{i = 1}^{{n_c}} {\Vert {\mathcal{D} \mathcal{F}x_i - {y_i}} \Vert _2^2} + \mathcal{R}\left( u \right) } \; s.t.\;m=u, \;S_im=x_i, \;\forall i \in \left\{ {1,2,...,{n_c}} \right\} . \end{aligned}$$

The introduction of the first constraint \(m=u\) decouples m in the regularization from that in the data fidelity term so that a denoising problem can be explicitly formulated (see Eq. (3) top). The introduction of the second constraint \(S_im=x_i\) is also crucial as it allows decomposition of \(S_im\) from \(\mathcal{D} \mathcal{F}{S_i}m \) in the data fidelity term such that no dense matrix inversion is involved in subsequent calculations (see Eq. (4) middle and bottom). Using the penalty function method, we then add these constraints back into the model and minimize the single problem

$$\begin{aligned} \mathop {\min }\limits _{m,u,x_i} {\frac{\lambda }{2}\sum \limits _{i = 1}^{{n_c}} {\Vert {\mathcal{D} \mathcal{F}x_i - {y_i}} \Vert _2^2} + \mathcal{R}\left( u \right) } + \frac{\alpha }{2}\sum \limits _{i = 1}^{{n_c}} {\Vert {{x_i} - {S_i}m} \Vert _2^2} + \frac{\beta }{2}\Vert {u - m} \Vert _2^2\ , \end{aligned}$$
(2)

where \(\alpha \) and \(\beta \) are the introduced penalty weights. To minimize (2), which is a multi-variable optimization problem, one alternately optimizes over m, u and \(x_i\) by solving the following three subproblems:

$$\begin{aligned} \left\{ \begin{array}{l} {u^{k + 1}} = \mathop {\arg \min }\limits _u \frac{\beta }{2}\Vert {u - {m^k}} \Vert _2^2 + \mathcal{R}\left( u \right) \\ x_i^{k + 1} = \mathop {\arg \min }\limits _{{x_i}} \frac{\lambda }{2}\sum \nolimits _{i = 1}^{{n_c}} {\Vert {\mathcal{D} \mathcal{F}{x_i} - {y_i}} \Vert _2^2} + \frac{\alpha }{2}\sum \nolimits _{i = 1}^{{n_c}} {\Vert {{x_i} - {S_i}{m^k}} \Vert _2^2} \\ m^{k+1} = \mathop {\arg \min }\limits _m \frac{\alpha }{2}\sum \nolimits _{i = 1}^{{n_c}} {\Vert {x_i^{k + 1} - {S_i}m} \Vert _2^2} + \frac{\beta }{2}\Vert {{u^{k + 1}} - m} \Vert _2^2 \end{array} \right. . \end{aligned}$$
(3)

Here \(k \in \{1,2,...,n_{it}\}\) denotes the kth iteration. An optimal solution (\(m^*\)) may be found by iterating over \({u^{k + 1}}\), \(x_i^{k+1}\) and \({m^{k + 1}}\) until convergence is achieved or the number of iterations reaches \(n_{it}\). The solutions to these subproblems can be derived as follows

$$\begin{aligned} \left\{ \begin{array}{l} {u^{k + 1}} = denoiser({m^k})\\ x_i^{k + 1} = {\mathcal{F}^{ - 1}}( {{{( {\lambda {\mathcal{D} ^T}\mathcal{D} + \alpha I } )}^{-1}}( {\alpha \mathcal{F}{S_i}{m^k} + \lambda {\mathcal{D} ^T}{y_i}} )})\;\;\; \forall i \in \{1,2,...,{n_c}\}\\ {m^{k + 1}} = {( {\beta I + \alpha \sum \nolimits _{i = 1}^{{n_c}} {S_i^H{S_i}} } )^{-1}}({\beta {u^{k + 1}} + \alpha \sum \nolimits _{i = 1}^{{n_c}} {S_i^H} x_i^{k + 1}}) \end{array} \right. . \end{aligned}$$
(4)

Above, \(S_i^H\) is the conjugate transpose of \(S_i\) and I is the N-by-N identity matrix. \({\mathcal{D} ^T}\mathcal{D}\) is a diagonal matrix of size N by N whose diagonal entries are either zero or one, corresponding to a binary sampling mask. \(\mathcal{D}^Ty_i\) is an N-length vector representing the k-space measurements of the ith coil with the unsampled positions filled with zeros. In this step we have turned the original problem (1) into a denoising problem (denoted by denoiser) and two other equations, both of which have closed-form solutions that can be computed point-wise owing to the diagonal structure of the matrices being inverted. We also note that the middle equation efficiently imposes consistency between k-space data and image-space data coil by coil, and the bottom equation simply computes a weighted average of the results obtained from the first two equations. Next, we show that an appropriate network architecture can be derived by unfolding the iterations of (4).
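The point-wise nature of the two closed-form updates can be made explicit with a short sketch; the names, array shapes and orthonormal FFT convention are again illustrative assumptions only.

```python
# NumPy sketch of the two closed-form updates in Eq. (4). Both are point-wise
# because D^T D and sum_i S_i^H S_i act element-wise on k-space/image entries.
import numpy as np

def dcb(m, S, mask, y0, lam, alpha):
    """x_i^{k+1} = F^{-1}[(lam*D^T D + alpha*I)^{-1}(alpha*F S_i m + lam*D^T y_i)]."""
    k = np.fft.fft2(S * m[None, ...], norm="ortho")   # F S_i m^k, per coil
    num = alpha * k + lam * y0                        # y0 = D^T y_i (zero-filled k-space)
    den = lam * mask[None, ...] + alpha               # diagonal of lam*D^T D + alpha*I
    return np.fft.ifft2(num / den, norm="ortho")

def wab(u, x, S, alpha, beta):
    """m^{k+1} = (beta*I + alpha*sum_i S_i^H S_i)^{-1}(beta*u + alpha*sum_i S_i^H x_i)."""
    num = beta * u + alpha * np.sum(np.conj(S) * x, axis=0)
    den = beta + alpha * np.sum(np.abs(S) ** 2, axis=0)
    return num / den
```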

Fig. 1. Overall architecture of the proposed variable splitting network (VS-Net).

Fig. 2. Detailed structure of each block in VS-Net. DB, DCB and WAB stand for Denoiser Block, Data Consistency Block and Weighted Average Block, respectively.

Network Architecture: We propose a deep cascade network that naturally integrates the iterative procedure in Eq. (4). Figure 1 depicts the network architecture. Specifically, one iteration of the iterative reconstruction corresponds to one stage in the network. Each stage contains three blocks: a denoiser block (DB), a data consistency block (DCB) and a weighted average block (WAB), which respectively correspond to the three equations in (4). The network takes four inputs: (1) the sensitivity-weighted undersampled image, computed as \(\sum _i^{n_c} S_i^H \mathcal{F}^{-1}\mathcal{D}^T y_i\); (2) the pre-computed coil sensitivity maps \(\{S_i\}_{i=1}^{n_c}\); (3) the binary sampling mask \({\mathcal{D} ^T}\mathcal{D}\); (4) the undersampled k-space data \(\{\mathcal{D}^Ty_i\}_{i=1}^{n_c}\). Note that the sensitivity-weighted undersampled image is only used once, as the input to DB and DCB in Stage 1. In contrast, \(\{\mathcal{D}^Ty_i\}_{i=1}^{n_c}\), \(\{S_i\}_{i=1}^{n_c}\) and the mask are required by WAB and DCB at every stage (see Figs. 1 and 2). As the network is guided by the iterative process resulting from the variable splitting method, we refer to it as a Variable Splitting Network (VS-Net).
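The sensitivity-weighted undersampled image that initializes the network, \(\sum _i^{n_c} S_i^H \mathcal{F}^{-1}\mathcal{D}^T y_i\), can be computed as in the short sketch below (illustrative names and shapes, not the released code).

```python
# Coil-combined zero-filled reconstruction used as the Stage-1 input.
import numpy as np

def initial_image(y0, S):
    """y0: (n_c, H, W) zero-filled k-space D^T y_i; S: (n_c, H, W) sensitivities."""
    coil_images = np.fft.ifft2(y0, norm="ortho")      # F^{-1} D^T y_i per coil
    return np.sum(np.conj(S) * coil_images, axis=0)   # combine coils with S_i^H
```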

In Fig. 2, we illustrate the detailed structure of the key building blocks (DB, DCB and WAB) at Stage k of VS-Net. In DB, we denoise a complex-valued image with a convolutional neural network (CNN). To handle complex values, we stack the real and imaginary parts of the undersampled input into a real-valued two-channel image. ReLUs add nonlinearities to the denoiser, increasing its denoising capability. Note that while we use a simple CNN here, our setup allows more advanced denoising CNN architectures to be incorporated. In DCB, \(m^k\) from the upstream block, \(\{S_i\}_{i=1}^{n_c}\), \(\{k_i\}_{i=1}^{n_c}\) (i.e. the undersampled k-space data of all coils) and the mask are taken as inputs, passed through the middle equation of (4), and \(\{x_i^{k+1}\}_{i=1}^{n_c}\) is output. The outputs \(u^{k+1}\) and \(\{x_i^{k+1}\}_{i=1}^{n_c}\) from DB and DCB, together with the coil sensitivity maps, are fed to WAB, producing \(m^{k+1}\), which is then used as the input to DB and DCB in the next stage of VS-Net. Due to the existence of analytical solutions, no iteration is required in WAB and DCB. Further, all computational operations in WAB and DCB are point-wise. These features make the calculations in the two blocks simple and efficient. The process proceeds in the same manner until Stage \(n_{it}\) is reached.
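A single VS-Net stage (DB followed by DCB and WAB) could be sketched in PyTorch roughly as follows. The denoiser depth, feature width, tensor layout and class names are placeholder assumptions for illustration, not the authors' implementation.

```python
# Hedged PyTorch sketch of one VS-Net stage: DB (learned denoiser), then the
# point-wise DCB and WAB updates of Eq. (4), all differentiable end-to-end.
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    def __init__(self, features=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(features, 2, 3, padding=1),
        )

    def forward(self, m):                                  # m: (B, H, W) complex
        x = torch.view_as_real(m).permute(0, 3, 1, 2)      # real/imag as 2 channels
        x = x + self.net(x)                                # residual denoising
        return torch.view_as_complex(x.permute(0, 2, 3, 1).contiguous())

class VSNetStage(nn.Module):
    def __init__(self):
        super().__init__()
        self.db = DenoiserBlock()
        # lambda, alpha, beta as learnable scalars (per-stage weights)
        self.lam = nn.Parameter(torch.tensor(1.0))
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, m, S, mask, y0):
        # DB: u^{k+1} = denoiser(m^k)
        u = self.db(m)
        # DCB: point-wise data consistency per coil (middle equation of (4))
        k = torch.fft.fft2(S * m.unsqueeze(1), norm="ortho")
        x = torch.fft.ifft2((self.alpha * k + self.lam * y0) /
                            (self.lam * mask.unsqueeze(1) + self.alpha), norm="ortho")
        # WAB: weighted average (bottom equation of (4))
        num = self.beta * u + self.alpha * torch.sum(torch.conj(S) * x, dim=1)
        den = self.beta + self.alpha * torch.sum(torch.abs(S) ** 2, dim=1)
        return num / den
```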

Network Loss and Parameterizations: Training the proposed VS-Net is itself an optimization process, for which a loss function must be explicitly formulated. In MR reconstruction, the loss function typically measures the similarity between the reconstructed image and a clean, artifact-free reference image. A common choice, and the one used in this work, is the mean squared error (MSE), given by

$$\begin{aligned} \mathcal{L}({\mathbf {\Theta }}) = \frac{1}{{2{n_i}}}\sum \limits _{i = 1}^{{n_i}} \Vert {{m_i^{{n_{it}}}}( {\mathbf {\Theta }} ) - {g_i}} \Vert _2^2, \end{aligned}$$
(5)

where \(n_i\) is the number of training images and \(g_i\) is the reference image, i.e. a sensitivity-weighted fully sampled image computed as \(\sum _j^{n_c} S_j^H \mathcal{F}^{-1}f_j\), where \(f_j\) represents the fully sampled raw k-space data of the jth coil. \({\mathbf {\Theta }}\) denotes the network parameters to be learned. In this work we study two parameterizations of \({\mathbf {\Theta }}\), namely \({\mathbf {\Theta ^1}} = \left\{ {\{{\mathbf{{{W}}}^l}\}_{l = 1}^{{n_{it}}}, \lambda , \alpha , \beta } \right\} \) and \({\mathbf {\Theta ^2}} = \left\{ {\{ {\mathbf{{{W}}}^l, \lambda ^l, \alpha ^l, \beta ^l}\}_{l = 1}^{{n_{it}}}} \right\} \). Here \(\{ {\mathbf{{{W}}}^l}\}_{l = 1}^{{n_{it}}}\) are the learnable parameters of all (\(n_{it}\)) CNN denoising blocks in VS-Net. In both cases the data fidelity weight \(\lambda \) and the penalty weights \(\alpha \) and \(\beta \) are also learnable: for \({\mathbf {\Theta ^1}}\) the weights \(\lambda \), \(\alpha \) and \(\beta \) are shared by all WABs and DCBs across VS-Net, whereas for \({\mathbf {\Theta ^2}}\) each WAB and DCB has its own learnable weights. Since all blocks are differentiable, backpropagation (BP) can be effectively employed to minimize the loss with respect to the network parameters \({\mathbf {\Theta }}\) in an end-to-end fashion.
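The two parameterizations amount to either tying or duplicating the scalar weights across stages. A hedged sketch, reusing the illustrative VSNetStage module from above (not the authors' code), is:

```python
# Theta^1: one shared (lambda, alpha, beta) for all stages.
# Theta^2: each stage keeps its own copy (the default below).
import torch
import torch.nn as nn

class VSNet(nn.Module):
    def __init__(self, n_it=10, share_weights=False):
        super().__init__()
        self.stages = nn.ModuleList(VSNetStage() for _ in range(n_it))
        if share_weights:  # tie the scalar weights to those of the first stage
            for s in self.stages[1:]:
                s.lam, s.alpha, s.beta = (self.stages[0].lam,
                                          self.stages[0].alpha,
                                          self.stages[0].beta)

    def forward(self, m0, S, mask, y0):
        m = m0
        for stage in self.stages:
            m = stage(m, S, mask, y0)
        return m

def mse_loss(m_rec, g):
    """Pixel-averaged MSE on complex images (a rescaled form of Eq. (5))."""
    return 0.5 * torch.mean(torch.abs(m_rec - g) ** 2)
```

With share_weights=True the optimizer sees a single copy of \(\lambda \), \(\alpha \) and \(\beta \) (the \({\mathbf {\Theta ^1}}\) case); otherwise each stage optimizes its own (the \({\mathbf {\Theta ^2}}\) case).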

3 Experimental Results

Datasets and Training Details: We used a publicly available clinical knee dataset from [3], which includes the following five image acquisition protocols: coronal proton-density (PD), coronal fat-saturated PD, axial fat-saturated T\(_2\), sagittal fat-saturated T\(_2\) and sagittal PD. For each acquisition protocol, the same 20 subjects were scanned using an off-the-shelf 15-element knee coil. The scan of each subject covers approximately 40 slices and each slice has 15 channels (\(n_c=15\)). Coil sensitivity maps provided with the dataset were precomputed from a 24 \(\times \) 24 data block at the center of the fully sampled k-space using BART [16]. For training, we retrospectively undersampled the original k-space data for 4-fold and 6-fold acceleration factors (AF) using Cartesian undersampling, sampling 24 lines in the central region. For each acquisition protocol, we split the subjects into training and testing subsets (each with a sample size of 10) and trained VS-Net to reconstruct each slice in a 2D fashion. The network parameters were optimized for 200 epochs using Adam with learning rate \(10^{-3}\) and batch size 1. We used PSNR and SSIM as quantitative performance metrics.
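A minimal training-loop sketch consistent with the stated settings (Adam, learning rate \(10^{-3}\), batch size 1, 200 epochs) might look as follows; the data loader, tensor layout and the mse_loss helper from the previous sketch are placeholder assumptions rather than the actual training code.

```python
# Hedged training-loop sketch: one 2D slice per batch, MSE loss, Adam optimizer.
import torch

def train(model, loader, epochs=200, lr=1e-3, device="cuda"):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for m0, S, mask, y0, g in loader:   # zero-filled image, maps, mask, k-space, reference
            m0, S, mask, y0, g = (t.to(device) for t in (m0, S, mask, y0, g))
            m_rec = model(m0, S, mask, y0)
            loss = mse_loss(m_rec, g)
            opt.zero_grad()
            loss.backward()
            opt.step()
```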

Fig. 3. Quantitative measures versus number of epochs at training (first two columns) and testing (last two columns). The 1st row shows network performance for different stage numbers. The 2nd row shows network performance for different parameterizations of \({\mathbf {\Theta }}\) in the loss (5).

Fig. 4. Visual comparison of results obtained by different methods for Cartesian undersampling with AF 4 (top) and 6 (bottom). From left to right: zero-filling, \(\ell_1\)-SPIRiT, VN, VS-Net, and ground truth. See the supplementary material for more visual comparisons.

Parameter Behaviour: To show the impact of the stage number \(n_{it}\) (see Fig. 1), we first experimented on the subjects scanned under the coronal PD protocol with 4-fold AF. We set \(n_{it} = \{1,3,5,7,10,15,20\}\) and plotted the training and testing quantitative curves versus the number of epochs in the upper portion of Fig. 3. As the plots show, increasing the stage number improves network performance. This is expected for two reasons: (1) as the number of parameters increases, so does the network's learning capability; (2) the embedded variable splitting minimization is an iterative process, for which sufficient iterations (stages) are required to converge to an ideal solution. We also found that: (i) the performance difference between \(n_{it} = 15\) and \(n_{it} = 20\) is negligible, as the network gradually converges after \(n_{it} = 15\); (ii) there is no overfitting during network training despite the use of a relatively small training set. Second, we examined the network performance under the two parameterizations \({\mathbf {\Theta ^1}} = \left\{ {\{ {\mathbf{{{W}}}^l}\}_{l = 1}^{{n_{it}}},\lambda , \alpha , \beta } \right\} \) and \({\mathbf {\Theta ^2}} = \left\{ {\{ {\mathbf{{{W}}}^l, \lambda ^l, \alpha ^l, \beta ^l}\}_{l = 1}^{{n_{it}}} } \right\} \). For a fair comparison, we used the same initialization for both parameterizations and experimented with \(n_{it} = \{5,10\}\). As shown in the bottom portion of Fig. 3, in both cases the network with \({\mathbf {\Theta ^1}}\) performs slightly worse than the one with \({\mathbf {\Theta ^2}}\). In penalty function methods, a penalty weight is usually shared (fixed) across iterations. However, our experiments indicate improved performance when the model weights (\(\lambda \), \(\alpha \) and \(\beta \)) are non-shared, i.e. adaptive at each stage of the network.

Table 1. Quantitative results obtained by different methods on the test set including \(\sim \)2000 image slices across 5 acquisition protocols. Each metric was calculated on \(\sim \)400 image slices, and mean ± standard deviation are reported.

Numerical Comparison: We compared VS-Net with the iterative \(\ell_1\)-SPIRiT [17] and the variational network (VN) [3], using the zero-filling reconstruction as a reference. For VS-Net, we used \(n_{it}=10\) and \(\mathbf {\Theta }^2\), although the network's performance could be boosted further with a larger \(n_{it}\). For VN, we trained with mostly default hyper-parameters from [3], except for the batch size, which (with the original image size) was set to 1 to fit GPU memory. For both VS-Net and VN, we trained a separate model for each protocol, resulting in a total of 20 models for the 4-fold and 6-fold AFs. In Table 1, we summarize the quantitative results obtained by these methods. As is evident, the learning-based methods VS-Net and VN outperformed the iterative \(\ell_1\)-SPIRiT. VN produced SSIMs comparable to those of VS-Net in some scenarios; its PSNRs were however lower than those of VS-Net for all acquisition protocols and AFs, indicating the superior numerical performance of VS-Net. In Fig. 4, we present a visual comparison on a coronal PD image reconstruction for both AFs. Apart from zero-filling, all methods removed aliasing artifacts successfully. Among \(\ell_1\)-SPIRiT, VN and VS-Net, VS-Net recovered more small textural details and thus achieved the best visual results relative to the ground truth. The quantitative metrics in Fig. 4 further confirm that VS-Net performs best.

4 Conclusion

In this paper, we proposed the variable splitting network (VS-Net) for accelerated reconstruction of parallel MR images. We detailed how to formulate VS-Net as an iterative energy minimization process embedded in a deep learning framework, where each stage essentially corresponds to one iteration of an iterative reconstruction. In experiments, we showed that the performance of VS-Net gradually plateaus as the number of stages increases, and that making the weights in each stage learnable improves the quantitative results. Further, we evaluated VS-Net on a multi-coil knee image dataset for 4-fold and 6-fold acceleration factors under Cartesian undersampling and showed its superiority over two state-of-the-art methods.