
1 Introduction

Radiofrequency ablation (RFA) is a minimally invasive therapy for liver cancer. In RFA, doctors often use ultrasound image fusion during or after the operation to assess the treatment effect. Deformable image registration (DIR) is a key technology for image fusion, since it can deal with tissue deformation and body movement. In DIR, a dense, non-linear correspondence is estimated between a pair of 2D or 3D images. Most registration methods [1,2,3,4,5] solve an optimization problem that aligns voxels with similar appearance. However, this requires computing image similarity at every optimization iteration, which is computationally intensive and extremely slow in practice.

Several recent works have proposed machine-learning-based methods that learn a transformation function to replace the iterative optimization in deformable registration. Most of these [6,7,8,9] rely on ground-truth or synthesized displacement fields, which are difficult to obtain in medical imaging applications. Recent works [10, 11, 17, 18] presented weakly supervised or unsupervised methods based on a convolutional neural network (CNN) and a spatial transformer network (STN) [12]. To improve registration accuracy and robustness, generative adversarial networks (GANs) [13, 14] were adopted in [19,20,21]. However, the work in [20] requires initial registration ground-truth, and the methods in [19, 21] only work on training data of 3D ROIs or 2D synthesized slices.

In this work, we propose a framework of adversarial learning for deformable image registration (AL-DIR). The AL-DIR model runs deformable registration in a single pass and can be trained without ground-truth spatial transformations. AL-DIR consists of three networks: a CNN-based registration network (generator) whose loss combines similarity metrics of image intensity and vessel masks; a discrimination network (discriminator) that distinguishes between the registered image and the fixed image; and an autoencoder that measures the anatomical shape difference before and after an iteration of registration. The main contributions of this work are as follows:

  • We propose an end-to-end registration network that predicts the 3D displacement field of DIR without ground-truth spatial transformations. The single-pass prediction leverages information from image intensity and anatomical shape features (such as vessels and organs) and requires only the image pair as input.

  • We present an adversarial learning framework to train the registration network. The discrimination network guides the registration toward more accurate and realistic deformation. Unlike most GAN-based supervised methods, our approach requires only vessel masks for training, so it is weakly supervised.

  • We incorporate the encoder part of an autoencoder to extract anatomical shape differences for better convergence of the deformation.

2 Methodology

Image registration aims to find a spatial transformation between a moving image \( \varvec{M}\left( \varvec{x} \right) \) and a fixed image \( \varvec{F}\left( \varvec{x} \right) \). Here, \( \varvec{x} \) denotes the coordinates of image voxels. Image registration can be cast as an optimization problem that minimizes a cost function:

$$ C(\varvec{\mu}) = -S\big(\varvec{F}(\varvec{x}),\, \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))\big) + \lambda\, C_{\text{r}}(\varvec{\mu}), $$
(1)

where \( \varvec{g} \) is the transformation function, \( \varvec{\mu} \) is the displacement field, \( S \) is the similarity measure between \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu})) \), \( C_{\text{r}} \) is a regularization term encouraging smooth deformation, and \( \lambda \) is a weighting factor.
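
For concreteness, the warping \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu})) \) with \( \varvec{g}(\varvec{x};\varvec{\mu}) = \varvec{x} + \varvec{\mu}(\varvec{x}) \) can be sketched as below. This is an illustrative NumPy/SciPy snippet, not our implementation; the helper name `warp` is ours.

```python
# Minimal sketch of applying g(x; mu): each voxel of the moving image is
# sampled at its displaced coordinate, with linear interpolation.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(moving, mu):
    # moving: (N1, N2, N3) image; mu: (N1, N2, N3, 3) displacement field.
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in moving.shape],
                                indexing="ij"), axis=-1)
    coords = (grid + mu).transpose(3, 0, 1, 2)       # g(x; mu) = x + mu(x)
    return map_coordinates(moving, coords, order=1)  # linear interpolation
```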

In this work, we present an adversarial learning framework to predict the displacement field. As shown in Fig. 1, the proposed framework consists of three networks: a registration network, a discrimination network and an autoencoder network. After training, the registration network requires only the image pair as input, as shown in Fig. 2.

Fig. 1. The proposed adversarial learning framework of DIR.

Fig. 2. Registration procedure.

2.1 Registration Network

In the registration network, we adopt a CNN architecture similar to U-Net [15]. During training, a pair of images to align, \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{x}) \), are concatenated and input to the CNN. The displacement of each voxel in \( \varvec{M}(\varvec{x}) \), i.e. the displacement field \( \varvec{\mu} \), is output as the prediction. Since it is difficult to obtain the ground truth of \( \varvec{\mu} \), we use similarity metrics of image intensity and vessel region as the loss function of the registration network: an image similarity metric that penalizes appearance differences between \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu})) \), and an anatomical region (in our work, vessel region) correspondence term that guarantees deformation accuracy on important tissues and areas. We adopt the local cross-correlation (CC) of \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu})) \) as the image similarity metric:

$$ \text{CC}\big(\varvec{F}(\varvec{x}), \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))\big) = \sum_{\varvec{x}} \frac{\left( \sum_{\varvec{y} \in \varvec{N}(\varvec{x})} \hat{\varvec{F}}(\varvec{y})\, \hat{\varvec{M}}(\varvec{g}(\varvec{y};\varvec{\mu})) \right)^{2}}{\left( \sum_{\varvec{y} \in \varvec{N}(\varvec{x})} \hat{\varvec{F}}(\varvec{y})^{2} \right) \left( \sum_{\varvec{y} \in \varvec{N}(\varvec{x})} \hat{\varvec{M}}(\varvec{g}(\varvec{y};\varvec{\mu}))^{2} \right) + \epsilon}, $$
(2)

where \( \hat{\varvec{F}}(\varvec{y}) \) and \( \hat{\varvec{M}}(\varvec{g}(\varvec{y};\varvec{\mu})) \) denote the images with local mean intensities subtracted. The local mean is calculated over a local volume \( \varvec{N}(\varvec{x}) \) around each voxel \( \varvec{y} \), and \( \epsilon \) is a small constant to avoid numerical issues. The size of the local volume \( \varvec{N}(\varvec{x}) \) is set to \( 11 \times 11 \times 11 \) experimentally.
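
A minimal NumPy sketch of Eq. (2) is given below; `local_cc` is an illustrative helper rather than our released code, while the \( 11^{3} \) window and \( \epsilon \) follow the text.

```python
# Local cross-correlation of Eq. (2) over an 11x11x11 window.
import numpy as np
from scipy.ndimage import uniform_filter

def local_cc(F, M_warped, win=11, eps=1e-5):
    # Subtract the local mean over the window N(x) around each voxel.
    F_hat = F - uniform_filter(F, size=win)
    M_hat = M_warped - uniform_filter(M_warped, size=win)
    # uniform_filter returns local means; scale by the window volume
    # to obtain the local sums of Eq. (2).
    vol = win ** 3
    cross = uniform_filter(F_hat * M_hat, size=win) * vol
    F_var = uniform_filter(F_hat * F_hat, size=win) * vol
    M_var = uniform_filter(M_hat * M_hat, size=win) * vol
    # Squared local covariance normalized by the local variances.
    cc = (cross ** 2) / (F_var * M_var + eps)
    return cc.sum()
```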

We also calculate the similarity of anatomical region correspondence to guarantee accuracy on clinically important regions. In this work, we measure the \( l_{1} \) distance between the liver vessel masks \( \varvec{F}(\varvec{x})_{\text{m}} \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \). We adopt the gradient of the displacement field as the regularization term \( C_{\text{r}}(\varvec{\mu}) \). Therefore, the training of the registration network requires corresponding vessel regions and is weakly supervised. The registration network loss is calculated as follows:

$$ \mathcal{L}_{\text{reg}} = -\text{CC}\big(\varvec{F}(\varvec{x}), \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))\big) + \lambda_{\text{m}} \left\| \varvec{F}(\varvec{x})_{\text{m}} - \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \right\|_{1} + \lambda_{\text{r}}\, C_{\text{r}}(\varvec{\mu}). $$
(3)
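
The sketch below assembles Eq. (3), reusing `local_cc` from the sketch above; the default values of \( \lambda_{\text{m}} \) and \( \lambda_{\text{r}} \) are placeholders, not tuned weights.

```python
# Registration loss of Eq. (3): image term + vessel-mask l1 term +
# smoothness regularizer on the displacement field.
import numpy as np

def grad_regularizer(mu):
    # Finite-difference spatial gradients of the displacement field.
    gx, gy, gz = np.gradient(mu, axis=(0, 1, 2))
    return (gx ** 2).mean() + (gy ** 2).mean() + (gz ** 2).mean()

def registration_loss(F, M_warped, F_mask, M_mask_warped, mu,
                      lambda_m=1.0, lambda_r=1.0):
    sim = -local_cc(F, M_warped)                      # image term, Eq. (2)
    mask_l1 = np.abs(F_mask - M_mask_warped).sum()    # vessel l1 term
    return sim + lambda_m * mask_l1 + lambda_r * grad_regularizer(mu)
```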

The network consists of an encoder-decoder with skip connections that estimates \( \varvec{\mu} \) given \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{x}) \). Figure 3 shows the network architecture. The input is formed by concatenating \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{x}) \) into a two-channel 3D image of size \( N_{1} \times N_{2} \times N_{3} \times 2 \). We apply 3D convolution layers, each followed by ReLU activation, batch normalization and dropout. The convolution kernel size is fixed to \( 3 \times 3 \times 3 \). The network output \( \varvec{\mu} \) is of size \( N_{1} \times N_{2} \times N_{3} \times 3 \); the three channels represent the voxel displacement at each coordinate \( \varvec{x} \) in \( \varvec{M}(\varvec{x}) \).
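
A schematic Keras sketch of this encoder-decoder follows; layer counts and channel widths are illustrative assumptions rather than the exact configuration of Fig. 3.

```python
# U-Net-style registration network: two-channel 3D input, three-channel
# displacement-field output, 3x3x3 convolutions with ReLU/BN/dropout.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv3D(filters, 3, padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(0.2)(x)

def build_registration_net(shape=(128, 112, 96)):
    # Fixed and moving images concatenated into a two-channel input.
    inp = layers.Input(shape=shape + (2,))
    # Encoder with progressively downsampled feature maps.
    e1 = conv_block(inp, 16)
    e2 = conv_block(layers.MaxPooling3D(2)(e1), 32)
    e3 = conv_block(layers.MaxPooling3D(2)(e2), 64)
    # Decoder with skip connections back to the encoder.
    d2 = conv_block(layers.Concatenate()([layers.UpSampling3D(2)(e3), e2]), 32)
    d1 = conv_block(layers.Concatenate()([layers.UpSampling3D(2)(d2), e1]), 16)
    # Three output channels: per-voxel displacement (dx, dy, dz).
    mu = layers.Conv3D(3, 3, padding="same", name="displacement")(d1)
    return tf.keras.Model(inp, mu)
```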

Fig. 3. Architecture of the registration network.

2.2 Discrimination Network

To better guide the training of the registration network, we propose a discrimination network and train it simultaneously with the registration network. As shown in Fig. 1, the discrimination network also takes a two-channel 3D image as input. The first channel is the fixed image \( \varvec{F}(\varvec{x}) \) in both the "real" and the "fake" case. The second channel is the vessel-segmented region \( \varvec{F}(\varvec{x})_{\text{ves}} = \varvec{F}(\varvec{x}) \cdot \varvec{F}(\varvec{x})_{\text{m}} \) in the "real" case and the corresponding region of the deformed image \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{ves}} = \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu})) \cdot \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \) in the "fake" case. The loss of the discrimination network is defined as a binary cross-entropy metric:

$$ \mathcal{L}_{\text{adv}} = E\big[ -\log D\big(\varvec{F}(\varvec{x}), \varvec{F}(\varvec{x})_{\text{ves}}\big) \big] + E\big[ -\log\big(1 - D\big(\varvec{F}(\varvec{x}), \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{ves}}\big)\big) \big]. $$
(4)

The discrimination network is optimized to assign high scores to perfect registration (by minimizing \( -\log D(\varvec{F}(\varvec{x}), \varvec{F}(\varvec{x})_{\text{ves}}) \)) and low scores to inaccurate registration (by minimizing \( -\log(1 - D(\varvec{F}(\varvec{x}), \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{ves}})) \)). In this way, the training of the discrimination network does not require any ground truth of spatial transformation, which makes our approach easier to apply in practical medical imaging applications.
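
The construction of the "real" and "fake" inputs and the loss of Eq. (4) can be sketched as below; `D` stands for the discrimination network's forward pass, and all helper names are ours.

```python
# Two-channel discriminator inputs and the binary cross-entropy of Eq. (4).
import numpy as np

def discriminator_inputs(F, F_mask, M_warped, M_mask_warped):
    # "Real" pair: fixed image plus its vessel-segmented region.
    real = np.stack([F, F * F_mask], axis=-1)
    # "Fake" pair: fixed image plus the deformed image's vessel region.
    fake = np.stack([F, M_warped * M_mask_warped], axis=-1)
    return real, fake

def adversarial_loss(D, real, fake, eps=1e-7):
    # Eq. (4): -log D(real) - log(1 - D(fake)), where D outputs the
    # probability that its input is the "real" case.
    return (-np.log(D(real) + eps)
            - np.log(1.0 - D(fake) + eps)).mean()
```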

The architecture of the discrimination network is shown in Fig. 4(a). To achieve faster and better training convergence, the first half of the discrimination network adopts the same configuration as the encoder part of the registration network (without skip connections) and shares its weights during training. The remaining part of the discrimination network consists of 3D convolution layers and fully connected layers. The convolution kernel size is fixed to \( 3 \times 3 \times 3 \). The output is the classification result, "real" or "fake". The loss \( \mathcal{L}_{\text{adv}} \) is used not only to fit the discrimination network but also fed back to the registration network.

Fig. 4. Architectures of the discrimination network and the autoencoder network.

2.3 Autoencoder of Anatomical Shape

The deformation in image fusion tasks is required to be smooth and realistic, especially on important tissues and regions in medical imaging applications. We therefore incorporate a loss term derived from an autoencoder network trained on the liver vessel masks of fixed images \( \varvec{F}(\varvec{x})_{\text{m}} \). The encoder part reduces the vessel mask to low-resolution features \( Enc(\varvec{F}(\varvec{x})_{\text{m}}) \) in a non-linear manner [16], and the decoder part reconstructs the original vessel mask \( \varvec{F}(\varvec{x})_{\text{m}} \) from \( Enc(\varvec{F}(\varvec{x})_{\text{m}}) \). In this work, the autoencoder network is pre-trained, and the encoder part is leveraged to extract anatomical shape features of \( \varvec{F}(\varvec{x})_{\text{m}} \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \) in each iteration of registration training. The architecture of the autoencoder network is shown in Fig. 4(b). The loss retrieved from the encoder is defined as the \( l_{2} \) distance between the features of \( \varvec{F}(\varvec{x})_{\text{m}} \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \):

$$ \mathcal{L}_{\text{enc}} = \left\| Enc\big(\varvec{F}(\varvec{x})_{\text{m}}\big) - Enc\big(\varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}}\big) \right\|_{2}. $$
(5)
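
A sketch of Eq. (5), treating the pre-trained encoder as a frozen feature extractor; `encoder` is a placeholder for its forward pass.

```python
# l2 distance between encoder features of the two vessel masks, Eq. (5).
import numpy as np

def encoder_loss(encoder, F_mask, M_mask_warped):
    f_feat = encoder(F_mask)           # Enc(F(x)_m)
    m_feat = encoder(M_mask_warped)    # Enc(M(g(x; mu))_m)
    return np.linalg.norm(f_feat - m_feat)   # l2 norm over all features
```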

2.4 Adversarial Learning

In this work, the registration network, the discrimination network and the pre-trained encoder are combined into an adversarial learning framework. The registration network and discrimination network are trained simultaneously. The loss of the combined training procedure of the proposed AL-DIR is defined as follows:

$$ \mathcal{L}_{\text{AL-DIR}} = \mathcal{L}_{\text{reg}} + \lambda_{\text{adv}}\, \mathcal{L}_{\text{adv}} + \lambda_{\text{enc}}\, \mathcal{L}_{\text{enc}}, $$
(6)

where \( \lambda_{\text{adv}} = 1.0 \) and \( \lambda_{\text{enc}} = 0.25 \) are set experimentally. By minimizing \( \mathcal{L}_{\text{AL-DIR}} \), an accurate, smooth and realistic deformation of \( \varvec{M}(\varvec{x}) \) can be obtained.
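
In code, Eq. (6) is simply a weighted sum of the three losses; a trivial sketch with the reported weights:

```python
# Combined AL-DIR loss of Eq. (6), minimized by the registration network;
# l_reg, l_adv and l_enc are the losses of Eqs. (3)-(5).
def al_dir_loss(l_reg, l_adv, l_enc, lambda_adv=1.0, lambda_enc=0.25):
    return l_reg + lambda_adv * l_adv + lambda_enc * l_enc
```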

3 Experiments and Results

3.1 Materials and Training/Evaluation Details

We use clinical image data acquired during liver RFA surgery to evaluate the proposed method. In total, 510 image pairs from 98 patients are used, and 3-fold cross-validation is performed in the experiments. All images are resampled to a size of \( 128 \times 112 \times 96 \) with a resolution of \( 1.0 \times 1.0 \times 1.0 \) mm\(^{3}\). Corresponding masks of liver vessels are annotated for training. The study was approved by the ethics committee of Hitachi group headquarters.

The proposed method is implemented in Keras with a TensorFlow™ backend. All experiments are run on an NVIDIA GTX™ 1080 Ti GPU with 11 GB of memory. In the training stage, we use the Adam optimizer with a learning rate of \( 10^{-4} \). We set the batch size to 1 to reduce GPU memory usage. First, we train the autoencoder network on the vessel masks of fixed images for 20,000 iterations. Then we train the registration network and discrimination network with the resulting encoder for 40,000 iterations.
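
The two-stage schedule can be summarized schematically as below; the data-loading and per-step helpers (`sample_fixed_mask`, `sample_image_pair`, `train_discriminator_step`, `train_registration_step`) are placeholders, not our actual code.

```python
# Schematic training schedule: Adam (lr 1e-4), batch size 1,
# 20k autoencoder iterations followed by 40k joint iterations.
import tensorflow as tf

ae_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)

# Stage 1: pre-train the autoencoder on vessel masks of fixed images.
for step in range(20000):
    F_mask = sample_fixed_mask()              # hypothetical loader, batch 1
    with tf.GradientTape() as tape:
        recon = autoencoder(F_mask)           # reconstruct the mask
        ae_loss = tf.reduce_mean(tf.square(recon - F_mask))
    grads = tape.gradient(ae_loss, autoencoder.trainable_variables)
    ae_opt.apply_gradients(zip(grads, autoencoder.trainable_variables))

# Stage 2: joint training of the registration and discrimination networks,
# with the pre-trained encoder frozen as a shape-feature extractor.
for step in range(40000):
    F, M, F_mask, M_mask = sample_image_pair()        # hypothetical loader
    train_discriminator_step(F, M, F_mask, M_mask)    # minimizes Eq. (4)
    train_registration_step(F, M, F_mask, M_mask)     # minimizes Eq. (6)
```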

For pre-registration, a rigid registration is first performed on each image pair using the vessel masks. After that, the proposed AL-DIR model is trained, and the evaluation of deformable registration is run. The distance between corresponding landmarks of portal vein branches on \( \varvec{F}(\varvec{x}) \) and \( \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu})) \) is used to measure the registration error. Moreover, the Dice coefficient between the vessel regions of the fixed images and the deformed images is calculated:

$$ \text{Dice} = \frac{2\left| \varvec{F}(\varvec{x})_{\text{m}} \cap \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \right|}{\left| \varvec{F}(\varvec{x})_{\text{m}} \right| + \left| \varvec{M}(\varvec{g}(\varvec{x};\varvec{\mu}))_{\text{m}} \right|}. $$
(7)
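
Both evaluation metrics can be sketched as follows, assuming binary vessel masks and landmark coordinates in millimeters; the function and array names are illustrative.

```python
# Dice coefficient of Eq. (7) and target registration error (TRE).
import numpy as np

def dice(F_mask, M_mask_warped):
    inter = np.logical_and(F_mask, M_mask_warped).sum()
    return 2.0 * inter / (F_mask.sum() + M_mask_warped.sum())

def tre(landmarks_fixed, landmarks_warped):
    # Mean Euclidean distance between corresponding landmarks
    # (in mm, given 1.0 mm isotropic voxels).
    return np.linalg.norm(landmarks_fixed - landmarks_warped, axis=1).mean()
```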

We compare the proposed AL-DIR with two deep-learning-based methods: VoxelMorph-2 [17] and LabelReg [18]. We apply the same pre-registration results to both models and follow the implementation details and training parameters in [17, 18]. VoxelMorph is trained on image pairs without vessel masks, while LabelReg is trained using both image pairs and vessel masks.

To evaluate the effect of the proposed discrimination network and autoencoder network, we also evaluate the following combinations: (1) the registration network only (referred to as "Reg"), (2) the registration network with the discrimination network ("Reg + GAN") and (3) the registration network with the autoencoder network ("Reg + Enc").

3.2 Evaluation Results

Target registration errors (TREs) on portal vein branches and Dice coefficients of vessel regions are measured and listed in Table 1. As mentioned above, all the evaluated methods start deformable registration after the rigid registration (TRE = 10.6 mm, Dice = 0.33). VoxelMorph achieves relatively good TREs but gives the worst vessel region Dice, because it uses only image intensity information to train the registration model. On the other hand, LabelReg achieves a good vessel region Dice since it is trained directly on a vessel region Dice loss. However, the loss of LabelReg does not adopt any image similarity metric, so it gives a worse TRE than VoxelMorph. Compared to these methods, the registration network (Reg) in this work utilizes similarity metrics of both the image and the vessel masks, so it achieves better performance in both TRE and Dice. Moreover, the discrimination network and autoencoder network contribute better training guidance for the registration network. As a result, the combination of the three networks, i.e., the proposed AL-DIR, gives the best performance in both TRE and Dice. The running time of AL-DIR is 0.3 s on the GPU.

Table 1. Evaluation results.

Some examples of registration results of AL-DIR are shown in Fig. 5. The registration is run on 3D images, and 2D axial slices are shown here. Vessel masks before and after registration are highlighted by circles. We can see that AL-DIR can handle large shape changes and provides accurate deformation on user-defined, clinically important anatomical regions (liver vessels in this work).

Fig. 5. Examples of deformable registration results.

4 Conclusion

We proposed an adversarial learning framework for deep-learning-based deformable image registration. The end-to-end registration network can be trained to predict the displacement field of deformable registration without ground-truth spatial transformations. The single-pass prediction leverages information from image intensity and anatomical shape features and requires only the image pair as input. The discrimination network guides the registration toward more accurate and realistic deformation. Moreover, the autoencoder network extracts anatomical shape differences for better convergence. We applied our method to image fusion of 3D liver ultrasound images, and experimental results show that it achieves better performance than state-of-the-art deep-learning-based methods.