Abstract
We present an adversarial learning algorithm for deep-learning-based deformable image registration (DIR) and apply it to 3D liver ultrasound image fusion. We formulate DIR as a parametric optimization problem that aims to find the displacement field of the deformation. We propose an adversarial learning framework inspired by the generative adversarial network (GAN) to predict the displacement field without ground-truth spatial transformations. A convolutional neural network (CNN) with a spatial transform layer serves as the registration network and generates the registered image. Similarity metrics on image intensity and vessel masks are used as the loss function for training. We also optimize a discrimination network that measures the divergence between the registered image and the fixed image; its feedback guides the registration network toward more accurate and realistic deformations. Moreover, we incorporate an autoencoder network to extract anatomical features from vessel masks as a shape regularization. Our approach is end-to-end and only requires an image pair as input at registration time. Experiments show that the proposed method outperforms state-of-the-art deep-learning-based methods.
1 Introduction
Radiofrequency ablation (RFA) is a minimally invasive therapy for liver cancer. In RFA, doctors often use ultrasound image fusion during or after the operation to assess the treatment effect. Deformable image registration (DIR) is a key technology for image fusion because it can handle tissue deformation and body movement. In DIR, a dense, non-linear correspondence is estimated between a pair of 2D or 3D images. Most registration methods [1,2,3,4,5] solve an optimization problem that aligns voxels with similar appearance. However, this requires computing the image similarity in every optimization iteration, so it is computationally intensive and extremely slow in practice.
Several recent works proposed machine-learning-based methods that learn a transformation function to replace the iterative optimization in deformable registration. Most of these [6,7,8,9] rely on ground-truth or synthesized displacement fields, which are difficult to obtain in medical imaging applications. Recent works [10, 11, 17, 18] presented weakly supervised or unsupervised methods based on a convolutional neural network (CNN) and a spatial transformer network (STN) [12]. For better registration accuracy and robustness, generative adversarial networks (GANs) [13, 14] were adopted in [19,20,21]. However, the work in [20] requires initial registration ground-truth, and the methods in [19, 21] only work on training data of 3D ROIs or 2D synthesized slices.
In this work, we propose a framework of adversarial learning for deformable image registration (AL-DIR). The AL-DIR model can run deformable registration in one pass and can be trained without ground-truth spatial transformations. AL-DIR consists of three networks: a CNN-based registration network (generator) that uses similarity metrics of image intensity and vessel masks as its loss function; a discrimination network (discriminator) that distinguishes between the registered image and the fixed image; and an autoencoder that measures the anatomical shape difference before and after an iteration of registration. The main contributions of this work are as follows:
- We propose an end-to-end registration network that predicts the 3D displacement field of DIR without ground-truth spatial transformations. The single-pass prediction leverages image intensity and anatomical shape features (such as vessels and organs) and only requires an image pair as input.
- We present an adversarial learning framework to train the registration network. The discrimination network guides the registration toward more accurate and realistic deformations. Unlike most GAN-based supervised methods, our approach only requires vessel masks for training, so it is weakly supervised.
- We incorporate the encoder part of an autoencoder to extract anatomical shape differences for better convergence of the deformation.
2 Methodology
Image registration aims to find a spatial transformation between a moving image \( \varvec{M}\left( \varvec{x} \right) \) and a fixed image \( \varvec{F}\left( \varvec{x} \right) \), where \( \varvec{x} \) denotes the image voxel coordinates. Image registration can be considered an optimization problem that minimizes a cost function:

$$ \hat{\varvec{\mu}} = \mathop{\arg \min }\limits_{\varvec{\mu}} \; - {\text{S}}\left( {\varvec{F}\left( \varvec{x} \right),\varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)} \right) + \lambda C_{r} \left( \varvec{\mu} \right) $$

where \( \varvec{g} \) is the transformation function; \( \varvec{\mu} \) is the displacement field; \( {\text{S}} \) is the similarity measure between \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right) \); \( C_{r} \) is a regularization term that encourages smooth deformation; and \( \lambda \) is a weighting factor.
In this work, we present an adversarial learning framework to predict the displacement field. As shown in Fig. 1, the proposed framework consists of three networks: a registration network, a discrimination network and an autoencoder network. After training, the registration network only requires an image pair as input, as shown in Fig. 2.
2.1 Registration Network
In the registration network, we adopt a CNN architecture similar to U-Net [15]. In training, a pair of images to align, \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( \varvec{x} \right) \), are concatenated and input to the CNN. The displacement of each voxel in \( \varvec{M}\left( \varvec{x} \right) \), i.e. the displacement field \( \varvec{\mu} \), is output as the prediction. Since it is difficult to obtain the ground-truth of \( \varvec{\mu} \), we use similarity metrics of image intensity and vessel regions as the loss function of the registration network: an image similarity metric that penalizes appearance differences between \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right) \), and an anatomical region correspondence (the vessel region in our work) that guarantees deformation accuracy on important tissues and areas. We adopt the local cross-correlation (CC) of \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right) \) as the image similarity metric:

$$ {\text{CC}}\left( {\varvec{F},\varvec{M}\left( \varvec{g} \right)} \right) = \sum\limits_{\varvec{x}} {\frac{{\left( {\sum\nolimits_{\varvec{y} \in \varvec{N}\left( \varvec{x} \right)} {\hat{\varvec{F}}\left( \varvec{y} \right)\hat{\varvec{M}}\left( {\varvec{g}\left( {\varvec{y};\varvec{\mu}} \right)} \right)} } \right)^{2} }}{{\left( {\sum\nolimits_{\varvec{y} \in \varvec{N}\left( \varvec{x} \right)} {\hat{\varvec{F}}\left( \varvec{y} \right)^{2} } } \right)\left( {\sum\nolimits_{\varvec{y} \in \varvec{N}\left( \varvec{x} \right)} {\hat{\varvec{M}}\left( {\varvec{g}\left( {\varvec{y};\varvec{\mu}} \right)} \right)^{2} } } \right) + \epsilon }}} $$

where \( \hat{\varvec{F}}\left( \varvec{y} \right) \) and \( \hat{\varvec{M}}\left( {\varvec{g}\left( {\varvec{y};\varvec{\mu}} \right)} \right) \) denote images with the local mean intensity subtracted. The local mean is calculated over a local volume \( \varvec{N}\left( \varvec{x} \right) \) around each voxel \( \varvec{y} \), and \( \epsilon \) is a small constant to avoid numerical issues. The size of the local volume \( \varvec{N}\left( \varvec{x} \right) \) is set to \( 11 \times 11 \times 11 \) experimentally.
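For concreteness, the following NumPy sketch illustrates this windowed CC computation. The function name `local_cc` and the shortcut of subtracting the local mean before the windowed sums are our illustrative choices, not the paper's released implementation.

```python
# Illustrative sketch of the windowed local cross-correlation (CC) above.
import numpy as np
from scipy.ndimage import uniform_filter

def local_cc(fixed, warped, win=11, eps=1e-5):
    """Mean local CC over win^3 windows; higher means better alignment."""
    size = (win, win, win)
    f_hat = fixed - uniform_filter(fixed, size)    # F-hat: local mean removed
    w_hat = warped - uniform_filter(warped, size)  # M-hat(g(y; mu))
    # uniform_filter returns window means; the constant window-volume factor
    # cancels between numerator and denominator of the ratio below.
    cross = uniform_filter(f_hat * w_hat, size)
    var_f = uniform_filter(f_hat * f_hat, size)
    var_w = uniform_filter(w_hat * w_hat, size)
    return np.mean(cross * cross / (var_f * var_w + eps))

# Training minimizes the negative correlation: loss_cc = -local_cc(F, M_warped)
```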
We also measure the similarity of the anatomical region correspondence to guarantee accuracy on clinically important regions. In this work, we measure the \( l_{1} \) distance between the liver vessel masks \( \varvec{F}\left( \varvec{x} \right)_{\text{m}} \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} \). We also adopt the gradient of the displacement field as the regularization term \( C_{\text{r}} \left( \varvec{\mu} \right) \). Therefore, training the registration network requires corresponding vessel masks, i.e. it is weakly supervised. The registration network loss is calculated as follows:

$$ {\mathcal{L}}_{\text{sim}} = - {\text{CC}}\left( {\varvec{F},\varvec{M}\left( \varvec{g} \right)} \right) + \lambda_{1} \left\| {\varvec{F}\left( \varvec{x} \right)_{\text{m}} - \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} } \right\|_{1} + \lambda_{2} C_{\text{r}} \left( \varvec{\mu} \right) $$

where \( \lambda_{1} \) and \( \lambda_{2} \) are weighting factors.
The network consists of an encoder-decoder with skip connections that estimates \( \varvec{\mu} \) given \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( \varvec{x} \right) \). Figure 3 shows the network architecture. The input is formed by concatenating \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( \varvec{x} \right) \) into a two-channel 3D image of size \( N_{1} \times N_{2} \times N_{3} \times 2 \). We apply 3D convolution layers, each followed by ReLU activation, batch normalization and dropout. The convolution kernel size is fixed to \( 3 \times 3 \times 3 \). The network output \( \varvec{\mu} \) is of size \( N_{1} \times N_{2} \times N_{3} \times 3 \); the 3 channels represent the displacement of each voxel coordinate \( \varvec{x} \) in \( \varvec{M}\left( \varvec{x} \right) \).
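A minimal tf.keras sketch of a network shaped as described (two-channel input, three displacement channels out) is given below; the depth, filter counts and dropout rate are illustrative assumptions, since the exact configuration is given only in Fig. 3.

```python
# Sketch of a U-Net-like registration network under assumed hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # 3x3x3 convolution followed by ReLU, batch norm and dropout, as in the text
    x = layers.Conv3D(filters, 3, padding='same')(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(0.2)(x)

def registration_net(shape=(128, 112, 96)):
    inp = layers.Input(shape=shape + (2,))            # concatenated F and M
    e1 = conv_block(inp, 16)
    e2 = conv_block(layers.MaxPooling3D(2)(e1), 32)
    e3 = conv_block(layers.MaxPooling3D(2)(e2), 64)
    # Decoder with skip connections back to the encoder features
    d2 = conv_block(layers.Concatenate()([layers.UpSampling3D(2)(e3), e2]), 32)
    d1 = conv_block(layers.Concatenate()([layers.UpSampling3D(2)(d2), e1]), 16)
    # Three output channels: per-voxel displacement (dx, dy, dz)
    mu = layers.Conv3D(3, 3, padding='same', name='displacement')(d1)
    return tf.keras.Model(inp, mu)
```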
2.2 Discrimination Network
To better guide the training of the registration network, we propose a discrimination network and train it simultaneously with the registration network. As shown in Fig. 1, the discrimination network also takes a two-channel 3D image as input. The first channel is the fixed image \( \varvec{F}\left( \varvec{x} \right) \) in both the "real" and the "fake" case. The second channel is the vessel-segmented region \( \varvec{F}\left( \varvec{x} \right)_{\text{ves}} = \varvec{F}\left( \varvec{x} \right) \cdot \varvec{F}\left( \varvec{x} \right)_{\text{m}} \) in the "real" case and the corresponding region of the deformed image \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{ves}} = \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right) \cdot \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} \) in the "fake" case. The loss of the discrimination network is defined as a binary cross-entropy metric:

$$ {\mathcal{L}}_{\text{adv}} = - \log \left( {D\left( {\varvec{F}\left( \varvec{x} \right),\varvec{F}\left( \varvec{x} \right)_{\text{ves}} } \right)} \right) - \log \left( {1 - D\left( {\varvec{F}\left( \varvec{x} \right),\varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{ves}} } \right)} \right) $$
The discrimination network is optimized to reward perfect registration (by minimizing \( - \log \left( {D\left( {\varvec{F}\left( \varvec{x} \right),\varvec{F}\left( \varvec{x} \right)_{\text{ves}} } \right)} \right) \)) and to penalize inaccurate registration (by minimizing \( - \log \left( {1 - D\left( {\varvec{F}\left( \varvec{x} \right),\varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{ves}} } \right)} \right) \)). In this way, training the discrimination network does not require any ground-truth spatial transformation, which makes our approach easier to apply in practical medical imaging applications.
The architecture of the discrimination network is shown in Fig. 4(a). To achieve faster and better training convergence, the first half of the discrimination network adopts the same configuration as the encoder part (without skip connections) of the registration network and shares its weights during training. The remaining part consists of 3D convolution layers and fully connected layers. The convolution kernel size is fixed to \( 3 \times 3 \times 3 \). The output is the classification result, "real" or "fake". The loss \( {\mathcal{L}}_{\text{adv}} \) is not only used to fit the discrimination network but is also fed back to the registration network.
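The snippet below sketches this objective with standard tf.keras losses. The names (`disc`, `fixed_ves`, `warped_ves`) are ours, `disc` is assumed to map the two-channel input to a sigmoid probability, and the non-saturating generator term is one common choice for the feedback to the registration network rather than the paper's confirmed formulation.

```python
# Hedged sketch of the adversarial objective in Sect. 2.2.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(disc, fixed, fixed_ves, warped_ves):
    d_real = disc(tf.concat([fixed, fixed_ves], axis=-1))    # "real" case
    d_fake = disc(tf.concat([fixed, warped_ves], axis=-1))   # "fake" case
    # L_adv = -log(D(F, F_ves)) - log(1 - D(F, M(g)_ves))
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def registration_feedback(disc, fixed, warped_ves):
    # One common choice for the feedback to the registration network:
    # the non-saturating term -log(D(F, M(g)_ves)).
    d_fake = disc(tf.concat([fixed, warped_ves], axis=-1))
    return bce(tf.ones_like(d_fake), d_fake)
```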
2.3 Autoencoder of Anatomical Shape
The deformation in image fusion tasks must be smooth and realistic, especially on important tissues and regions in medical imaging applications. We therefore incorporate a loss term obtained by training an autoencoder network on the liver vessel masks of the fixed images, \( \varvec{F}\left( \varvec{x} \right)_{\text{m}} \). The encoder part non-linearly reduces the vessel mask to low-resolution features \( Enc\left( {\varvec{F}\left( \varvec{x} \right)_{\text{m}} } \right) \) [16], and the decoder part reconstructs the original vessel mask \( \varvec{F}\left( \varvec{x} \right)_{\text{m}} \) from \( Enc\left( {\varvec{F}\left( \varvec{x} \right)_{\text{m}} } \right) \). In this work, the autoencoder network is pre-trained, and the encoder part is leveraged to extract anatomical shape features of \( \varvec{F}\left( \varvec{x} \right)_{\text{m}} \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} \) in each iteration of registration training. The architecture of the autoencoder network is shown in Fig. 4(b). The loss retrieved from the encoder network is defined as the \( l_{2} \) distance between the features of \( \varvec{F}\left( \varvec{x} \right)_{\text{m}} \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} \):

$$ {\mathcal{L}}_{\text{enc}} = \left\| {Enc\left( {\varvec{F}\left( \varvec{x} \right)_{\text{m}} } \right) - Enc\left( {\varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} } \right)} \right\|_{2} $$
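As a sketch, the shape loss can be computed from the frozen encoder as below; the helper name `shape_loss` and the averaging over feature elements are our assumptions.

```python
# Minimal sketch of the anatomical shape loss from the frozen encoder.
import tensorflow as tf

def shape_loss(encoder, fixed_mask, warped_mask):
    feat_f = encoder(fixed_mask)    # Enc(F(x)_m)
    feat_w = encoder(warped_mask)   # Enc(M(g(x; mu))_m)
    # l2 distance between feature maps, averaged over elements
    return tf.reduce_mean(tf.square(feat_f - feat_w))
```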
2.4 Adversarial Learning
In this work, the registration network, the discrimination network and the pre-trained encoder are combined into an adversarial learning framework. The registration network and the discrimination network are trained simultaneously. The loss of the combined training procedure of the proposed AL-DIR is defined as follows:

$$ {\mathcal{L}}_{{{\text{AL}} - {\text{DIR}}}} = {\mathcal{L}}_{\text{sim}} +\uplambda_{\text{adv}} {\mathcal{L}}_{\text{adv}} +\uplambda_{\text{enc}} {\mathcal{L}}_{\text{enc}} $$

where \( \uplambda_{\text{adv}} = 1.0 \) and \( \uplambda_{\text{enc}} = 0.25 \) are set experimentally. By minimizing \( {\mathcal{L}}_{{{\text{AL}} - {\text{DIR}}}} \), an accurate, smooth and realistic deformation of \( \varvec{M}\left( \varvec{x} \right) \) can be obtained.
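In code, the combination is a simple weighted sum, as in the sketch below; `l_sim`, `l_adv` and `l_enc` stand for differentiable versions of the losses sketched in Sects. 2.1-2.3 (the NumPy CC sketch above would need a TensorFlow port for end-to-end training).

```python
# Weighted combination of the three loss terms with the weights given above.
LAMBDA_ADV = 1.0
LAMBDA_ENC = 0.25

def al_dir_loss(l_sim, l_adv, l_enc):
    # L_AL-DIR = L_sim + lambda_adv * L_adv + lambda_enc * L_enc
    return l_sim + LAMBDA_ADV * l_adv + LAMBDA_ENC * l_enc
```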
3 Experiments and Results
3.1 Materials and Training/Evaluation Details
We use clinical image data acquired in liver RFA surgery to evaluate the proposed method. In total, 510 image pairs from 98 patients are used, and 3-fold cross-validation is performed in the experiments. All images are resampled to a size of \( 128 \times 112 \times 96 \) with a resolution of \( 1.0 \times 1.0 \times 1.0 \) mm\(^{3}\). Corresponding masks of the liver vessels are annotated for training. The study was approved by the ethics committee of Hitachi group headquarters.
The proposed method is implemented in Keras with a TensorFlow backend. All experiments are run on an NVIDIA GTX 1080 Ti GPU with 11 GB memory. In the training stage, we use the Adam optimizer with a learning rate of \( 10^{-4} \). We set the batch size to 1 to reduce GPU memory usage. First, we train the autoencoder network on the vessel masks of the fixed images for 20,000 iterations. Then we train the registration network and the discrimination network, together with the resulting encoder, for 40,000 iterations.
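The two-stage schedule can be outlined as below; `mask_batches`, `pair_batches`, `autoencoder_encoder` and the two step functions are hypothetical stand-ins for the data pipeline, the models and the update steps sketched in the earlier snippets.

```python
# Hedged outline of the two-stage training schedule described above.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
BATCH_SIZE = 1   # small batch to limit GPU memory usage

# Stage 1: pre-train the autoencoder on vessel masks of the fixed images.
for it in range(20000):
    masks = next(mask_batches)
    train_autoencoder_step(autoencoder, masks, optimizer)

# Stage 2: train registration and discrimination networks jointly,
# using the frozen encoder part of the autoencoder for the shape loss.
for it in range(40000):
    F, M, F_mask, M_mask = next(pair_batches)
    train_adversarial_step(reg_net, disc, autoencoder_encoder,
                           F, M, F_mask, M_mask)
```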
As pre-registration, a rigid registration is first performed on each image pair using the vessel masks. After that, the proposed AL-DIR model is trained, and the evaluation of deformable registration is run. The distance between corresponding landmarks of portal vein branches on \( \varvec{F}\left( \varvec{x} \right) \) and \( \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right) \) is used to measure the registration error. Moreover, the Dice coefficient between the vessel regions of the fixed images and the deformed images is calculated:

$$ {\text{Dice}} = \frac{{2\left| {\varvec{F}\left( \varvec{x} \right)_{\text{m}} \cap \varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} } \right|}}{{\left| {\varvec{F}\left( \varvec{x} \right)_{\text{m}} } \right| + \left| {\varvec{M}\left( {\varvec{g}\left( {\varvec{x};\varvec{\mu}} \right)} \right)_{\text{m}} } \right|}} $$
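A minimal NumPy sketch of this Dice computation on binary vessel masks:

```python
# Dice coefficient between two binary masks; eps guards against empty masks.
import numpy as np

def dice(mask_a, mask_b, eps=1e-7):
    intersection = np.sum(mask_a * mask_b)
    return 2.0 * intersection / (np.sum(mask_a) + np.sum(mask_b) + eps)
```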
We compare the proposed AL-DIR with two deep-learning-based methods: VoxelMorph-2 [17] and LabelReg [18]. We apply the same pre-registration results to both models and follow the implementation details and training parameters in [17, 18]. VoxelMorph is trained on image pairs without vessel masks, and LabelReg is trained by using both image pairs and vessel masks.
In order to evaluate the effect of the proposed discrimination network and autoencoder network, we also evaluate the following combinations of networks: (1) the registration network only (referred to as "Reg"), (2) the registration network with the discrimination network (referred to as "Reg + GAN") and (3) the registration network with the autoencoder network (referred to as "Reg + Enc").
3.2 Evaluation Results
Target registration errors (TREs) on portal vein branches and Dice coefficients of the vessel regions are measured and listed in Table 1. As mentioned before, all the evaluated methods start deformable registration after the rigid registration (TRE = 10.6 mm, Dice = 0.33). VoxelMorph achieves relatively good TREs but gives the worst vessel region Dice, because it only uses image intensity information to train the registration model. On the other hand, LabelReg achieves a good vessel region Dice since it is trained directly on a vessel region Dice loss. However, the loss of LabelReg does not include any image similarity metric, so it gives a worse TRE than VoxelMorph. Compared to these methods, the registration network (Reg) in this work utilizes similarity metrics of both the images and the vessel masks, so it achieves better performance on both TRE and Dice. Moreover, the discrimination network and the autoencoder network provide better training guidance for the registration network. As a result, the combination of the three networks, i.e., the proposed AL-DIR, gives the best performance on both TRE and Dice. The running time of AL-DIR is 0.3 s on the GPU.
Some examples of AL-DIR registration results are shown in Fig. 5. The registration is run on 3D images; 2D axial slices are shown here. Vessel masks before and after registration are highlighted by circles. We can see that AL-DIR can handle large shape changes and provides accurate deformation on the important anatomical regions (liver vessels in this work) defined by the user.
4 Conclusion
We propose an adversarial learning framework for deep-learning-based deformable image registration. The end-to-end registration network can be trained to predict the displacement field of deformable registration without ground-truth spatial transformations. The single-pass prediction leverages image intensity and anatomical shape features and only requires an image pair as input. The discrimination network guides the registration toward more accurate and realistic deformations. Moreover, the autoencoder network extracts anatomical shape differences for better convergence. We apply our method to the fusion of 3D liver ultrasound images, and experimental results show that our method achieves better performance than state-of-the-art deep-learning-based methods.
References
Roche, A., Pennec, X., Malandain, G., Ayache, N.: Rigid registration of 3-D ultrasound with MR images: a new approach combining intensity and gradient information. IEEE Trans. Med. Imaging 20(10), 1038–1049 (2001)
Penney, G.P., Blackall, J.M., Hamady, M.S., Sabharwal, T.: Registration of freehand 3D ultrasound and magnetic resonance liver images. Med. Image Anal. 8, 81–91 (2004)
Wein, W., Brunke, S., et al.: Automatic CT-ultrasound registration for diagnostic imaging and image-guided intervention. Med. Image Anal. 12, 577–585 (2008)
Wein, W., Ladikos, A., Fuerst, B., Shah, A., Sharma, K., Navab, N.: Global registration of ultrasound to MRI using the LC2 metric for enabling neurosurgical guidance. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8149, pp. 34–41. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40811-3_5
Lange, T., Papenberg, N., et al.: 3D ultrasound-CT registration of the liver using combined landmark-intensity information. Int. J. CARS 4, 79–88 (2009)
Krebs, J., et al.: Robust non-rigid registration through agent-based action learning. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.Louis, Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 344–352. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_40
Rohé, M.-M., Datar, M., Heimann, T., Sermesant, M., Pennec, X.: SVF-Net: learning deformable image registration using shape matching. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 266–274. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_31
Sokooti, H., de Vos, B., Berendsen, F., Lelieveldt, B.P.F., Išgum, I., Staring, M.: Nonrigid image registration using multi-scale 3D convolutional neural networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 232–239. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_27
Yang, X., Kwitt, R., Styner, M., Niethammer, M.: Quicksilver: fast predictive image registration–a deep learning approach. NeuroImage 158, 378–396 (2017)
de Vos, B.D., Berendsen, F., Viergever, M.A.: End-to-end unsupervised deformable image registration with a convolutional neural network. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 204–212 (2017)
Li, H., Fan, Y.: Non-rigid image registration using fully convolutional networks with deep self-supervision. arXiv preprint arXiv:1709.00799 (2017)
Jaderberg, M., Simonyan, K., Zisserman, A.: Spatial transformer networks. In: NIPS 2015, pp. 2017–2025 (2015)
Goodfellow, I., et al.: Generative adversarial nets. In: NIPS 2014, pp. 2672–2680 (2014)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV 2017, pp. 2223–2232 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Oktay, O., Ferrante, E., Kamnitsas, K.: Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE Trans. Med. Imaging 37(2), 384–395 (2018)
Balakrishnan, G., Zhao, A., Sabuncu, M.R.: An unsupervised learning model for deformable medical image registration. In: CVPR 2018, pp. 9252–9260 (2018)
Hu, Y., Modat, M., Gibson, E., Ghavami, N.: Label-driven weakly-supervised learning for multimodal deformable image registration. In: ISBI 2018, pp. 1070–1074. IEEE (2018)
Fan, J., Cao, X., Xue, Z., Yap, P.-T., Shen, D.: Adversarial similarity network for evaluating image alignment in deep learning based registration. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 739–746. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_83
Hu, Y., et al.: Adversarial deformation regularization for training image registration neural networks. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 774–782. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_87
Mahapatra, D., Antony, B., Sedai, S.: Deformable medical image registration using generative adversarial networks. In: ISBI 2018, pp. 1449–1453. IEEE (2018)