Abstract
This study proposes a novel WGAN-GP-based model that, for the first time, performs bi-directional synthesis of medical images. GMM-based noise generated by the Glow model is incorporated into the WGAN-GP-based model to better reflect the heterogeneity commonly seen in medical images, which helps to produce high-quality synthesized images. Both the conventional “down-sampling”-like synthesis and the more challenging “up-sampling”-like synthesis are realized with the new model, which is evaluated both qualitatively and quantitatively against several popular deep learning-based models. A series of experiments on a multi-modal MRI database of 355 real patients from a dementia study supports the superiority of the new model from a statistical perspective.
1 Introduction
Medical image synthesis has become increasingly popular in recent years owing to the rapid development of deep learning techniques. Various imaging modalities, including T1/T2/DTI MRI images [1, 2], PET images [3], cardiac ultrasound images [4], and retinal images [5], have been successfully synthesized with deep learning models. Medical image synthesis is valuable because it can supply adequate high-quality medical image data that may be difficult to acquire through actual scanning (e.g., because of high acquisition costs or patient concerns). Hence, the notorious problem of overfitting in the training of sophisticated deep learning models is likely to be alleviated, and their generalization capabilities can be improved.
This study focuses on synthesizing arterial spin labeling (ASL) images [6], an important functional MRI modality in the diagnosis of dementia diseases. A novel WGAN-GP model with GMM-based noise generated by the Glow model (i.e., “WGAN-GP+Glow”) is proposed for the first time to fulfill this synthesis task. GANs (i.e., generative adversarial networks) [7] have become quite popular in contemporary deep learning studies. The basic idea of a GAN is to generate “pseudo-but-real” data through a generator and to distinguish the synthesized data from real data through a discriminator. The quality of the synthesized data is regarded as high when the discriminator can no longer tell the two apart. However, neither the Jensen-Shannon divergence nor the Kullback-Leibler divergence underlying the original GAN model reasonably reflects the actual difference between the distributions of synthesized data and real data. The Wasserstein distance was therefore introduced to replace these two conventional divergences, leading to GAN derivatives such as WGAN [8] and WGAN-GP [9]. WGAN-GP enforces the well-known Lipschitz constraint through a gradient penalty (rather than the weight clipping used in WGAN), which helps to avoid the vanishing gradient problem, and it is favored over WGAN in most recent studies. Therefore, WGAN-GP is also adopted in the new “WGAN-GP+Glow” model of this study.
The problem with adopting WGAN-GP alone for medical image synthesis is that the noise fed to the generator of WGAN-GP follows only a simple Gaussian distribution. This is limiting, because the heterogeneity commonly seen in medical images cannot be reflected by a single Gaussian distribution. In this study, GMM-based (i.e., Gaussian mixture model) noise is generated using the Glow model [10], and the generated GMM-based noise is then fed into WGAN-GP to complete the new “WGAN-GP+Glow” model. Moreover, the new “WGAN-GP+Glow” model is investigated for bi-directional synthesis between ASL images and structural MRI. Note that the spatial resolution of ASL images is often lower than that of structural MRI. Hence, synthesizing ASL images from structural MRI is described as a “downsampling” process, whereas synthesizing structural MRI from ASL images is considered an “upsampling” process. Generally speaking, the “upsampling” synthesis is more challenging than the conventional “downsampling” synthesis, since less input information is available. In this study, the superiority of the new “WGAN-GP+Glow” model is verified through both the conventional “downsampling” synthesis and the more challenging “upsampling” synthesis.
2 Methodology
In this section, technical details of the new “WGAN-GP+Glow” model for bi-directional image synthesis are elaborated. Subsect. 2.1 describes how GMM-based noise is generated via the Glow model. Subsect. 2.2 describes bi-directional image synthesis via WGAN-GP with GMM-based noise.
2.1 GMM-Based Noise Generation via the Glow Model
Given an image set X, \(x\in X\) denotes one image following the probability distribution \(P_X\). Likewise, let Z be the latent feature set of X, and let \(z\in Z\) represent the latent feature of image x, following the prior probability distribution \(P_Z\). The formula of \(P_X\) can be explicitly represented as Eq. 1.
Meanwhile, a pair of bi-jective functions \((f, g) = \{f:X\rightarrow Z, g=f^{-1}\}\) can be defined (i.e., the superscript \(-1\) indicates the inverse function). Supposing that P(x|z) follows the Dirac distribution \(\delta (x-g(z))\), Eq. 2 can be obtained, which shows that image x can be reconstructed and represented based on the latent feature z.
Therefore, Eq. 1 can be re-written as Eq. 3, in which \(\frac{\partial f(x)}{\partial x}\) denotes the Jacobian of f(x) with respect to x, whose determinant is required (i.e., detailed derivations are omitted here).
To avoid potential numerical underflow in Eq. 3, it is rewritten in logarithmic form as Eq. 4.
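For reference, the relations described in this derivation correspond to the standard change-of-variables identities for flow-based models. The following is a hedged reconstruction from the definitions above; the symbols follow the text, but the exact typeset form of Eqs. 1, 3 and 4 is an assumption:

\[
\begin{aligned}
P_X(x) &= \int P(x\mid z)\,P_Z(z)\,dz && \text{(marginal over the latent feature, cf. Eq. 1)}\\
P_X(x) &= \int \delta \big(x-g(z)\big)\,P_Z(z)\,dz && \text{(Dirac assumption on } P(x\mid z)\text{)}\\
P_X(x) &= P_Z\big(f(x)\big)\,\left|\det \frac{\partial f(x)}{\partial x}\right| && \text{(change of variables, cf. Eq. 3)}\\
\log P_X(x) &= \log P_Z\big(f(x)\big)+\log \left|\det \frac{\partial f(x)}{\partial x}\right| && \text{(logarithmic form, cf. Eq. 4)}
\end{aligned}
\]

The third line follows from the second by substituting \(z=f(x)\), which is valid because f is bi-jective with inverse g.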
Furthermore, to obtain the optimal f, the classic idea of MLE (i.e., maximum likelihood estimation) can be applied to \(\mathbb {E}_{x\sim P_X}(\log P_X(x))\), as in Eq. 5.
which suggests that the optimal f needs to be invertible (i.e., bi-jective). Meanwhile, the Jacobian determinant in Eq. 5 should also be convenient to calculate.
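To make the maximum-likelihood idea concrete, the following minimal sketch fits the simplest possible flow, a 1-D affine map \(f(x)=(x-m)/s\) with a standard normal prior. The map, the parameter names `m` and `s`, and the toy data are illustrative assumptions, not part of the model in this study.

```python
import numpy as np

def log_likelihood(x, m, s):
    """Mean log P_X(x) under the 1-D flow f(x) = (x - m) / s with a standard normal prior."""
    z = (x - m) / s                                   # latent feature z = f(x)
    log_prior = -0.5 * (z ** 2 + np.log(2 * np.pi))   # log P_Z(f(x))
    log_det = -np.log(abs(s))                         # log |df/dx| = log(1 / |s|)
    return float(np.mean(log_prior + log_det))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)         # toy data, not from the study

# For this flow the maximum-likelihood parameters have a closed form:
# the sample mean and the (biased) sample standard deviation.
m_hat, s_hat = x.mean(), x.std()
best = log_likelihood(x, m_hat, s_hat)
worse = log_likelihood(x, m_hat + 0.5, s_hat * 1.5)   # any other setting scores lower
```

The log-determinant term is what stops the optimizer from collapsing all data onto the mode of the prior: shrinking z by enlarging s is paid for through \(-\log |s|\).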
Note that the majority of contemporary “flow-based” generative models, including Glow [10], NICE [11], and RealNVP [12], represent f by stacking multiple simple bi-jections. Within each individual bi-jection, the input x is decomposed into two parts: \(x = [x_1, x_2]_{1\times 1}\) (i.e., \([\cdot ]_{1\times 1}\) represents the concatenation operation after permuting the elements of x using the \(1\times 1\) convolution [10]). Equation 6 can then be introduced.
in which, \(y=[y_1,y_2]\) indicates the output (i.e., \([\cdot ]\) also denotes the concatenation operation); s and t stand for the scaling and translation operations; \(\bigotimes \) represents element-wise multiplication. The Jacobian matrix of this affine transformation can be represented as the triangular matrix in Eq. 7.
Fortunately, the determinant of Eq. 7 is simply the product of the elements of s. Therefore, a flow can be built from multiple simple bi-jections \(h_i\) that are successively connected, as shown in Eq. 8.
Hence, the logarithmic form of the Jacobian determinant in Eq. 4 can be derived from Eq. 8, and the outcome is given in Eq. 9.
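For reference, an affine coupling step of the kind used in Glow [10] and RealNVP [12], written with the symbols above, commonly takes the following form together with its Jacobian. This is a hedged sketch consistent with the description of Eqs. 6 and 7, not necessarily their exact typeset form:

\[
y_1 = x_1,\qquad y_2 = s(x_1)\bigotimes x_2 + t(x_1),\qquad
\frac{\partial y}{\partial x} =
\begin{pmatrix}
I & 0\\
\frac{\partial y_2}{\partial x_1} & \mathrm{diag}\big(s(x_1)\big)
\end{pmatrix}
\]

Because the upper-right block is zero, the matrix is lower triangular, so its determinant depends only on the diagonal entries and never on the (possibly complicated) block \(\partial y_2/\partial x_1\).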
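The coupling step and the chaining of several steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the scale and translation networks are replaced by single hypothetical weight matrices `w_s` and `w_t`, and a fixed reversal of the elements stands in for the learned \(1\times 1\) convolution of Glow (the reversal has \(|\det |=1\), so it adds nothing to the log-determinant).

```python
import numpy as np

def coupling_forward(x, w_s, w_t):
    """One affine coupling step: y1 = x1, y2 = s(x1) * x2 + t(x1)."""
    x1, x2 = np.split(x, 2)
    s = np.exp(np.tanh(w_s @ x1))      # hypothetical scale net; exp keeps s > 0
    t = w_t @ x1                       # hypothetical translation net
    y = np.concatenate([x1, s * x2 + t])
    log_det = np.sum(np.log(s))        # log of the product of the elements of s
    return y, log_det

def coupling_inverse(y, w_s, w_t):
    """Exact inverse: s and t are recomputed from the untouched half y1 = x1."""
    y1, y2 = np.split(y, 2)
    s = np.exp(np.tanh(w_s @ y1))
    t = w_t @ y1
    return np.concatenate([y1, (y2 - t) / s])

def flow_forward(x, layers):
    """Chain several steps; the per-step log-determinants simply add up."""
    total = 0.0
    for w_s, w_t in layers:
        x, log_det = coupling_forward(x, w_s, w_t)
        x = x[::-1]                    # fixed reversal so both halves get transformed
        total += log_det
    return x, total

def flow_inverse(z, layers):
    """Undo the steps in reverse order."""
    for w_s, w_t in reversed(layers):
        z = coupling_inverse(z[::-1], w_s, w_t)
    return z

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(2, 2)), rng.normal(size=(2, 2))) for _ in range(3)]
x = rng.normal(size=4)
z, log_det = flow_forward(x, layers)
x_rec = flow_inverse(z, layers)        # recovers x up to floating-point error
```

Neither the inverse nor the log-determinant requires inverting or differentiating the scale and translation networks, which is why these networks can be arbitrarily complex in practice.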
The original Glow uses a flow of length K as its main structure, as illustrated on the left of Fig. 1. In the original Glow, the squeeze operation is performed within each individual block. In the Glow utilized in this study, however, the squeeze operation is carried out only within every other block (i.e., \(Z_1, Z_3, Z_5, Z_7\)), as illustrated in the bottom row of Fig. 1. In this way, more spatial affinity can be retained within synthesized images generated by the revised Glow of this study.
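As a brief illustration of what the squeeze operation does, the sketch below implements the common space-to-depth form used in flow-based models: each \(2\times 2\) spatial patch is moved into the channel dimension. The channel-first layout, the function names, and the factor of 2 are assumptions made for illustration.

```python
import numpy as np

def squeeze(x, factor=2):
    """Space-to-depth: (C, H, W) -> (C * factor**2, H // factor, W // factor)."""
    c, h, w = x.shape
    x = x.reshape(c, h // factor, factor, w // factor, factor)
    x = x.transpose(0, 2, 4, 1, 3)     # move the two in-patch offsets next to the channel axis
    return x.reshape(c * factor * factor, h // factor, w // factor)

def unsqueeze(x, factor=2):
    """Inverse of squeeze: (C * factor**2, H, W) -> (C, H * factor, W * factor)."""
    c, h, w = x.shape
    x = x.reshape(c // factor ** 2, factor, factor, h, w)
    x = x.transpose(0, 3, 1, 4, 2)     # interleave the offsets back into the spatial axes
    return x.reshape(c // factor ** 2, h * factor, w * factor)

x = np.arange(16).reshape(1, 4, 4)
y = squeeze(x)                         # shape (4, 2, 2): spatial size halves, channels quadruple
```

Each squeeze halves the spatial resolution, so applying it in only every other block leaves the latent features at a higher spatial resolution for longer, which is one plausible reading of the "spatial affinity" argument above.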
Moreover, the latent feature z of the final output of the utilized Glow can be concatenated as \(z=[z_1,z_2,\cdots ,z_8]\). Let \(P_{z_i}\) be the distribution that \(z_i\) follows (i.e., \(P_{z_i}\) is a normal distribution and \(z_i\) is Gaussian-distributed noise in this study). The distribution that z follows is then the weighted sum of the individual \(P_{z_i}\), i.e., a GMM-based distribution. This idea is represented in Eq. 10 after adding a translation transformation to each individual \(P_{z_i}\).
where K and \(\phi _i\) are the number of Gaussian distributions and the normalized weight of \(P_{z_i}\) (i.e., \(\sum _{i=1}^K \phi _i=1\)), respectively; is the translation transformation matrix of \(Z_i\).
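Sampling from such a weighted mixture can be sketched as follows. This is a minimal illustration that assumes unit-covariance Gaussian components and models the translation of each component as a plain shift vector; the names `weights` and `shifts` and all numeric values are hypothetical.

```python
import numpy as np

def sample_gmm_noise(n, weights, shifts, rng):
    """Draw n noise vectors from sum_i phi_i * N(shift_i, I)."""
    weights = np.asarray(weights, dtype=float)          # normalized weights phi_i
    shifts = np.asarray(shifts, dtype=float)            # shape (K, dim), one translation per component
    comp = rng.choice(len(weights), size=n, p=weights)  # pick a component for every sample
    base = rng.standard_normal((n, shifts.shape[1]))    # Gaussian-distributed noise
    return base + shifts[comp], comp

rng = np.random.default_rng(0)
weights = [0.5, 0.3, 0.2]                               # must sum to 1
shifts = [[-4.0, 0.0], [0.0, 0.0], [4.0, 4.0]]
z, comp = sample_gmm_noise(20000, weights, shifts, rng)
```

Unlike a single Gaussian, the resulting noise is multi-modal: its mean is the weighted average of the shifts, while individual samples cluster around the separate component centers.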
2.2 WGAN-GP Images Synthesis with GMM-Based Noise via Glow
The flowchart of the new “WGAN-GP+Glow” model for bi-directional synthesis between structural MRI and ASL images is illustrated in Fig. 2. Specifically, Z represents the GMM-based noise generated by Glow, which is fed into the generator of WGAN-GP in either direction of the bi-directional synthesis. The discriminator of WGAN-GP then tries to distinguish the synthesized image from the corresponding real one. Detailed structures of the generator and the discriminator of WGAN-GP are displayed in Fig. 3, in which the number of neurons in each FC layer is annotated. It can be observed that sub-structures of “FC+leakyReLu” are mainly adopted. The leaky ReLU function (i.e., \(f(x) = \max (\alpha x, x)\), \(\alpha \in [0, 1]\)) is chosen over other activation functions because it is more effective in dealing with the “dying ReLU” problem, in which a unit always outputs the same value for any input. For the generator, BN (i.e., batch normalization) is incorporated in each sub-structure of “FC+BN+leakyReLu” in order to avoid potential problems of vanishing/exploding gradients. BN also helps to speed up the training of the whole model.
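A minimal NumPy sketch of one “FC+BN+leakyReLu” sub-structure is given below for illustration. The slope `alpha=0.2`, the layer sizes, and the omission of the learned scale and shift of batch normalization are assumptions; the source does not state these details.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """f(x) = max(alpha * x, x); alpha = 0.2 is an assumed value."""
    return np.maximum(alpha * x, x)

def fc_bn_leaky_relu(x, w, b, alpha=0.2, eps=1e-5):
    """Fully connected layer, batch normalization over the batch axis, then leaky ReLU."""
    h = x @ w + b
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)  # BN without learned scale/shift
    return leaky_relu(h, alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))            # batch of 8 hypothetical noise vectors
w = rng.normal(size=(4, 16))
b = np.zeros(16)
out = fc_bn_leaky_relu(x, w, b)
```

Negative pre-activations are scaled by `alpha` instead of being zeroed, so the gradient never vanishes completely for any unit, which is the property that counters the “dying ReLU” problem.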
The objective function to be optimized in the training of the new “WGAN-GP+Glow” model can be described as Eq. 11.
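For reference, the standard WGAN-GP objective [9], written with the symbols defined in the term-by-term description that follows, reads as below. This is a hedged reconstruction; in particular, the notation \(\mathbb {P}_{\hat{x}}\) for the distribution of the interpolated samples is an assumption:

\[
L = \mathbb {E}_{z\sim \mathbb {P}_{z}}\big[D(G(z))\big]
- \mathbb {E}_{x\sim \mathbb {P}_{R}}\big[D(x)\big]
+ \gamma \,\mathbb {E}_{\hat{x}\sim \mathbb {P}_{\hat{x}}}\Big[\big(\Vert \nabla _{\hat{x}} D(\hat{x})\Vert _2 - 1\big)^2\Big]
\]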
in which, the 1st, 2nd and 3rd terms on the RHS (i.e., the right-hand side) of Eq. 11 denote the generator’s loss, the discriminator’s loss, and the gradient penalty, respectively; D and G represent the discriminator and the generator, respectively; z denotes the noise following the GMM distribution \(\mathbb {P}_{z}\); x represents the target data following the data distribution \(\mathbb {P}_{R}\); \(\hat{x} = \epsilon x + (1-\epsilon )G(z)\), in which \(\epsilon \) is drawn at random from [0, 1]; \(\gamma \) is the weight of the gradient penalty. The new “WGAN-GP+Glow” model is then trained with the popular Adam optimization algorithm.
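The three terms can be illustrated with a toy critic whose input gradient is available in closed form. This is a minimal sketch under stated assumptions: the critic \(D(x)=v\cdot \tanh (Wx)\), the random stand-ins for real and generated samples, and the weight `gamma=10.0` are all hypothetical choices, not the configuration of this study. A real implementation would obtain the gradient through automatic differentiation.

```python
import numpy as np

def critic(x, W, v):
    """Toy critic D(x) = v . tanh(W x), applied row-wise to a batch."""
    return np.tanh(x @ W.T) @ v

def critic_grad(x, W, v):
    """Analytic gradient of D with respect to each input row."""
    h = np.tanh(x @ W.T)
    return ((1.0 - h ** 2) * v) @ W

def wgan_gp_loss(real, fake, W, v, gamma, rng):
    """Critic loss: E[D(G(z))] - E[D(x)] + gamma * E[(||grad D(x_hat)|| - 1)^2]."""
    eps = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))   # one epsilon per sample
    x_hat = eps * real + (1.0 - eps) * fake                 # interpolate real and synthesized data
    grad_norm = np.linalg.norm(critic_grad(x_hat, W, v), axis=1)
    penalty = gamma * np.mean((grad_norm - 1.0) ** 2)       # pushes the gradient norm towards 1
    loss = critic(fake, W, v).mean() - critic(real, W, v).mean() + penalty
    return loss, penalty

rng = np.random.default_rng(0)
W, v = rng.normal(size=(6, 4)), rng.normal(size=6)
real = rng.normal(size=(32, 4))                 # stand-in for target images
fake = rng.normal(loc=0.5, size=(32, 4))        # stand-in for generator outputs G(z)
loss, penalty = wgan_gp_loss(real, fake, W, v, gamma=10.0, rng=rng)
```

The penalty is evaluated at the interpolated points \(\hat{x}\) rather than at the real or synthesized samples themselves, which softly enforces the Lipschitz constraint along the lines connecting the two distributions.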
3 Experimental Analyses
The dataset of this study was constructed from an ongoing population-based dementia study. There are 355 real patients in total in this dataset, including 38 AD (i.e., Alzheimer’s disease) patients, 185 MCI (i.e., mild cognitive impairments) patients, and 132 NCI (i.e., non-cognitive impairments) patients as normal controls. The average age of these patients is \(70.56\pm 7.20\) years, and informed consent was obtained from all patients for conducting this study. High-resolution MPRAGE (i.e., magnetization prepared rapid acquisition gradient echo) T1-weighted MRI images were acquired as structural MRI using a SIEMENS 3T TIM Trio MR scanner. Pseudo-continuous ASL scanning was applied to acquire ASL images from each individual patient as well. Acquisition parameters mainly include: labeling duration = 1500 ms, post-labeling delay = 1500 ms, TR/TE = 4000/9.1 ms, ASL voxel size = \(3 \times 3\times 5\) mm\(^3\), etc. Spatial resolutions of MRI images in this study are \(64 \times 64\times 21\). After the raw MRI data were obtained, a series of pre-processing steps was applied, including motion correction, brain extraction (i.e., skull removal), intra-modality registration (i.e., using the first slice as the reference) separately within ASL and structural MRI images, inter-modality registration between ASL and structural MRI, etc. These pre-processing steps were carried out with the well-known SPM toolbox.
A series of experiments is carried out in this study. Both qualitative and quantitative evaluations are conducted to assess the new model in bi-directional synthesis between ASL images and structural MRI. The new model is compared with several popular GAN-based and non-GAN-based synthesis models, including Glow, WGAN-GP, LSGAN, CycleGAN, ResNet-19 and CNN-7. Figure 4 illustrates synthesized ASL/synthesized structural MRI images and their corresponding difference images. Difference images are produced as the direct absolute difference between synthesized images and their gold standards, which are real images obtained via actual scanning. It can be observed from Fig. 4 that the ideal case belongs to Row 1, as there is no difference after subtracting the gold standard from itself. For Rows 2–8, the new “WGAN-GP+Glow” model yields the smallest difference when both synthesized ASL images and synthesized structural MRI outcomes are taken into consideration.
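The difference images described above amount to a voxel-wise absolute difference, as in the following minimal sketch. The random arrays merely stand in for a real and a synthesized volume of size \(64 \times 64\times 21\); they are not data from this study, and the summary statistic is an illustrative addition.

```python
import numpy as np

def difference_image(synth, real):
    """Voxel-wise absolute difference; casting to float avoids unsigned-integer wrap-around."""
    return np.abs(synth.astype(float) - real.astype(float))

rng = np.random.default_rng(0)
real = rng.integers(0, 256, size=(64, 64, 21), dtype=np.uint8)   # stand-in for a scanned volume
noise = rng.integers(-5, 6, size=real.shape)                      # stand-in synthesis error
synth = np.clip(real.astype(int) + noise, 0, 255).astype(np.uint8)

diff = difference_image(synth, real)
mean_abs_error = diff.mean()            # one possible scalar summary of a difference image
ideal = difference_image(real, real)    # all zeros, the "Row 1" case
```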
A more detailed quantitative experiment is carried out to differentiate the progression stages of dementia diseases (i.e., AD, MCI, and NCI) using synthesized ASL or synthesized structural MRI images obtained from all compared deep learning-based models. Five deep learning-based/shallow learning-based diagnosis tools are implemented, and the diagnosis accuracy in Table 1 is calculated as the average over all diagnosis outcomes obtained from 5-fold cross validation. In summary, apart from real structural MRI (i.e., \(68.65\%\pm 2.34\%\)) and real ASL images (i.e., \(67.35\%\pm 1.39\%\)), the new “WGAN-GP+Glow” model provides the highest accuracies among all compared models, based on synthesized structural MRI (i.e., \(66.96\%\pm 4.98\%\)) or synthesized ASL images (i.e., \(65.75\%\pm 4.65\%\)). Hence, the superiority of synthesized structural MRI/ASL images via the new “WGAN-GP+Glow” model is quantitatively supported by the dementia diagnosis test from a statistical point of view.
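The evaluation protocol, an accuracy averaged over 5-fold cross validation, can be sketched as follows. This is a minimal illustration under stated assumptions: a nearest-centroid classifier stands in for the diagnosis tools (which are not specified here), the features are synthetic toy data that only reuse the class sizes 38/185/132, and reporting the standard deviation across folds is an assumed reading of the "\(\pm \)" values.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def nearest_centroid_predict(x_train, y_train, x_test):
    """Hypothetical stand-in classifier: assign each test sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]

def cross_validate(x, y, k=5, seed=0):
    """Return the mean and standard deviation of the accuracy over k folds."""
    folds = kfold_indices(len(y), k, np.random.default_rng(seed))
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = nearest_centroid_predict(x[train], y[train], x[test])
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs)), float(np.std(accs))

rng = np.random.default_rng(0)
sizes = {0: 38, 1: 185, 2: 132}         # AD, MCI, NCI class sizes; the features below are synthetic
x = np.concatenate([rng.normal(loc=3.0 * c, size=(n, 5)) for c, n in sizes.items()])
y = np.concatenate([np.full(n, c) for c, n in sizes.items()])
mean_acc, std_acc = cross_validate(x, y, k=5)
```

Because the toy classes are well separated by construction, the accuracy obtained here says nothing about the accuracies reported in Table 1; only the protocol is illustrated.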
4 Conclusions
In this study, a novel “WGAN-GP+Glow” model is proposed to realize bi-directional synthesis between structural MRI and ASL images for the first time. GMM-based noise generated by Glow is incorporated into WGAN-GP to better reflect the heterogeneity that is commonly seen in medical images. Both the conventional “down-sampling” synthesis (i.e., from structural MRI to ASL images) and the more challenging “up-sampling” synthesis (i.e., from ASL images to structural MRI) are realized with the new model, which is evaluated both qualitatively and quantitatively against several popular GAN-based and conventional non-GAN-based deep learning models. The results suggest the superiority of the new model. Future work will investigate more sophisticated GAN models to enrich the details of synthesis outcomes in medical images.
References
1. Cordier, N., Delingette, H., Le, M., Ayache, N.: Extended modality propagation: image synthesis of pathological cases. IEEE-TMI 35(12), 2598–2608 (2016)
2. Huang, Y., et al.: Cross-modality image synthesis via weakly coupled and geometry co-regularized joint dictionary learning. IEEE-TMI 37(3), 815–827 (2018)
3. Polycarpou, I., et al.: Synthesis of realistic simultaneous positron emission tomography and magnetic resonance imaging data. IEEE-TMI 37(3), 703–711 (2018)
4. Zhou, Y., Giffard-Roisin, S., De Craene, M., et al.: A framework for the generation of realistic synthetic cardiac ultrasound and magnetic resonance imaging sequences from the same virtual patients. IEEE-TMI 37(3), 741–754 (2018)
5. Costa, P., Galdran, A., Meyer, M., et al.: End-to-end adversarial retinal image synthesis. IEEE-TMI 37(3), 781–791 (2018)
6. Huang, W., et al.: Arterial spin labeling images synthesis from sMRI using unbalanced deep discriminant learning. IEEE-TMI (2019). https://doi.org/10.1109/TMI.2019.2906677
7. Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial networks. In: NIPS, Montreal, pp. 2672–2680 (2014)
8. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv:1701.07875 (2017)
9. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv:1704.00028 (2017)
10. Kingma, D., Dhariwal, P.: Glow: generative flow with invertible 1x1 convolutions. In: NIPS, Vancouver, pp. 10236–10245 (2018)
11. Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. In: ICLR, San Diego (2015)
12. Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. In: ICLR, Toulon (2017)
Acknowledgements
This work was jointly supported by the grant 61862043 approved by National Natural Science Foundation of China, and the key grant 20181ACB20006 approved by Natural Science Foundation of Jiangxi Province.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, W., Luo, M., Liu, X., Zhang, P., Ding, H., Ni, D. (2019). Novel Bi-directional Images Synthesis Based on WGAN-GP with GMM-Based Noise Generation. In: Suk, HI., Liu, M., Yan, P., Lian, C. (eds) Machine Learning in Medical Imaging. MLMI 2019. Lecture Notes in Computer Science, vol 11861. Springer, Cham. https://doi.org/10.1007/978-3-030-32692-0_19
DOI: https://doi.org/10.1007/978-3-030-32692-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32691-3
Online ISBN: 978-3-030-32692-0
eBook Packages: Computer Science (R0)