Abstract
Image-to-image translation aims to transform an image from a source domain into a target domain, and can be applied to many specific tasks, such as image restoration and style transfer. As one of the most popular frameworks, CycleGAN uses a cycle consistency loss to translate between domains with unpaired training data, which greatly broadens its applicability. However, due to its under-constrained mapping and unsatisfactory network design, the results of CycleGAN are sometimes unnatural and unrealistic, and in some cases the synthesized images contain visible artifacts. In this paper, we propose an Enhanced CycleGAN (ECGAN) framework with a multi-scale relativistic average discriminator, which combines loss function design and network structure optimization so that the generated images have more natural textures and fewer unwanted artifacts. In the evaluation, besides quantitative full-reference image quality assessment metrics (such as PSNR and SSIM), we also report the Fréchet Inception Distance (FID), which is more consistent with human subjective assessment of naturalness, realism and diversity. Experiments on two benchmark datasets, CMP Facades and the CUHK face dataset, show that the proposed ECGAN framework outperforms state-of-the-art methods in both quantitative and qualitative evaluation.
1 Introduction
With the development of deep learning, image-to-image translation has received great attention in recent years. Many computer vision tasks, such as image restoration and image enhancement, can be handled within the image-to-image translation framework. The Generative Adversarial Network (GAN) is one of the most popular frameworks in this field, and it has brought significant improvements to a number of specific image processing tasks, such as image super-resolution [1,2,3], semantic segmentation [4], and image inpainting [5,6,7].
In a Generative Adversarial Network (GAN), the generator G is trained to increase the probability that fake data is judged real, while the discriminator D tries to decide whether an input image x is natural and realistic. Different from the standard discriminator D, Jolicoeur-Martineau [8, 9] argues that the discriminator should simultaneously increase the probability that fake data is real and decrease the probability that real data is real, and introduces this property by changing the discriminator into a relativistic form, which estimates the probability that a real image is relatively more natural and realistic than a fake one.
However, this discriminator is relativistic at only a single scale, which can still lead to artifacts in the generated image. To address this problem, we propose an Enhanced CycleGAN (ECGAN) framework with a multi-scale relativistic average discriminator. In summary, we improve the key components of the original CycleGAN model in three aspects:
-
We improve the discriminator by introducing the Multi-Scale Relativistic average Discriminator (MS-RaD), which tries to distinguish, at different scales, whether one image is more realistic than another, rather than whether an image is real or fake.
-
We add a complementary loss term computed between the synthesized images and the real images, in addition to the cycle-consistency loss computed between the cycled images and the real images. The resulting complementary cycle-consistent loss increases the quality of the results and reduces artifacts.
-
We introduce the Residual-in-Residual Dense Block (RRDB) into the generator G as its basic module. This gives the network higher capacity and makes it easier to train. Following [1], we do not apply Batch Normalization (BN) [10] or Instance Normalization (IN) [11] layers in the generator.
Extensive experiments show that these improvements help the generator create more realistic textures and details, and that ECGAN outperforms state-of-the-art methods on both similarity metrics and perceptual scores.
2 Related Works
For convenience of description, we define the task of image-to-image translation as follows: given a series of images \(\{I_{X}^i\}_{i=1}^N\) from a source domain X and \(\{I_{Y}^j\}_{j=1}^M\) from a target domain Y, we seek the mapping from the source domain X to the target domain Y, denoted by \({\mathcal {T}:I_X \rightarrow I_Y}\). Many methods [12,13,14,15] have been proposed to solve this problem. Among them, the best-known framework is pix2pix (proposed by Isola et al. [16]), a general-purpose framework based on Conditional GANs (cGANs).
Zhu et al. [17] presented CycleGAN, which uses a cycle-consistency loss to remove the paired-training-data requirement of pix2pix. While an image is translated from a source domain X to a target domain Y by \({\mathcal {T}:I_X \rightarrow I_Y}\), the cycle consistency loss enforces \({\mathcal {F}(\mathcal {T}(I_X))\approx I_X}\) by introducing an inverse mapping \({\mathcal {F}:I_Y\rightarrow I_X}\) (and vice versa). Besides the original adversarial losses and the cycle consistency loss, a perceptual loss [18] or an \(L_1\) loss is also used in many subsequent works to improve the quality of synthesized images [19].
Isola et al. [16] apply a U-Net [20] generator and a patch-based discriminator. Besides U-Net [20], ResNet [21] is another popular generator architecture, using the residual block as its basic module. Wang et al. [14] made it possible for pix2pix to synthesize \(2048\times 1024\) high-resolution photo-realistic images from semantic label maps [22, 23]. They improved the original pix2pix framework with a coarse-to-fine generator and a multi-scale discriminator. The coarse-to-fine generator can be decomposed into a global generator network and a local enhancer network. The multi-scale discriminator [14] is composed of three discriminators that share an identical structure but operate at three different scales; they are trained with real and synthesized images at the corresponding scales. The discriminator at the coarsest scale encourages globally consistent images, while the finest one encourages finer details.
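The input pyramid behind such a multi-scale discriminator can be sketched in a few lines of PyTorch. This is a minimal illustration and not the implementation of [14]; the constructor argument `make_discriminator`, the pooling settings and the number of scales are assumptions.

```python
import torch.nn as nn


class MultiScaleDiscriminator(nn.Module):
    """Runs structurally identical discriminators on progressively downsampled inputs."""

    def __init__(self, make_discriminator, num_scales=3):
        super().__init__()
        # One independent discriminator per scale (same structure, separate weights).
        self.discriminators = nn.ModuleList(
            [make_discriminator() for _ in range(num_scales)])
        # Each coarser scale sees the image downsampled by a factor of 2.
        self.downsample = nn.AvgPool2d(kernel_size=3, stride=2, padding=1,
                                       count_include_pad=False)

    def forward(self, x):
        outputs = []
        for discriminator in self.discriminators:
            outputs.append(discriminator(x))  # per-scale (patch) predictions
            x = self.downsample(x)            # coarser copy for the next scale
        return outputs
```

Here `make_discriminator` would return a single-scale patch discriminator, for example a 70×70 PatchGAN.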
Fig. 1. Complementary cycle-consistent loss. Left: loss with paired samples. Middle: original cycle-consistent loss. Right: complementary cycle-consistent loss. The complementary loss term is shown as the red dashed line and is calculated as an \(L_1\) loss between the synthesized images and the real images. (Color figure online)
3 ECGAN with Multi-Scale Relativistic Average Discriminator
Similar to the aforementioned definition, given a series of images \(\{I_{X}^i\}_{i=1}^N\) from source domain X and \(\{I_{Y}^j\}_{j=1}^M\) from target domain Y, our goal is to learn a mapping \({\mathcal {T}:I_X \rightarrow I_Y}\) from the source domain X to the target domain Y such that the distribution of images from \(\mathcal {T}(I_X)\) is indistinguishable from the distribution \(I_Y\).
We make three main modifications to the network structure: (1) we use a Multi-Scale Relativistic average Discriminator (MS-RaD); (2) we add a complementary loss term to the original cycle-consistent loss; (3) we replace the original residual blocks with Residual-in-Residual Dense Blocks (RRDB) and remove all BN and IN layers in the generator G. Each modification is described in detail below.
3.1 Multi-Scale Relativistic Average Discriminator
In a standard GAN, the discriminator is usually defined as \(D(x)=\sigma (C(x))\), where \(\sigma \) is the activation function and C(x) is the raw output of the discriminator. The simplest way to make it relativistic [8], i.e., to make the output of D depend on both real and fake data, is to define it as \(D(\hat{x})=\sigma (C(x_r)-C(x_f))\) for sampled real/fake pairs \(\hat{x}=(x_r,x_f)\), where the subscripts r and f denote real and fake images, respectively.
Rather than judging the probability that the input data is real, a relativistic discriminator measures the probability that the input data is relatively more realistic than a randomly sampled example of its counterpart. To make this judgment more global, we use the average over the counterpart, i.e. the relativistic average discriminator \(D(x)=\sigma (C(x_r)-E(C(x_f)))\), as shown in Fig. 2.
Fig. 2. The difference between the standard discriminator and the relativistic average discriminator [3]. Left: the standard discriminator judges the probability that the input data is real or fake. Right: the relativistic average discriminator judges the probability that a real (or fake) image is relatively more realistic than a fake (or real) one.
Here we extend the relativistic design of the discriminator to multiple scales. At each scale, the discriminator loss \(L_D\) can be formulated as:

$$L_{D} = -\mathbb{E}_{x_r\sim \mathbb {P}}\left[\log D(x_r)\right] - \mathbb{E}_{x_f\sim \mathbb {Q}}\left[\log \left(1-D(x_f)\right)\right]$$

where \(\mathbb {P}\) represents the distribution of real data, \(\mathbb {Q}\) represents the distribution of fake data, and D(x) is the relativistic average discriminator defined above, evaluated at x. The total MS-RaD loss is the sum of these per-scale losses, with each discriminator operating on an input downsampled to its scale.
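The paper does not include an implementation, but the loss above can be sketched in PyTorch as follows. This is a minimal sketch assuming each discriminator scale returns its raw output \(C(x)\) before the sigmoid; the function names and the decision to sum over scales are illustrative choices, not the authors' code.

```python
import torch
import torch.nn.functional as F


def rad_loss_single_scale(c_real, c_fake):
    """Relativistic average discriminator loss at one scale.

    c_real, c_fake: raw (pre-sigmoid) discriminator outputs C(x_r) and C(x_f).
    """
    real_rel = c_real - c_fake.mean()  # C(x_r) - E[C(x_f)]
    fake_rel = c_fake - c_real.mean()  # C(x_f) - E[C(x_r)]
    # -log(sigma(real_rel)) and -log(1 - sigma(fake_rel)) via BCE-with-logits.
    loss_real = F.binary_cross_entropy_with_logits(
        real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_rel, torch.zeros_like(fake_rel))
    return loss_real + loss_fake


def ms_rad_discriminator_loss(c_reals, c_fakes):
    """Sum the per-scale RaD losses over all discriminator scales."""
    return sum(rad_loss_single_scale(cr, cf)
               for cr, cf in zip(c_reals, c_fakes))
```

The generator's adversarial term is obtained by swapping the two target labels, as in the relativistic average GAN [8].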
3.2 Complementary Cycle-Consistent Loss
Our goal is to learn a mapping \({\mathcal {T}:I_X \rightarrow I_Y}\) such that the distribution of images from \(\mathcal {T}(I_X)\) is indistinguishable from the distribution \(I_Y\) using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping \({\mathcal {F}:I_Y\rightarrow I_X}\) and introduce a cycle consistency loss to enforce \({\mathcal {F}(\mathcal {T}(I_X))\approx I_X}\) (and vice versa).
The cycle-consistent loss helps learn the translation mapping \({\mathcal {T}:I_X \rightarrow I_Y}\) coupled with an inverse mapping \({\mathcal {F}:I_Y\rightarrow I_X}\), but the problem remains highly under-constrained. As depicted in Fig. 1, the cycle-consistent loss is computed as an \(L_1\) loss between the real images \(I_X\) and the cycled images \(I_X'' \triangleq \mathcal {F}(\mathcal {T}(I_X))\) in domain X, and between the real images \(I_Y\) and the cycled images \(I_Y'' \triangleq \mathcal {T}(\mathcal {F}(I_Y))\) in domain Y. We observe that the relationship between the real images \(I_X\), \(I_Y\) and the synthesized images \(I_X'\triangleq \mathcal {F}(I_Y)\), \(I_Y'\triangleq \mathcal {T}(I_X)\) is missing: the translations \(I_X'\) and \(I_Y'\) are not directly constrained. We therefore add this term to the original loss and call the resulting loss the complementary cycle-consistent loss.
The final complementary cycle-consistent loss is defined as follows:

$$L_{ccc} = \mathbb{E}\big[\Vert I_X'' - I_X\Vert _1\big] + \mathbb{E}\big[\Vert I_Y'' - I_Y\Vert _1\big] + \mathbb{E}\big[\Vert I_X' - I_X\Vert _1\big] + \mathbb{E}\big[\Vert I_Y' - I_Y\Vert _1\big]$$

where the first two terms are the original cycle-consistency terms and the last two are the complementary terms between the synthesized images and the corresponding real images.
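As a concrete illustration, a minimal PyTorch sketch of this loss might look as follows. The callables `T` (X→Y) and `Fm` (Y→X) stand for the two generators, and the weighting factors `lambda_cyc` and `lambda_comp` are illustrative assumptions, since the text does not state its weights.

```python
import torch.nn.functional as F


def complementary_cycle_loss(real_x, real_y, T, Fm,
                             lambda_cyc=10.0, lambda_comp=10.0):
    """Cycle-consistency L1 terms plus complementary L1 terms between
    synthesized and real images (the lambda weights are illustrative)."""
    fake_y = T(real_x)    # I_Y' = T(I_X)
    fake_x = Fm(real_y)   # I_X' = F(I_Y)
    cyc_x = Fm(fake_y)    # I_X'' = F(T(I_X))
    cyc_y = T(fake_x)     # I_Y'' = T(F(I_Y))

    cycle = F.l1_loss(cyc_x, real_x) + F.l1_loss(cyc_y, real_y)
    complementary = F.l1_loss(fake_x, real_x) + F.l1_loss(fake_y, real_y)
    return lambda_cyc * cycle + lambda_comp * complementary
```

Note that the complementary terms compare a synthesized image with the corresponding real image of the other domain, which is possible here because the CUHK and Facades datasets provide paired samples.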
3.3 Residual-in-Residual Dense Block
Several previous works observe that more layers and connections can boost performance [1, 24, 25]. Zhang et al. [24] employ a multi-level residual network. Wang et al. [3] propose a similar residual-in-residual structure, in which the network capacity increases thanks to the deeper and more complex structure. We replace the original residual block with this Residual-in-Residual Dense Block (RRDB). The basic structure of the RRDB is depicted in Fig. 3.
Fig. 3. The difference between the RB and the RRDB [3]. Left: residual block (RB) with or without Batch Normalization (BN) layers. Right: RRDB block (\(\beta \) is the residual scaling parameter).
We empirically observe that Batch Normalization layers [10] tend to introduce artifacts in image translation as well, similar to what Wang et al. [3] found in super-resolution and called BN artifacts. Removing all Batch Normalization layers achieves stable and consistent performance without such artifacts, and also reduces memory usage and computational cost dramatically.
The generator and discriminator architectures in our work are adapted from CycleGAN [17]. We replace the residual blocks with Residual-in-Residual Dense Blocks and use no Batch Normalization layers.
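A minimal PyTorch sketch of such an RRDB, following the ESRGAN-style design [3], is shown below; the channel width, growth rate and residual scaling \(\beta =0.2\) are assumptions rather than the exact settings of our network.

```python
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """Five densely connected 3x3 convolutions, LeakyReLU, no normalization."""

    def __init__(self, channels=64, growth=32, beta=0.2):
        super().__init__()
        self.beta = beta
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth,
                      growth if i < 4 else channels,
                      kernel_size=3, padding=1)
            for i in range(5)])
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        features = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(features, dim=1))
            if i < 4:
                features.append(self.act(out))
        return x + self.beta * out  # scaled inner residual connection


class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: dense blocks inside an outer residual."""

    def __init__(self, channels=64, beta=0.2):
        super().__init__()
        self.beta = beta
        self.dense_blocks = nn.Sequential(
            *[DenseBlock(channels) for _ in range(3)])

    def forward(self, x):
        return x + self.beta * self.dense_blocks(x)
```

Stacking several such blocks, with convolutional heads and tails for down- and up-sampling, gives a generator of the kind described above.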
4 Experiment
4.1 Datasets
To evaluate the effectiveness of the proposed method, we conducted experiments on two benchmark datasets, namely CUHK and Facades. We briefly describe the datasets below.
(1) CUHK Face Sketch Database [26]: The CUHK dataset consists of 188 pairs of face sketches and corresponding photos of students. We use its \(256\times 256\times 3\) resized and cropped version in our experiments; 100 pairs are used for training and the rest for testing.

(2) CMP Facade Database [27]: The Facade database contains facade images from different cities around the world in diverse styles. It includes 606 rectified image pairs of labels and corresponding facades with dimensions of \(256\times 256\times 3\); 400 pairs are used for training and the remaining pairs for testing.
4.2 Evaluation Metrics
Both quantitative and qualitative results are computed to evaluate the performance of the proposed method. The Structural Similarity Index (SSIM) [28], Peak Signal-to-Noise Ratio (PSNR) and the Fréchet Inception Distance (FID) [29] are adopted to assess the results.
PSNR and SSIM are Full Reference Image Quality Assessment (FR-IQA)[30, 31] metrics, usually applied to judge the similarity between the generated image and ground truth image in many tasks such as image enhancement [32, 33], image de-raining [34,35,36] and super-resolution [1,2,3, 37].
FID [29] calculates the Wasserstein distance between the synthesized and real images in the feature space of an Inception-v3 network [38]. A lower FID score means that the distributions of synthetic and real images are closer. FID has been shown to be more principled, comprehensive and consistent with human evaluation of the diversity and realism of synthesized images. Qualitative comparisons are shown in Figs. 4 and 5.
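For reference, a minimal sketch of how the full-reference metrics can be computed per image with recent versions of scikit-image is given below (assuming uint8 RGB arrays); the helper names are illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def full_reference_scores(generated, reference):
    """PSNR and SSIM for one generated uint8 RGB image against its ground truth."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim


def average_scores(pairs):
    """Mean (PSNR, SSIM) over an iterable of (generated, reference) arrays."""
    scores = np.array([full_reference_scores(g, r) for g, r in pairs])
    return scores.mean(axis=0)
```

FID, in contrast, is computed over sets of images rather than image pairs, e.g. by running an Inception-v3 feature extractor over the folders of real and generated results with a package such as pytorch-fid.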
4.3 Training Information
We train with the Adam [39] optimizer, setting \({\beta _1}= 0.9\), \({\beta _2} = 0.999\) and a learning rate of 0.0002. The generator and discriminator networks are trained jointly. We train for 200 epochs with batch size 1, which takes roughly 5 hours on an i7 CPU with 8 GB of memory and a GeForce GTX 1080 Ti GPU.
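A minimal sketch of this optimizer setup in PyTorch is shown below; the generator and discriminator arguments are placeholders for the networks described in Sect. 3.

```python
import itertools
import torch


def make_optimizers(G_xy, G_yx, D_x, D_y, lr=2e-4, betas=(0.9, 0.999)):
    """Joint Adam optimizers for both generators and both discriminators,
    using the hyperparameters reported above."""
    opt_g = torch.optim.Adam(
        itertools.chain(G_xy.parameters(), G_yx.parameters()),
        lr=lr, betas=betas)
    opt_d = torch.optim.Adam(
        itertools.chain(D_x.parameters(), D_y.parameters()),
        lr=lr, betas=betas)
    return opt_g, opt_d
```

Training then alternates, for each batch, between a generator step (adversarial plus complementary cycle-consistent loss) and a discriminator step (MS-RaD loss), for 200 epochs with batch size 1.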
4.4 Experimental Results and Analysis
4.5 Quantitative Evaluation
Table 1 lists the comparative results of CycleGAN [17], CSGAN [40] and the proposed method on the CUHK sketch-to-face and Facades labels-to-buildings datasets. In terms of the average SSIM and PSNR scores, the proposed method clearly improves over the others. The highest SSIM and PSNR scores show that the proposed method generates faces that are more structurally similar to the ground truth for a given sketch, and the lowest FID score means it also achieves the most perceptually convincing results.
4.6 Qualitative Evaluation
Figures 4 and 5 show qualitative comparisons on the CUHK and Facades datasets, respectively. The results generated by CycleGAN contain different types of artifacts, such as face distortion, color inconsistencies, and BN artifacts [3]. The results of CSGAN are better, but still suffer from BN artifacts on some images. These unwanted side effects are significantly reduced by our method, whose results are more natural, realistic and diverse with fewer artifacts.
4.7 Ablation Experiments
We also conduct an ablation study on each component. The results are shown in Table 2, where MS-RaD, CCCL and RRDB denote the Multi-Scale Relativistic average Discriminator, the Complementary Cycle-Consistent Loss and the Residual-in-Residual Dense Block, respectively. It can be seen from Table 2 that each of the three components helps achieve better results than omitting it.
5 Conclusion
We present an ECGAN model that achieves both structurally similar and perceptually better results. We first extend the Relativistic average Discriminator into a multi-scale form, which learns to judge at different scales whether one image is more realistic than another, leading the generator G to create more natural textures and details. A Complementary Cycle-Consistent Loss is added to the original CycleGAN objective to guide the translation in the desired direction and suppress unwanted artifacts. We also introduce a generator built from several RRDB blocks without batch normalization layers into the field of image-to-image translation. Experiments show that the proposed method is better than or comparable to recent state-of-the-art methods on two benchmark image translation datasets.
References
Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017
Wang, X., Yu, K., Dong, C., Loy, C.C.: Recovering realistic texture in image super-resolution by deep spatial feature transform. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Wang, X., et al.: Esrgan: enhanced super-resolution generative adversarial networks. In: The European Conference on Computer Vision Workshops (ECCVW), September 2018
Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1925–1934 (2017)
Jo, Y., Park, J.: Sc-fegan: face editing generative adversarial network with user’s sketch and color. arXiv preprint arXiv:1902.06838 (2019)
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892 (2018)
Nazeri, K., Ng, E., Joseph, T., Qureshi, F., Ebrahimi, M.: Edgeconnect: generative image inpainting with adversarial edge learning (2019)
Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734 (2018)
Jolicoeur-Martineau, A.: On relativistic f-divergences. arXiv preprint arXiv:1901.02474 (2019)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
Pumarola, A., Agudo, A., Martinez, A., Sanfeliu, A., Moreno-Noguer, F.: Ganimation: anatomically-aware facial animation from a single image. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Xian, W., et al.: Texturegan: Controlling deep image synthesis with texture patches. arXiv preprint arXiv:1706.02823 (2017)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Nah, S., Kim, T.H., Lee, K.M.: Deep multi-scale convolutional neural network for dynamic scene deblurring, pp. 257–265 (2016)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Cordts, M., et al.: The cityscapes dataset. In: CVPR Workshop on The Future of Datasets in Vision (2015)
Zhang, K., Sun, M., Han, T.X., Yuan, X., Guo, L., Liu, T.: Residual networks of residual networks: multilevel residual networks. IEEE Trans. Circuits Syst. Video Technol. 28(6), 1303–1314 (2018)
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301 (2018)
Wang, X., Tang, X.: Face photo-sketch synthesis and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 1955–1967 (2009)
Tyleček, R., Šára, R.: Spatial pattern templates for recognition of objects with regular structure. In: Weickert, J., Hein, M., Schiele, B. (eds.) GCPR 2013. LNCS, vol. 8142, pp. 364–374. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40602-7_39
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image process. 13(4), 600–612 (2004)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)
Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 15(11), 3440–3451 (2006)
Zhang, L., Zhang, L., Mou, X., Zhang, D.: A comprehensive evaluation of full reference image quality assessment algorithms. In: 2012 19th IEEE International Conference on Image Processing, pp. 1477–1480. IEEE (2012)
Chen, Y.S., Wang, Y.C., Kao, M.H., Chuang, Y.Y.: Deep photo enhancer: unpaired learning for image enhancement from photographs with gans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6306–6314 (2018)
Liu, R., Ma, L., Wang, Y., Zhang, L.: Learning converged propagations with deep prior ensemble for image enhancement. IEEE Trans. Image Process. 28(3), 1528–1543 (2019)
Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. arXiv preprint arXiv:1701.05957 (2017)
Fu, X., Huang, J., Ding, X., Liao, Y., Paisley, J.: Clearing the skies: a deep network architecture for single-image rain removal. IEEE Trans. Image Process. 26(6), 2944–2956 (2017)
Kim, J.H., Lee, C., Sim, J.Y., Kim, C.S.: Single-image deraining using an adaptive nonlocal means filter. In: 2013 IEEE International Conference on Image Processing, pp. 914–917. IEEE (2013)
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. Computer Science (2014)
Kancharagunta, K.B., Dubey, S.R.: Csgan: cyclic-synthesized generative adversarial networks for image-to-image transformation. arXiv preprint arXiv:1901.03554 (2019)
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China (No. 2018YFB1601102 and No. 2017YFC1601004), and Shenzhen special fund for the strategic development of emerging industries (No. JCYJ20170412170118573).