Abstract
In recent years, text-guided scene image manipulation has received extensive attention in the computer vision community. Most existing research focuses on manipulating a single object per step from the available conditioning information; manipulating multiple objects typically relies on iterative networks that progressively generate the edited image one instruction at a time. In this paper, we study a setting that allows users to edit multiple objects according to complex text instructions in a single stage, and we propose the Single-stage and Multi-object editing Generative Adversarial Network (SM-GAN) to address it. SM-GAN contains two key components: (i) the Spatial Semantic Enhancement (SSE) module deepens the spatial semantic prediction process so that all image positions that must be modified according to the text instructions are selected in a single stage; (ii) the Multi-object Detail Consistency (MDC) module learns adaptive modulation parameters for semantic attributes conditioned on the text instructions, effectively fusing text and image features and ensuring that the generated visual attributes are aligned with the instructions. We construct the Multi-CLEVR dataset for single-stage construction of CLEVR scene images from complex text instructions. Extensive experiments on the Multi-CLEVR and CoDraw datasets demonstrate the superior performance of the proposed method.
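The abstract does not give the exact formulation of the MDC module; as a rough illustration of what "learning adaptive modulation parameters conditioned on text instructions" can look like in practice, the sketch below shows a generic FiLM-style affine modulation in PyTorch. The class name `TextAdaptiveModulation`, the tensor dimensions, and the `(1 + gamma)` parameterization are illustrative assumptions, not the paper's actual design.

```python
# A minimal, hypothetical sketch of FiLM-style text-conditioned modulation;
# not the paper's MDC module, just the general technique it alludes to.
import torch
import torch.nn as nn

class TextAdaptiveModulation(nn.Module):
    """Predicts per-channel scale (gamma) and shift (beta) from a sentence
    embedding and applies them as an affine transform to image features."""

    def __init__(self, text_dim: int, feat_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_channels)  # per-channel scale
        self.to_beta = nn.Linear(text_dim, feat_channels)   # per-channel shift

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim) instruction embedding
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        # (1 + gamma) keeps the identity mapping reachable at initialization
        return (1.0 + gamma) * feat + beta

# Usage: modulate a 64-channel feature map with a 256-d instruction embedding.
mod = TextAdaptiveModulation(text_dim=256, feat_channels=64)
feat = torch.randn(2, 64, 16, 16)
text_emb = torch.randn(2, 256)
out = mod(feat, text_emb)
assert out.shape == feat.shape
```

Applying such a text-conditioned affine transform at several generator resolutions is one common way to fuse sentence-level conditioning into spatial features, which is the role the abstract assigns to MDC.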
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, R., Wu, L., Dong, P., He, M. (2024). SM-GAN: Single-Stage and Multi-object Text Guided Image Editing. In: Rudinac, S., et al. (eds.) MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol. 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_16
DOI: https://doi.org/10.1007/978-3-031-53308-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53307-5
Online ISBN: 978-3-031-53308-2
eBook Packages: Computer Science, Computer Science (R0)