
SM-GAN: Single-Stage and Multi-object Text Guided Image Editing

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14555)


Abstract

In recent years, text-guided scene image manipulation has received extensive attention in the computer vision community. Most existing research focuses on manipulating a single object per step from the available conditioning information; manipulating multiple objects typically relies on iterative networks that progressively generate the edited image, one instruction at a time. In this paper, we study a setting in which users edit multiple objects from complex text instructions in a single stage, and we propose the Single-stage and Multi-object editing Generative Adversarial Network (SM-GAN) to tackle this setting. SM-GAN contains two key components: (i) the Spatial Semantic Enhancement (SSE) module deepens the spatial semantic prediction process to select, in a single stage, all positions in the image space that must be modified according to the text instructions; (ii) the Multi-object Detail Consistency (MDC) module learns semantic-attribute-adaptive modulation parameters conditioned on the text instructions to effectively fuse text and image features, ensuring that the generated visual attributes align with the instructions. We also construct the Multi-CLEVR dataset for single-stage editing of CLEVR scene images with complex text instructions. Extensive experiments on the Multi-CLEVR and CoDraw datasets demonstrate the superior performance of the proposed method.
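To make the abstract's two components more concrete, below is a minimal PyTorch sketch of the general techniques it describes: text-conditioned affine modulation of image features (in the spirit of the MDC module) and a spatial mask head that selects the positions to edit (in the spirit of the SSE module). All class names, tensor shapes, and design choices here are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch only: FiLM-style text-conditioned modulation plus a
# spatial edit-mask head. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TextConditionedModulation(nn.Module):
    """Predicts per-channel scale and shift from a text embedding and applies
    them to image features, a common way to fuse text and image conditions."""

    def __init__(self, text_dim: int, feat_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_channels)  # scale
        self.to_beta = nn.Linear(text_dim, feat_channels)   # shift

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text: (B, text_dim) embedding
        gamma = self.to_gamma(text).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return (1 + gamma) * feat + beta

class SpatialMaskHead(nn.Module):
    """Predicts a per-pixel edit probability from the fused features, loosely
    analogous to selecting the image positions that should change."""

    def __init__(self, feat_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.conv(feat))  # (B, 1, H, W), values in [0, 1]

if __name__ == "__main__":
    feat = torch.randn(2, 64, 16, 16)  # toy image features
    text = torch.randn(2, 256)         # toy instruction embedding
    fused = TextConditionedModulation(text_dim=256, feat_channels=64)(feat, text)
    mask = SpatialMaskHead(feat_channels=64)(fused)
    print(mask.shape)  # torch.Size([2, 1, 16, 16])

In a single-stage, multi-object setting, one would expect such a mask to activate on every region named by the instruction at once, rather than on one object per iteration; the edited content could then be composited as mask * generated + (1 - mask) * input.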



Author information


Correspondence to Lei Wu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, R., Wu, L., Dong, P., He, M. (2024). SM-GAN: Single-Stage and Multi-object Text Guided Image Editing. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_16


  • DOI: https://doi.org/10.1007/978-3-031-53308-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53307-5

  • Online ISBN: 978-3-031-53308-2

  • eBook Packages: Computer Science, Computer Science (R0)
