Abstract
In recent years, text-guided scene image manipulation has received extensive attention in the computer vision community. Most existing research focuses on manipulating a single object per step from the available conditioning information; manipulating multiple objects typically relies on iterative networks that progressively generate the edited image one instruction at a time. In this paper, we study a setting that allows users to edit multiple objects according to complex text instructions in a single stage, and we propose the Single-stage and Multi-object editing Generative Adversarial Network (SM-GAN) to address it. SM-GAN contains two key components: (i) the Spatial Semantic Enhancement (SSE) module deepens the spatial semantic prediction process so that all image positions that must be modified according to the text instructions are selected in a single stage; (ii) the Multi-object Detail Consistency (MDC) module learns adaptive modulation parameters for semantic attributes conditioned on the text instructions, effectively fusing text and image features and ensuring that the generated visual attributes are aligned with the instructions. We construct the Multi-CLEVR dataset for single-stage construction of CLEVR scene images from complex text instructions. Extensive experiments on the Multi-CLEVR and CoDraw datasets demonstrate the superior performance of the proposed method.
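The abstract does not give the exact formulation of the MDC module; as a rough illustration of what "learning adaptive modulation parameters conditioned on text instructions" can look like in practice, the sketch below shows a generic FiLM-style affine modulation in PyTorch. The class name `TextAdaptiveModulation`, the tensor dimensions, and the `(1 + gamma)` parameterization are illustrative assumptions, not the paper's actual design.

```python
# A minimal, hypothetical sketch of FiLM-style text-conditioned modulation;
# not the paper's MDC module, just the general technique it alludes to.
import torch
import torch.nn as nn

class TextAdaptiveModulation(nn.Module):
    """Predicts per-channel scale (gamma) and shift (beta) from a sentence
    embedding and applies them as an affine transform to image features."""

    def __init__(self, text_dim: int, feat_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_channels)  # per-channel scale
        self.to_beta = nn.Linear(text_dim, feat_channels)   # per-channel shift

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim) instruction embedding
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        # (1 + gamma) keeps the identity mapping reachable at initialization
        return (1.0 + gamma) * feat + beta

# Usage: modulate a 64-channel feature map with a 256-d instruction embedding.
mod = TextAdaptiveModulation(text_dim=256, feat_channels=64)
feat = torch.randn(2, 64, 16, 16)
text_emb = torch.randn(2, 256)
out = mod(feat, text_emb)
assert out.shape == feat.shape
```

Applying such a text-conditioned affine transform at several generator resolutions is one common way to fuse sentence-level conditioning into spatial features, which is the role the abstract assigns to MDC.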
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, R., Wu, L., Dong, P., He, M. (2024). SM-GAN: Single-Stage and Multi-object Text Guided Image Editing. In: Rudinac, S., et al. (eds.) MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol. 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_16
DOI: https://doi.org/10.1007/978-3-031-53308-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53307-5
Online ISBN: 978-3-031-53308-2
eBook Packages: Computer Science, Computer Science (R0)