Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

Fang, Shuangkang; Wang, Yufeng; Tsai, Yi-Hsuan; Yang, Yi; Ding, Wenrui; Zhou, Shuchang; Yang, Ming-Hsuan

doi:10.1007/978-3-031-72946-1_12

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15100))

Included in the following conference series:

European Conference on Computer Vision

243 Accesses

Abstract

Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing schemes for 3D scene editing still exhibit certain shortcomings, hindering their further interactive design. Such schemes typically adhere to fixed input patterns, limiting users’ flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models and require intricate pipeline design to integrate these models into 3D reconstruction processes. To address the aforementioned issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, which is centered around a large language model that allows for arbitrary textual input from users and interprets their intentions, subsequently facilitating the autonomous invocation of the corresponding visual expert models. Furthermore, we design a scheme utilizing Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design achieves complete decoupling between the 2D editing and 3D reconstruction processes, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without necessitating intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing visual effects, possessing strong scene comprehension and multi-round dialog capabilities. The source codes can be made available at https://sk-fun.fun/CE3D.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: JPY 3498; Price includes VAT (Japan)

eBook: JPY 8465; Price includes VAT (Japan)

Softcover Book: JPY 10581; Price includes VAT (Japan)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation

Article 14 January 2025

LatentEditor: Text Driven Local Editing of 3D Scenes

3DEgo: 3D Editing on the Go!

References

Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2LIVE: text-driven layered image and video editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13675, pp. 707–723 . Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19784-0_41
Boss, M., Braun, R., Jampani, V., Barron, J.T., Liu, C., Lensch, H.: Nerd: neural reflectance decomposition from image collections. In: ICCV, pp. 12684–12694 (2021)
Google Scholar
Boss, M., Jampani, V., Braun, R., Liu, C., Barron, J., Lensch, H.: Neural-pil: neural pre-integrated lighting for reflectance decomposition. In: NeurIPS, pp. 10691–10704 (2021)
Google Scholar
Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: learning to follow image editing instructions. arXiv preprint arXiv:2211.09800 (2022)
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS, pp. 1877–1901 (2020)
Google Scholar
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR, pp. 7291–7299 (2017)
Google Scholar
Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: tensorial radiance fields. arXiv preprint arXiv:2203.09517 (2022)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS, pp. 8780–8794 (2021)
Google Scholar
Driess, D., et al.: Palm-e: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
Fang, J., Wang, J., Zhang, X., Xie, L., Tian, Q.: Gaussianeditor: editing 3d gaussians delicately with text instructions. arXiv preprint arXiv:2311.16037 (2023)
Fang, S., et al.: Editing 3d scenes via text prompts without retraining. arXiv preprint arXiv: 2309.04917 (2023)
Fang, S., et al.: PVD-AL: progressive volume distillation with active learning for efficient conversion between different nerf architectures. arXiv preprint arXiv:2304.04012 (2023)
Fang, S., Xu, W., Wang, H., Yang, Y., Wang, Y., Zhou, S.: One is all: bridging the gap between neural radiance fields architectures with progressive volume distillation. arXiv preprint arXiv:2211.15977 (2022)
Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373 (2023)
Gu, G., Ko, B., Go, S., Lee, S.H., Lee, J., Shin, M.: Towards light-weight and real-time line segment detection. In: AAAI, pp. 726–734 (2022)
Google Scholar
Gu, J., Liu, L., Wang, P., Theobalt, C.: Stylenerf: a style-based 3d aware generator for high-resolution image synthesis. In: ICLR, pp. 1–25 (2022)
Google Scholar
Han, F., Ye, S., He, M., Chai, M., Liao, J.: Exemplar-based 3d portrait stylization. IEEE TVCG 29(2), 1371–1383 (2021)
Google Scholar
Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-nerf2nerf: editing 3d scenes with instructions. arXiv preprint arXiv:2303.12789 (2023)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS, pp. 6840–6851 (2020)
Google Scholar
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Höllein, L., Johnson, J., Nießner, M.: Stylemesh: style transfer for indoor 3d scene reconstructions. In: CVPR, pp. 6198–6208 (2022)
Google Scholar
Hu, Y., Hua, H., Yang, Z., Shi, W., Smith, N.A., Luo, J.: Promptcap: prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699 (2022)
Huang, H.P., Tseng, H.Y., Saini, S., Singh, M., Yang, M.H.: Learning to stylize novel views. In: ICCV, pp. 13869–13878 (2021)
Google Scholar
Huang, Y.H., He, Y., Yuan, Y.J., Lai, Y.K., Gao, L.: Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In: CVPR, pp. 18342–18352 (2022)
Google Scholar
Kasten, Y., Ofri, D., Wang, O., Dekel, T.: Layered neural atlases for consistent video editing. ACM TOG 40(6), 1–12 (2021)
Article Google Scholar
Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276 (2022)
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM TOG 42(4) (2023)
Google Scholar
Kirillov, A., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: benchmarking large-scale scene reconstruction. ACM TOG 36(4), 1–13 (2017)
Article Google Scholar
Kobayashi, S., Matsumoto, E., Sitzmann, V.: Decomposing nerf for editing via feature field distillation. arXiv preprint arXiv:2205.15585 (2022)
Li, D., Li, J., Hoi, S.C.: Blip-diffusion: pre-trained subject representation for controllable text-to-image generation and editing. arXiv preprint arXiv:2305.14720 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML, pp. 12888–12900 (2022)
Google Scholar
Li, Y., Lin, Z.H., Forsyth, D., Huang, J.B., Wang, S.: Climatenerf: physically-based neural rendering for extreme climate synthesis. arXiv preprint arXiv:2211.13226 (2022)
Liu, S., et al.: Grounding DINO: marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy, A., Duckworth, D.: Nerf in the wild: neural radiance fields for unconstrained photo collections. In: CVPR, pp. 7210–7219 (2021)
Google Scholar
Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM TOG 38(4), 1–14 (2019)
Article Google Scholar
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
Chapter Google Scholar
Mu, F., Wang, J., Wu, Y., Li, Y.: 3D photo stylization: learning to generate stylized novel views from a single image. In: CVPR, pp. 16273–16282 (2022)
Google Scholar
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989 (2022)
Nichol, A.Q., et al.: Glide: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML, pp. 16784–16804 (2022)
Google Scholar
OpenAI: Introducing ChatGPT (2022). https://openai.com/blog/chatgpt
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Google Scholar
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV, pp. 12179–12188 (2021)
Google Scholar
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
Google Scholar
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS, pp. 36479–36494 (2022)
Google Scholar
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Sun, C., Han, J., Deng, W., Wang, X., Qin, Z., Gould, S.: 3D-GPT: procedural 3d modeling with large language models. arXiv preprint arXiv:2310.12945 (2023)
Surís, D., Menon, S., Vondrick, C.: Vipergpt: visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128 (2023)
Tang, J., Chen, X., Wang, J., Zeng, G.: Compressible-composable nerf via rank-residual decomposition. arXiv preprint arXiv:2205.14870 (2022)
Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Wang, C., Chai, M., He, M., Chen, D., Liao, J.: Clip-nerf: text-and-image driven manipulation of neural radiance fields. In: CVPR, pp. 3835–3844 (2022)
Google Scholar
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: Nerf-art: text-driven neural radiance fields stylization. arXiv preprint arXiv:2212.08070 (2022)
Wang, Q., et al.: Ibrnet: learning multi-view image-based rendering. In: CVPR, pp. 4690–4699 (2021)
Google Scholar
Wang, Z., Huang, H., Zhao, Y., Zhang, Z., Zhao, Z.: Chat-3D: data-efficiently tuning large language model for universal dialogue of 3D scenes. arXiv preprint arXiv:2308.08769 (2023)
Wang, Z., et al.: Language models with image descriptors are strong few-shot video-language learners. In: NeurIPS, pp. 8483–8497 (2022)
Google Scholar
Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: NeurIPS, pp. 24824–24837 (2022)
Google Scholar
Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual chatgpt: talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023)
Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV, pp. 1395–1403 (2015)
Google Scholar
Xu, Z., Baojie, X., Guoxin, W.: Canny edge detection based on open CV. In: ICEMI, pp. 53–56 (2017)
Google Scholar
Yang, Z., et al.: An empirical study of GPT-3 for few-shot knowledge-based VQA. In: AAAI, pp. 3081–3089 (2022)
Google Scholar
Yang, Z., et al.: MM-REACT: prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
Ye, M., Danelljan, M., Yu, F., Ke, L.: Gaussian grouping: segment and edit anything in 3d scenes. arXiv preprint arXiv:2312.00732 (2023)
Yin, Y., Fu, Z., Yang, F., Lin, G.: Or-NeRF: object removing from 3d scenes guided by multiview segmentation with neural radiance fields. arXiv preprint arXiv:2305.10503 (2023)
Zeng, A., et al.: Socratic models: composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598 (2022)
Zhang, K., et al.: ARF: artistic radiance fields. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13691, pp. 717–733. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_41
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV, pp. 3836–3847 (2023)
Google Scholar
Zhou, Q.Y., Koltun, V.: Color map optimization for 3D reconstruction with consumer depth cameras. ACM TOG 33(4), 1–10 (2014)
Article Google Scholar
Zhou, S., Li, C., Chan, K.C., Loy, C.C.: Propainter: improving propagation and transformer for video inpainting. In: ICCV, pp. 10477–10486 (2023)
Google Scholar

Download references

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant U20B2042 and 62076019.

Author information

Authors and Affiliations

School of Electrical and Information Engineering, Beihang University, Beijing, China
Shuangkang Fang
Institute of Unmanned System, Beihang University, Beijing, China
Yufeng Wang & Wenrui Ding
Google, Mountain View, USA
Yi-Hsuan Tsai & Ming-Hsuan Yang
Megvii, Beijing, China
Yi Yang & Shuchang Zhou
University of California, Merced, Merced, USA
Ming-Hsuan Yang

Authors

Shuangkang Fang
View author publications
You can also search for this author in PubMed Google Scholar
Yufeng Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yi-Hsuan Tsai
View author publications
You can also search for this author in PubMed Google Scholar
Yi Yang
View author publications
You can also search for this author in PubMed Google Scholar
Wenrui Ding
View author publications
You can also search for this author in PubMed Google Scholar
Shuchang Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Ming-Hsuan Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yufeng Wang .

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Palo Alto, CA, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 26569 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Fang, S. et al. (2025). Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15100. Springer, Cham. https://doi.org/10.1007/978-3-031-72946-1_12

Download citation

DOI: https://doi.org/10.1007/978-3-031-72946-1_12
Published: 02 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72945-4
Online ISBN: 978-3-031-72946-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation

LatentEditor: Text Driven Local Editing of 3D Scenes

3DEgo: 3D Editing on the Go!

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 26569 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation

LatentEditor: Text Driven Local Editing of 3D Scenes

3DEgo: 3D Editing on the Go!

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 26569 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation