Abstract
Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing 3D scene editing schemes still have shortcomings that hinder interactive design. They typically adhere to fixed input patterns, which limits users’ flexibility in text input. Moreover, their editing capabilities are constrained by one or a few 2D visual models, and integrating these models into the 3D reconstruction process requires intricate pipeline design. To address these issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, centered on a large language model that accepts arbitrary textual input from users, interprets their intentions, and autonomously invokes the corresponding visual expert models. Furthermore, we design a scheme that uses Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design completely decouples 2D editing from 3D reconstruction, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse visual editing effects and that it possesses strong scene comprehension and multi-round dialogue capabilities. The source code can be made available at https://sk-fun.fun/CE3D.
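The decoupling described in the abstract can be illustrated with a toy sketch. This is not the CE3D implementation: the per-view lookup tables stand in for the learned Hash-Atlas mapping, keyword matching stands in for the large language model, and every name here (`render_view`, `pick_tool`, `edit_scene`, `TOOLS`) is hypothetical. The only point it shows is that an edit applied once to a shared 2D atlas reaches every view through the view-to-atlas mapping, so the 2D editing tool never needs to know about the 3D side.

```python
from typing import Callable, Dict, List, Tuple

Color = Tuple[int, int, int]
Atlas = List[List[Color]]             # 2D atlas image, indexed atlas[v][u]
UVMap = List[List[Tuple[int, int]]]   # per-pixel (u, v) atlas coordinate for one view


def render_view(atlas: Atlas, uv_map: UVMap) -> List[List[Color]]:
    """Rebuild one view by looking up each pixel's atlas coordinate."""
    return [[atlas[v][u] for (u, v) in row] for row in uv_map]


def invert(atlas: Atlas) -> Atlas:
    """Stand-in for a 2D visual expert model: invert every atlas colour."""
    return [[(255 - r, 255 - g, 255 - b) for (r, g, b) in row] for row in atlas]


def grayscale(atlas: Atlas) -> Atlas:
    """Second stand-in tool: replace each colour by its channel average."""
    return [[((r + g + b) // 3,) * 3 for (r, g, b) in row] for row in atlas]


# Hypothetical tool registry (assumption): a real system would let a
# language model choose among many visual models.
TOOLS: Dict[str, Callable[[Atlas], Atlas]] = {"invert": invert, "grayscale": grayscale}


def pick_tool(request: str) -> Callable[[Atlas], Atlas]:
    """Keyword matching as a placeholder for LLM-based intent parsing."""
    for name, tool in TOOLS.items():
        if name in request.lower():
            return tool
    raise ValueError(f"no tool matches request: {request!r}")


def edit_scene(atlas: Atlas, uv_maps: List[UVMap], request: str) -> List[List[List[Color]]]:
    """Edit the atlas once in 2D, then re-render every view from it."""
    edited = pick_tool(request)(atlas)
    return [render_view(edited, uv_map) for uv_map in uv_maps]


# Two tiny views that both observe atlas cell (u=1, v=0).
atlas = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (10, 20, 30)]]
view_a = [[(0, 0), (1, 0)]]
view_b = [[(1, 0), (1, 1)]]
views = edit_scene(atlas, [view_a, view_b], "Please invert the colours")
```

Because both toy views sample the same atlas cell, the edited colour at that cell is identical in both renderings, which is the kind of cross-view consistency an atlas representation is meant to provide.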
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants U20B2042 and 62076019.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, S. et al. (2025). Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15100. Springer, Cham. https://doi.org/10.1007/978-3-031-72946-1_12
DOI: https://doi.org/10.1007/978-3-031-72946-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72945-4
Online ISBN: 978-3-031-72946-1
eBook Packages: Computer Science, Computer Science (R0)