Abstract
We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach is to use a 2D image generator or editor to guide the 3D editing process, obviating the need for 3D data. However, this process is often inefficient: it requires iteratively updating costly 3D representations, such as neural radiance fields, either through individual view edits or score distillation sampling, and it converges slowly because the guidance from 2D models is not multi-view consistent, forcing the optimization to aggregate contradictory information across views. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two stages. First, we modify a high-quality image editor, such as InstructPix2Pix, to be multi-view consistent. To do so, we propose a training-free approach that integrates cues from the 3D geometry of the underlying scene. Second, given a multi-view consistent sequence of edited images, we directly and efficiently optimize the 3D representation, which is based on 3D Gaussian Splatting. Because it avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits, such as enabling selective editing of parts of the scene.
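To make the two-stage recipe concrete, the sketch below is a minimal, hypothetical illustration of the second stage: once multi-view consistent edited images are available, the parameters of the 3D representation are fitted to them directly with a simple photometric loss, with no incremental per-view dataset updates. The `render` function is a toy stand-in (a fixed linear mixing of per-Gaussian colors), not a real 3D Gaussian Splatting rasterizer, and the target images are random placeholders; in practice both would come from a differentiable 3DGS renderer and from the multi-view consistent editor of the first stage.

```python
import torch

# Toy stand-in for a differentiable 3D Gaussian Splatting renderer
# (hypothetical): rendering view v is a fixed linear combination of
# per-Gaussian colors. A real pipeline would use a 3DGS rasterizer.
N_GAUSSIANS, N_VIEWS, H, W = 1024, 8, 32, 32
proj = torch.rand(N_VIEWS, H * W, N_GAUSSIANS)   # fixed "visibility" weights
proj = proj / proj.sum(dim=-1, keepdim=True)     # each pixel row sums to 1

def render(colors, view):
    """Render view `view` from per-Gaussian RGB colors of shape (N_GAUSSIANS, 3)."""
    return (proj[view] @ colors).reshape(H, W, 3)

# Placeholder edited targets, standing in for the multi-view consistent
# output of the (modified) 2D editor in stage one.
targets = torch.rand(N_VIEWS, H, W, 3)

# Stage two: directly optimize the 3D representation against all edited
# views at once, rather than iterating edit-and-update per view.
colors = torch.nn.Parameter(torch.rand(N_GAUSSIANS, 3))
opt = torch.optim.Adam([colors], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    loss = sum(torch.nn.functional.l1_loss(render(colors, v), targets[v])
               for v in range(N_VIEWS))
    loss.backward()
    opt.step()
```

The structure of the loop is the point: all edited views supervise a single direct optimization, which is what avoids the slow aggregation of inconsistent per-view edits that the abstract describes.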
References
Bao, C., et al.: SINE: semantic-driven image-based NeRF editing with prior-guided editing field. In: CVPR (2023)
Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2LIVE: text-driven layered image and video editing. In: ECCV (2022)
Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-NeRF 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
Cen, J., et al.: Segment any 3D Gaussians. arXiv:2312.00860 (2023)
Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2Video: video editing using image diffusion. In: ICCV, pp. 23206–23217 (2023)
Chan, E.R., et al.: Efficient geometry-aware 3D generative adversarial networks. In: CVPR (2022)
Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models. In: SIGGRAPH (2023)
Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: TensoRF: tensorial radiance fields. In: ECCV (2022)
Chen, J., Lyu, J., Wang, Y.: NeuralEditor: editing neural radiance fields via manipulating point clouds. In: CVPR (2023)
Chen, M., Laina, I., Vedaldi, A.: Training-free layout control with cross-attention guidance. In: WACV (2023)
Chen, M., Xie, J., Laina, I., Vedaldi, A.: SHAP-EDITOR: instruction-guided latent 3D editing in seconds. In: CVPR (2024)
Chen, Y., et al.: GaussianEditor: swift and controllable 3D editing with Gaussian splatting. In: CVPR (2024)
Cheng, X., et al.: Progressive3D: progressively local editing for text-to-3D content creation with complex semantic prompts. In: ICLR (2024)
Chiang, P.Z., Tsai, M.S., Tseng, H.Y., Lai, W.S., Chiu, W.C.: Stylizing 3D scene via implicit representation and hypernetwork. In: WACV (2022)
Dong, J., Wang, Y.: ViCA-NeRF: view-consistency-aware 3D editing of neural radiance fields. In: NeurIPS (2024)
Epstein, D., Jabri, A., Poole, B., Efros, A.A., Holynski, A.: Diffusion self-guidance for controllable image generation. In: NeurIPS (2023)
Fang, J., Wang, J., Zhang, X., Xie, L., Tian, Q.: GaussianEditor: editing 3D Gaussians delicately with text instructions. In: CVPR (2024)
Gal, R., et al.: An image is worth one word: personalizing text-to-image generation using textual inversion. In: ICLR (2023)
Gao, W., Aigerman, N., Groueix, T., Kim, V., Hanocka, R.: TextDeformer: geometry manipulation using text guidance. In: SIGGRAPH (2023)
Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: TokenFlow: consistent diffusion features for consistent video editing. In: ICLR (2024)
Gong, B., Wang, Y., Han, X., Dou, Q.: RecolorNeRF: layer decomposed radiance field for efficient color editing of 3D scenes. In: ACM MM (2023)
Gordon, O., Avrahami, O., Lischinski, D.: Blended-NeRF: zero-shot object generation and blending in existing neural radiance fields. In: ICCV (2023)
Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: dense human pose estimation in the wild. In: CVPR (2018)
Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: ICCV (2023)
Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2000)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-or, D.: Prompt-to-prompt image editing with cross-attention control. In: ICLR (2023)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
Hong, F., Zhang, M., Pan, L., Cai, Z., Yang, L., Liu, Z.: AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars. ACM Trans. Graph. (TOG) 41(4), 1–19 (2022)
Huang, H., Tseng, H., Saini, S., Singh, M., Yang, M.: Learning to stylize novel views. In: ICCV (2021)
Huang, Y., He, Y., Yuan, Y., Lai, Y., Gao, L.: StylizedNeRF: consistent 3D scene stylization as stylized NeRF via 2D-3D mutual learning. In: CVPR (2022)
Huang, Z., Shi, Y., Bruce, N., Gong, M.: SealD-NeRF: interactive pixel-level editing for dynamic scenes by neural radiance fields. arXiv:2402.13510 (2024)
Jambon, C., Kerbl, B., Kopanas, G., Diolatzis, S., Leimkühler, T., Drettakis, G.: NeRFshop: interactive editing of neural radiance fields. Proc. ACM Comput. Graph. Interact. Tech. 6(1), 1–21 (2023)
Kamata, H., Sakuma, Y., Hayakawa, A., Ishii, M., Narihira, T.: Instruct 3D-to-3D: text instruction guided 3D-to-3D conversion. arXiv:2303.15780 (2023)
Kania, K., Yi, K.M., Kowalski, M., Trzciński, T., Tagliasacchi, A.: CoNeRF: controllable neural radiance fields. In: CVPR, pp. 18623–18632 (2022)
Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: CVPR (2023)
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139 (2023)
Khachatryan, L., et al.: Text2Video-zero: text-to-image diffusion models are zero-shot video generators. In: ICCV (2023)
Kirillov, A., et al.: Segment anything. In: CVPR (2023)
Kobayashi, S., Matsumoto, E., Sitzmann, V.: Decomposing NeRF for editing via feature field distillation. NeurIPS 35, 23311–23330 (2022)
Kuang, Z., Luan, F., Bi, S., Shu, Z., Wetzstein, G., Sunkavalli, K.: PaletteNeRF: palette-based appearance editing of neural radiance fields. In: CVPR (2023)
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: CVPR (2023)
Lazova, V., Guzov, V., Olszewski, K., Tulyakov, S., Pons-Moll, G.: Control-NeRF: editable feature volumes for scene rendering and manipulation. In: WACV (2023)
Lee, J.H., Kim, D.S.: ICE-NeRF: interactive color editing of NeRFs via decomposition-aware weight optimization. In: ICCV, pp. 3491–3501 (2023)
Lei, J., Zhang, Y., Jia, K., et al.: Tango: text-driven photorealistic and robust 3D stylization via lighting decomposition. NeurIPS 35, 30923–30936 (2022)
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: ICLR (2022)
Li, G., Zheng, H., Wang, C., Li, C., Zheng, C., Tao, D.: 3DDesigner: towards photorealistic 3D object generation and editing with text-guided diffusion models. arXiv:2211.14108 (2022)
Li, L.H., et al.: Grounded language-image pre-training. In: CVPR, pp. 10965–10975 (2022)
Li, Y., et al.: FocalDreamer: text-driven 3D editing via focal-fusion assembly. In: AAAI (2024)
Li, Y., et al.: GLIGEN: open-set grounded text-to-image generation. In: CVPR (2023)
Lin, Y., et al.: CompoNeRF: text-guided multi-object compositional NeRF with editable 3D scene layout. arXiv:2303.13843 (2023)
Liu, H., Shen, I., Chen, B.: NeRF-In: free-form NeRF inpainting with RGB-D priors. arXiv:2206.04901 (2022)
Liu, S., Zhang, X., Zhang, Z., Zhang, R., Zhu, J., Russell, B.: Editing conditional radiance fields. In: ICCV (2021)
Melas-Kyriazi, L., et al.: IM-3D: iterative multiview diffusion and reconstruction for high-quality 3D generation. In: ICML (2024)
Melas-Kyriazi, L., Rupprecht, C., Laina, I., Vedaldi, A.: RealFusion: 360° reconstruction of any object from a single image. In: CVPR (2023)
Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: ICLR (2022)
Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2Mesh: text-driven neural stylization for meshes. In: CVPR (2022)
Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: SKED: sketch-guided text-based 3D editing. In: ICCV (2023)
Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph. (TOG) 38, 1–14 (2019)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
Mirzaei, A., et al.: Watch your steps: local image and scene editing by text instructions. arXiv:2308.08947 (2023)
Mirzaei, A., et al.: SPIn-NeRF: multiview segmentation and perceptual inpainting with neural radiance fields. In: CVPR, pp. 20669–20679 (2023)
Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inversion for editing real images using guided diffusion models. In: CVPR, pp. 6038–6047 (2023)
Nguyen-Phuoc, T., Liu, F., Xiao, L.: SNeRF: stylized neural implicit representations for 3D scenes. ACM Trans. Graph. (TOG) 41(4), 1–11 (2022)
Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741 (2021)
Park, J., Kwon, G., Ye, J.C.: ED-NeRF: efficient text-guided editing of 3D scene using latent space NeRF. In: ICLR (2024)
Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: SIGGRAPH (2023)
Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv:2307.01952 (2023)
Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: text-to-3D using 2D diffusion. In: ICLR (2023)
Qi, C., et al.: FateZero: fusing attentions for zero-shot text-based video editing. In: ICCV (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, vol. 139, pp. 8748–8763 (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
Sella, E., Fiebelman, G., Hedman, P., Averbuch-Elor, H.: Vox-E: text-guided voxel editing of 3D objects. In: ICCV (2023)
Shi, Y., Xue, C., Pan, J., Zhang, W., Tan, V.Y., Bai, S.: DragDiffusion: harnessing diffusion models for interactive point-based image editing. arXiv:2306.14435 (2023)
Song, H., Choi, S., Do, H., Lee, C., Kim, T.: Blending-NeRF: text-driven localized editing in neural radiance fields. In: ICCV, pp. 14383–14393 (2023)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Sun, C., Sun, M., Chen, H.: Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: CVPR (2022)
Sun, C., Liu, Y., Han, J., Gould, S.: NeRFEditor: differentiable style decomposition for full 3D scene editing. In: WACV (2022)
Teng, Y., Xie, E., Wu, Y., Han, H., Li, Z., Liu, X.: Drag-a-video: non-rigid video editing with point-based interaction. arXiv:2312.02936 (2023)
Tretschk, E., Golyanik, V., Zollhöfer, M., Bozic, A., Lassner, C., Theobalt, C.: SceNeRFlow: time-consistent reconstruction of general dynamic scenes. arXiv:2308.08258 (2023)
Tschernezki, V., Laina, I., Larlus, D., Vedaldi, A.: Neural feature fusion fields: 3D distillation of self-supervised 2D image representations. arXiv:2209.03494 (2022)
Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: CVPR (2023)
Vargas, F., Grathwohl, W.S., Doucet, A.: Denoising diffusion samplers. In: ICLR (2023)
Wang, B., Dutt, N.S., Mitra, N.J.: ProteusNeRF: fast lightweight NeRF editing using 3D-aware image context. arXiv:2310.09965 (2023)
Wang, C., Chai, M., He, M., Chen, D., Liao, J.: CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In: CVPR (2022)
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: NeRF-Art: text-driven neural radiance fields stylization. arXiv:2212.08070 (2022)
Wang, D., Zhang, T., Abboud, A., Süsstrunk, S.: InpaintNeRF360: text-guided 3D inpainting on unbounded neural radiance fields. arXiv:2305.15094 (2023)
Wang, X., et al.: Seal-3D: interactive pixel-level editing for neural radiance fields. In: ICCV (2023)
Weder, S., et al.: Removing objects from neural radiance fields. In: CVPR, pp. 16528–16538 (2023)
Wu, J.Z., et al.: Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation. In: ICCV, pp. 7623–7633 (2023)
Xu, J., Wang, X., Cao, Y., Cheng, W., Shan, Y., Gao, S.: InstructP2P: learning to edit 3D point clouds with text instructions. arXiv:2306.07154 (2023)
Xu, S., Li, L., Shen, L., Lian, Z.: DeSRF: deformable stylized radiance field. In: CVPR, pp. 709–718 (2023)
Xu, T., Harada, T.: Deforming radiance fields with cages. In: ECCV (2022)
Yang, B., et al.: NeuMesh: learning disentangled neural mesh-based implicit field for geometry and texture editing. In: ECCV (2022)
Yang, B., et al.: Learning object-compositional neural radiance field for editable scene rendering. In: ICCV (2021)
Yu, L., Xiang, W., Han, K.: Edit-DiffNeRF: editing 3D neural radiance fields using 2D diffusion model. arXiv:2306.09551 (2023)
Yuan, Y., Sun, Y., Lai, Y., Ma, Y., Jia, R., Gao, L.: NeRF-editing: geometry editing of neural radiance fields. In: CVPR (2022)
Zhang, H., Feng, Y., Kulits, P., Wen, Y., Thies, J., Black, M.J.: Text-guided generation and editing of compositional 3D avatars. In: 2024 International Conference on 3D Vision (3DV) (2024)
Zhang, J., et al.: Editable free-viewpoint video using a layered neural representation. ACM Trans. Graph. 40(4), 1–18 (2021)
Zhang, K., et al.: ARF: artistic radiance fields. In: ECCV (2022)
Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: a manually annotated dataset for instruction-guided image editing. In: NeurIPS (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV (2023)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR, pp. 586–595 (2018)
Zhang, S., et al.: HIVE: harnessing human feedback for instructional visual editing. In: CVPR (2024)
Zheng, C., Lin, W., Xu, F.: EditableNeRF: editing topologically varying neural radiance fields by key points. In: CVPR (2023)
Zhou, S., et al.: Feature 3DGS: supercharging 3D Gaussian splatting to enable distilled feature fields. In: CVPR, pp. 21676–21685 (2024)
Zhou, X., He, Y., Yu, F.R., Li, J., Li, Y.: Repaint-NeRF: NeRF editing via semantic masks and diffusion models. In: IJCAI (2023)
Zhuang, J., Wang, C., Lin, L., Liu, L., Li, G.: DreamEditor: text-driven 3D scene editing with neural fields. In: SIGGRAPH (2023)
Acknowledgements
This research is supported by ERC-CoG UNION 101001212. I. L. is also partially supported by the VisualAI EPSRC grant (EP/T028572/1).
Ethics
For further details on ethics, data protection, and copyright please see https://www.robots.ox.ac.uk/~vedaldi/research/union/ethics.html.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Chen, M., Laina, I., Vedaldi, A. (2025). DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_5
DOI: https://doi.org/10.1007/978-3-031-72904-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1