Abstract
We introduce a methodology for human image animation that leverages a 3D human parametric model within a latent diffusion framework to improve the shape alignment and motion guidance of current human generative techniques. The method adopts the SMPL (Skinned Multi-Person Linear) model as the 3D human parametric model, establishing a unified representation of body shape and pose that accurately captures intricate human geometry and motion from source videos. Specifically, rendered depth images, normal maps, and semantic maps obtained from SMPL sequences, together with skeleton-based motion guidance, enrich the conditions of the latent diffusion model with comprehensive 3D shape and detailed pose attributes. A multi-layer motion fusion module with self-attention is employed to fuse the shape and motion latent representations in the spatial domain. Because the motion guidance is expressed through the 3D parametric model, we can perform parametric shape alignment of the human body between the reference image and the source video motion. Experimental evaluations on benchmark datasets demonstrate the method's superior ability to generate high-quality human animations that accurately capture pose and shape variations. Our approach also exhibits strong generalization on the proposed in-the-wild dataset.
S. Zhu and J. L. Chen contributed equally to this work.
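As a minimal sketch of the parametric shape alignment described above, the snippet below recombines the body shape coefficients (betas) fitted to a reference image with the per-frame pose of a source video, producing identity-preserving SMPL meshes from which depth, normal, and semantic guidance maps could then be rendered. It assumes the smplx Python package and that SMPL parameters have already been fitted by an off-the-shelf estimator; the function name and tensor shapes are illustrative, not the paper's actual interface.

```python
# Sketch of parametric shape alignment, assuming SMPL parameters were
# fitted beforehand (e.g. with an HMR-style estimator). Requires the
# smplx package and the SMPL model files from the SMPL project page.
import torch
import smplx

def align_shape_to_motion(smpl_model, ref_betas, drive_poses, drive_orients):
    """Combine the reference identity's shape with the driving video's
    per-frame pose, yielding one shape-aligned mesh per frame.

    ref_betas:     (1, 10)  shape coefficients fitted to the reference image
    drive_poses:   (T, 69)  per-frame SMPL body pose (axis-angle) from video
    drive_orients: (T, 3)   per-frame global orientation
    """
    T = drive_poses.shape[0]
    betas = ref_betas.expand(T, -1)          # same identity in every frame
    out = smpl_model(betas=betas,
                     body_pose=drive_poses,
                     global_orient=drive_orients)
    return out.vertices                      # (T, 6890, 3) aligned meshes

# Usage (model files are obtained separately):
# smpl = smplx.create('path/to/models', model_type='smpl',
#                     batch_size=num_frames)
# verts = align_shape_to_motion(smpl, ref_betas, drive_poses, drive_orients)
```

The multi-layer motion fusion can be sketched similarly: below is one way a self-attention layer might fuse the per-condition guidance latents in the spatial domain. The module name, the simple summation of per-condition latents, and all hyperparameters are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GuidanceFusion(nn.Module):
    """Illustrative self-attention fusion over guidance latents, assuming
    each guidance type (depth, normal, semantic, skeleton) was already
    encoded to a (B, C, H, W) latent by its own lightweight encoder."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)

    def forward(self, latents: list[torch.Tensor]) -> torch.Tensor:
        # Merge the per-condition latents, then let self-attention mix
        # information across spatial positions.
        x = torch.stack(latents, dim=0).sum(dim=0)        # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.transpose(1, 2).reshape(b, c, h, w)
```

In practice such a fused latent would be injected into the denoising U-Net as an additional condition, in the spirit of ControlNet-style guidance.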
Acknowledgement
This work was supported by NSFC grant 62441204.
Electronic Supplementary Material
Supplementary material 1 (mp4, 84478 KB)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhu, S. et al. (2025). Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15113. Springer, Cham. https://doi.org/10.1007/978-3-031-73001-6_9
DOI: https://doi.org/10.1007/978-3-031-73001-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73000-9
Online ISBN: 978-3-031-73001-6
eBook Packages: Computer Science, Computer Science (R0)