Abstract
In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following flexible pose control. Built upon Stable Diffusion and ControlNet, we carefully design an inference process that produces high-quality videos without requiring ground-truth frames for the target poses. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequence to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation caused by discrepancies between the training-video poses and the inference poses, we perform simple latent editing through an affine transformation matrix computed from facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter outperforms baselines pre-trained on vast collections of videos under 8 commonly used metrics. Moreover, PoseCrafter can follow poses from different individuals or artificial edits while retaining the human identity of an open-domain training video. Our project page is available at https://ml-gsai.github.io/PoseCrafter-demo/.
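The latent-editing step is the most self-contained part of the pipeline described above, so here is a minimal runnable sketch of one plausible reading of it, not the authors' implementation: fit a least-squares affine matrix between matched facial/hand landmarks and warp the latent feature map with it. The function names, the SciPy-based warp, and the toy landmarks are illustrative assumptions.

import numpy as np
from scipy.ndimage import affine_transform

def fit_affine(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Least-squares 2x3 affine A with dst ~= A @ [src; 1] (points in (row, col))."""
    n = src_pts.shape[0]
    src_h = np.hstack([src_pts, np.ones((n, 1))])        # homogeneous coords, (n, 3)
    X, *_ = np.linalg.lstsq(src_h, dst_pts, rcond=None)  # (3, 2) solution
    return X.T                                           # (2, 3)

def warp_latent(latent: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Apply the affine map channel-wise to a (C, H, W) latent."""
    M = np.vstack([A, [0.0, 0.0, 1.0]])  # lift to 3x3 so we can invert it
    M_inv = np.linalg.inv(M)             # affine_transform expects output->input coords
    out = np.empty_like(latent)
    for c in range(latent.shape[0]):
        out[c] = affine_transform(latent[c], M_inv[:2, :2],
                                  offset=M_inv[:2, 2], order=1)
    return out

# Toy usage: five matched facial/hand landmarks on the latent grid (hypothetical).
train_lms = np.array([[10., 12.], [12., 20.], [18., 15.], [25., 8.], [25., 22.]])
infer_lms = train_lms * 1.1 + 2.0           # e.g. the inference pose is shifted/scaled
A = fit_affine(infer_lms, train_lms)        # map inference coords -> training coords
latent = np.random.randn(4, 32, 32).astype(np.float32)
edited = warp_latent(latent, A)             # (4, 32, 32)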
Y. Zhong and M. Zhao contributed equally.
Notes
1. To be compatible with key-frame attention, we place \(\textbf{q}_r\) as the key frame at position \(k\); in our experiments, we set \(k=1\) by default (see the sketch after these notes).
2. Disco additionally collects 250 internal TikTok-style videos for training.
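As a concrete illustration of note 1, the following is a minimal sketch of temporal attention with a designated key frame, written under the assumption that every frame's tokens attend to the key frame's tokens; this is one plausible single-head reading, not the paper's exact module.

import numpy as np

def keyframe_attention(x: np.ndarray, k: int = 1) -> np.ndarray:
    """x: (F, N, D) tokens for F frames; every frame attends to frame k's tokens.
    Note: k is a 0-based array index here; the note's k = 1 follows the paper's
    own indexing convention."""
    f, n, d = x.shape
    q = x                                    # queries come from every frame
    kv = np.broadcast_to(x[k], (f, n, d))    # keys/values come from the key frame
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(d)   # (F, N, N) attention logits
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)         # softmax over the key frame's tokens
    return w @ kv                            # (F, N, D)

frames = np.random.randn(8, 16, 64).astype(np.float32)
out = keyframe_attention(frames, k=1)        # (8, 16, 64)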
References
Bhunia, A.K., et al.: Person image synthesis via denoising diffusion model. In: CVPR (2023)
Blattmann, A., et al.: Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Chai, W., Guo, X., Wang, G., Lu, Y.: StableVideo: text-driven consistency-aware diffusion video editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23040–23050 (2023)
Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5933–5942 (2019)
Chang, D., et al.: MagicDance: realistic human dance video generation with motions & facial expressions transfer. arXiv preprint arXiv:2311.12052 (2023)
Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graphics (TOG) 42(4), 1–10 (2023)
Chen, W., et al.: Control-A-Video: controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840 (2023)
Civitai: leosamsMoonfilm (2023). https://civitai.com/models/43977?modelVersionId=113623
Epstein, D., Jabri, A., Poole, B., Efros, A., Holynski, A.: Diffusion self-guidance for controllable image generation. Adv. Neural. Inf. Process. Syst. 36 (2024)
Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356 (2023)
Girdhar, R., et al.: Emu Video: factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023)
Gu, X., Wen, C., Song, J., Gao, Y.: Seer: language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897 (2023)
Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: dense human pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7297–7306 (2018)
Guo, Y., et al.: AnimateDiff: animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
Ho, J., et al.: Imagen Video: high definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. arXiv preprint arXiv:2204.03458 (2022)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117 (2023)
Huchenlei: sd-webui-openpose-editor (2023). https://github.com/huchenlei/sd-webui-openpose-editor
Jabberi, M., Wali, A., Chaudhuri, B.B., Alimi, A.M.: 68 landmarks are efficient for 3D face alignment: what about more? 3D face alignment method applied to face recognition. Multimedia Tools Appl. 1–35 (2023)
Jafarian, Y., Park, H.S.: Learning high fidelity depths of dressed humans by watching social media dance videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12753–12762 (2021)
Khachatryan, L., et al.: Text2Video-Zero: text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439 (2023)
Liew, J.H., Yan, H., Zhang, J., Xu, Z., Feng, J.: MagicEdit: high-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749 (2023)
Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: video editing with cross-attention control. arXiv preprint arXiv:2303.04761 (2023)
Ma, L., Gao, T., Jiang, H., Shen, H., Huang, K.: WaveIPT: joint attention and flow alignment in the wavelet domain for pose transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7215–7225 (2023)
Mou, C., et al.: T2I-Adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R.: Conditional image-to-video generation with latent flow diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18444–18455 (2023)
Nie, S., Guo, H.A., Lu, C., Zhou, Y., Zheng, C., Li, C.: The blessing of randomness: SDE beats ODE in general diffusion-based image editing. arXiv preprint arXiv:2311.01410 (2023)
Qi, C., et al.: FateZero: fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535 (2023)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–22510 (2023)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst. 35, 36479–36494 (2022)
Shin, C., Kim, H., Lee, C.H., Lee, S.G., Yoon, S.: Edit-a-video: single video editing with object-aware consistency. arXiv preprint arXiv:2303.07945 (2023)
Siarohin, A., Woodford, O.J., Ren, J., Chai, M., Tulyakov, S.: Motion representations for articulated animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13653–13662 (2021)
Singer, U., et al.: Make-a-video: text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Voleti, V., Jolicoeur-Martineau, A., Pal, C.: MCVD: masked conditional video diffusion for prediction, generation, and interpolation. Adv. Neural. Inf. Process. Syst. 35, 23371–23385 (2022)
Wang, T., et al.: Disco: disentangled control for referring human dance generation in real world. arXiv preprint arXiv:2307.00040 (2023)
Wang, W., et al.: Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599 (2023)
Wu, J.Z., et al.: Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633 (2023)
Xing, Z., et al.: A survey on video diffusion models. arXiv preprint arXiv:2310.10647 (2023)
Xu, Z., et al.: MagicAnimate: temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498 (2023)
Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4210–4220 (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q.: ControlVideo: training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077 (2023)
Zhao, J., Zhang, H.: Thin-plate spline motion model for image animation. In: CVPR (2022)
Zhao, M., Bao, F., Li, C., Zhu, J.: EGSDE: unpaired image-to-image translation via energy-guided stochastic differential equations. Adv. Neural. Inf. Process. Syst. 35, 3609–3623 (2022)
Zhao, M., Wang, R., Bao, F., Li, C., Zhu, J.: ControlVideo: adding conditional control for one shot text-to-video editing. arXiv preprint arXiv:2305.17098 (2023)
Zhou, X., Yin, M., Chen, X., Sun, L., Gao, C., Li, Q.: Cross attention based style distribution for controllable person image synthesis. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13675, pp. 161–178. Springer, Cham (2022)
Acknowledgements
This work was supported by NSF of China (No. 62076145), the Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China, the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 22XNKJ13), and the Huawei Group Research Fund. C. Li was also sponsored by the Beijing Nova Program (No. 20220484044). The work was partially done at the Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhong, Y., Zhao, M., You, Z., Yu, X., Zhang, C., Li, C. (2025). PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15102. Springer, Cham. https://doi.org/10.1007/978-3-031-72784-9_14
DOI: https://doi.org/10.1007/978-3-031-72784-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72783-2
Online ISBN: 978-3-031-72784-9
eBook Packages: Computer Science, Computer Science (R0)