
PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15102)


Abstract

In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following the control of flexible poses. Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequences to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation resulting from discrepancies between poses of training videos and inference poses, we implement simple latent editing through an affine transformation matrix involving facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter achieves superior results to baselines pre-trained on a vast collection of videos under 8 commonly used metrics. Besides, PoseCrafter can follow poses from different individuals or artificial edits and simultaneously retain the human identity in an open-domain training video. Our project page is available at https://ml-gsai.github.io/PoseCrafter-demo/.

Y. Zhong and M. Zhao—Equal contribution.
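The latent-editing step described in the abstract, an affine transformation built from facial and hand landmarks, can be illustrated with a short sketch. The snippet below is a minimal NumPy illustration, not the authors' released code: it estimates a least-squares \(2 \times 3\) affine matrix from matching landmarks detected on the training frame and on the target pose. The function names, array shapes, and the closing usage comment are assumptions made for illustration.

```python
# Minimal sketch (illustrative, not the authors' implementation): estimate the
# affine matrix that maps face/hand landmarks of the training pose onto the
# matching landmarks of the inference pose, as used for the latent-editing step.
import numpy as np


def estimate_affine(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Least-squares 2x3 affine matrix M with M @ [x, y, 1]^T ~= dst point.

    src_pts, dst_pts: (N, 2) arrays of corresponding landmark coordinates,
    e.g. facial landmarks plus hand keypoints (hypothetical inputs).
    """
    n = src_pts.shape[0]
    homo = np.hstack([src_pts, np.ones((n, 1))])          # (N, 3) homogeneous coords
    m_t, *_ = np.linalg.lstsq(homo, dst_pts, rcond=None)  # (3, 2) solution
    return m_t.T                                          # (2, 3) affine matrix


def apply_affine(points: np.ndarray, mat: np.ndarray) -> np.ndarray:
    """Map (N, 2) points with the 2x3 affine matrix."""
    n = points.shape[0]
    return np.hstack([points, np.ones((n, 1))]) @ mat.T


# Hypothetical usage: rescale the matrix to latent resolution (e.g. divide the
# translation column by 8 for a Stable Diffusion VAE) and warp only the face and
# hand regions of the latent, leaving the rest of the latent untouched.
```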



Notes

  1. To be compatible with key-frame attention, we place \(\textbf{q}_r\) as the key frame at position \(k\); in our experiments, we set \(k=1\) by default (a toy illustration of this placement follows these notes).

  2. Disco additionally collects 250 internal TikTok-style videos for training.
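As a concrete illustration of note 1, the toy sketch below inserts the training pose at position \(k\) of the target pose sequence before sampling and discards the frame generated at that slot afterwards. Tensor shapes and function names are assumptions chosen for the example, not identifiers from the released code.

```python
# Toy PyTorch sketch (illustrative assumptions, not the authors' code): place the
# training pose at position k of the target pose sequence so the temporal
# attention module can treat it as the key frame, then drop that slot afterwards.
import torch


def insert_key_frame(target_poses: torch.Tensor, ref_pose: torch.Tensor,
                     k: int = 1) -> torch.Tensor:
    """target_poses: (F, C, H, W) pose maps; ref_pose: (C, H, W); k is 1-indexed."""
    return torch.cat([target_poses[: k - 1],
                      ref_pose.unsqueeze(0),
                      target_poses[k - 1:]], dim=0)


def drop_key_frame(frames: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Remove the frame generated at the reference slot after sampling."""
    return torch.cat([frames[: k - 1], frames[k:]], dim=0)
```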

References

  1. Bhunia, A.K., et al.: Person image synthesis via denoising diffusion model. In: CVPR (2023)


  2. Blattmann, A., et al.: Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  3. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)


  4. Chai, W., Guo, X., Wang, G., Lu, Y.: StableVideo: text-driven consistency-aware diffusion video editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23040–23050 (2023)


  5. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5933–5942 (2019)


  6. Chang, D., et al.: Magicdance: realistic human dance video generation with motions & facial expressions transfer. arXiv preprint arXiv:2311.12052 (2023)

  7. Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graphics (TOG) 42(4), 1–10 (2023)


  8. Chen, W., et al.: Control-a-video: controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840 (2023)

  9. Civitai: leosamsMoonfilm. https://civitai.com/models/43977?modelVersionId=113623 (2023)

  10. Epstein, D., Jabri, A., Poole, B., Efros, A., Holynski, A.: Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems 36 (2024)


  11. Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356 (2023)


  12. Girdhar, R., et al.: Emu video: factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023)

  13. Gu, X., Wen, C., Song, J., Gao, Y.: Seer: language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897 (2023)

  14. Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: dense human pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7297–7306 (2018)


  15. Guo, Y., et al.: AnimateDiff: animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  16. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  17. Ho, J., et al.: Imagen video: high definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

  18. Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. arXiv:2204.03458 (2022)

  19. Hu, E.J., et al.: Lora: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

  20. Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate anyone: consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117 (2023)

  21. Huchenlei: sd-webui-openpose-editor (2023). https://github.com/huchenlei/sd-webui-openpose-editor

  22. Jabberi, M., Wali, A., Chaudhuri, B.B., Alimi, A.M.: 68 landmarks are efficient for 3D face alignment: what about more? 3D face alignment method applied to face recognition. Multimedia Tools Appl. 1–35 (2023)


  23. Jafarian, Y., Park, H.S.: Learning high fidelity depths of dressed humans by watching social media dance videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12753–12762 (2021)


  24. Khachatryan, L., et al.: Text2video-zero: text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439 (2023)

  25. Liew, J.H., Yan, H., Zhang, J., Xu, Z., Feng, J.: MagicEdit: high-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749 (2023)

  26. Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: video editing with cross-attention control. arXiv preprint arXiv:2303.04761 (2023)

  27. Ma, L., Gao, T., Jiang, H., Shen, H., Huang, K.: WaveIPT: joint attention and flow alignment in the wavelet domain for pose transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7215–7225 (2023)


  28. Mou, C., et al.: T2i-adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)

  29. Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R.: Conditional image-to-video generation with latent flow diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18444–18455 (2023)


  30. Nie, S., Guo, H.A., Lu, C., Zhou, Y., Zheng, C., Li, C.: The blessing of randomness: SDE beats ode in general diffusion-based image editing. arXiv preprint arXiv:2311.01410 (2023)

  31. Qi, C., et al.: FateZero: fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535 (2023)

  32. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)

  33. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)


  34. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–22510 (2023)


  35. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst. 35, 36479–36494 (2022)


  36. Shin, C., Kim, H., Lee, C.H., Lee, S.G., Yoon, S.: Edit-a-video: single video editing with object-aware consistency. arXiv preprint arXiv:2303.07945 (2023)

  37. Siarohin, A., Woodford, O.J., Ren, J., Chai, M., Tulyakov, S.: Motion representations for articulated animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13653–13662 (2021)


  38. Singer, U., et al.: Make-a-video: text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)

  39. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  40. Voleti, V., Jolicoeur-Martineau, A., Pal, C.: MCVD-masked conditional video diffusion for prediction, generation, and interpolation. Adv. Neural. Inf. Process. Syst. 35, 23371–23385 (2022)


  41. Wang, T., et al.: Disco: disentangled control for referring human dance generation in real world. arXiv preprint arXiv:2307.00040 (2023)

  42. Wang, W., et al.: Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599 (2023)

  43. Wu, J.Z., et al.: Tune-a-video: one-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633 (2023)


  44. Xing, Z., et al.: A survey on video diffusion models. arXiv preprint arXiv:2310.10647 (2023)

  45. Xu, Z., et al.: Magicanimate: temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498 (2023)

  46. Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4210–4220 (2023)


  47. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)


  48. Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q.: ControlVideo: training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077 (2023)

  49. Zhao, J., Zhang, H.: Thin-plate spline motion model for image animation. In: CVPR (2022)


  50. Zhao, M., Bao, F., Li, C., Zhu, J.: Egsde: unpaired image-to-image translation via energy-guided stochastic differential equations. Adv. Neural. Inf. Process. Syst. 35, 3609–3623 (2022)


  51. Zhao, M., Wang, R., Bao, F., Li, C., Zhu, J.: ControlVideo: adding conditional control for one shot text-to-video editing. arXiv preprint arXiv:2305.17098 (2023)

  52. Zhou, X., Yin, M., Chen, X., Sun, L., Gao, C., Li, Q.: Cross attention based style distribution for controllable person image synthesis. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13675, pp. 161–178. Springer, Cham (2022)



Acknowledgements

This work was supported by NSF of China (No. 62076145), Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China, the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 22XNKJ13), and the Huawei Group Research Fund. C. Li was also sponsored by Beijing Nova Program (No. 20220484044). The work was partially done at the Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education.

Author information


Corresponding author

Correspondence to Chongxuan Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2969 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhong, Y., Zhao, M., You, Z., Yu, X., Zhang, C., Li, C. (2025). PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15102. Springer, Cham. https://doi.org/10.1007/978-3-031-72784-9_14


  • DOI: https://doi.org/10.1007/978-3-031-72784-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72783-2

  • Online ISBN: 978-3-031-72784-9

  • eBook Packages: Computer Science, Computer Science (R0)
