
Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Category-Agnostic Pose Estimation (CAPE) aims to detect the keypoints of an arbitrary, unseen category in images, given only a few annotated examples of that category. This is a challenging task, as the limited data for unseen categories makes it difficult for models to generalize effectively. To address this challenge, previous methods typically train models on a set of predefined base categories with extensive annotations. In this work, we propose to harness the rich knowledge in an off-the-shelf text-to-image diffusion model to address CAPE effectively, without training on carefully prepared base categories. To this end, we propose a Prompt Pose Matching (PPM) framework, which uses the text-to-image diffusion model to learn pseudo prompts corresponding to the keypoints in the provided few-shot examples. These learned pseudo prompts capture the semantic information of the keypoints and can then be used to locate keypoints of the same type in other images. We also design a Category-shared Prompt Training (CPT) scheme to further boost PPM's performance. Extensive experiments demonstrate the efficacy of our approach.
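The abstract only outlines the PPM idea at a high level; the details are in the full text. Purely as an illustrative sketch (not the authors' implementation), the PyTorch snippet below shows one way such prompt learning could look: pseudo prompt embeddings are optimized so that cross-attention maps from a frozen diffusion backbone peak at the annotated keypoints of the few-shot support images, and at test time the argmax of each prompt's attention map gives the predicted keypoint location. The `DiffusionCrossAttention` module, function names, and hyperparameters here are assumptions introduced for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F


class DiffusionCrossAttention(torch.nn.Module):
    """Hypothetical stand-in for a cross-attention readout from a frozen diffusion UNet.

    Given image features (B, C, H, W) and prompt embeddings (K, C), it returns
    per-prompt attention maps (B, K, H, W). A real implementation would hook the
    cross-attention layers of, e.g., Stable Diffusion.
    """

    def forward(self, feats: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        # Dot-product attention between spatial features and prompt embeddings.
        logits = torch.einsum("bchw,kc->bkhw", feats, prompts) / c ** 0.5
        return logits.flatten(2).softmax(dim=-1).view(b, -1, h, w)


def learn_pseudo_prompts(feats, keypoint_heatmaps, num_keypoints, dim,
                         steps=500, lr=1e-2):
    """Optimize one pseudo prompt per keypoint on the few-shot support set.

    feats:             (B, C, H, W) features of the support images (C must equal dim)
    keypoint_heatmaps: (B, K, H, W) Gaussian targets centred on the annotated keypoints
    """
    attn = DiffusionCrossAttention()                # frozen backbone; only prompts are trained
    prompts = torch.randn(num_keypoints, dim, requires_grad=True)
    optimizer = torch.optim.Adam([prompts], lr=lr)
    for _ in range(steps):
        maps = attn(feats, prompts)                 # (B, K, H, W) attention maps
        loss = F.mse_loss(maps, keypoint_heatmaps)  # pull attention onto the keypoints
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return prompts.detach()


def locate_keypoints(query_feats, prompts):
    """At test time, each prompt's attention argmax gives a keypoint location."""
    maps = DiffusionCrossAttention()(query_feats, prompts)  # (B, K, H, W)
    b, k, h, w = maps.shape
    idx = maps.flatten(2).argmax(dim=-1)                    # (B, K) flat spatial indices
    return torch.stack((idx % w, idx // w), dim=-1)         # (B, K, 2) as (x, y)
```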



Acknowledgements

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2022-01-027[T]) and the Ministry of Education, Singapore, under the AcRF Tier 2 Projects (MOE-T2EP20222-0009 and MOE-T2EP20123-0014).

Author information


Corresponding author

Correspondence to Jun Liu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 405 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Peng, D., Zhang, Z., Hu, P., Ke, Q., Yau, D.K.Y., Liu, J. (2025). Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_20


  • DOI: https://doi.org/10.1007/978-3-031-72624-8_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72623-1

  • Online ISBN: 978-3-031-72624-8

  • eBook Packages: Computer Science, Computer Science (R0)
