Abstract
Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints of an arbitrary unseen category in images, given only a few annotated examples of that category. This is a challenging task, as the limited data of unseen categories makes it difficult for models to generalize effectively. To address this challenge, previous methods typically train models on a set of predefined base categories with extensive annotations. In this work, we propose to harness the rich knowledge in off-the-shelf text-to-image diffusion models to effectively address CAPE, without training on carefully prepared base categories. To this end, we propose a Prompt Pose Matching (PPM) framework, which uses the text-to-image diffusion model to learn pseudo prompts corresponding to the keypoints in the provided few-shot examples. These learned pseudo prompts capture the semantic information of the keypoints and can then be used to locate keypoints of the same type in query images. We also design a Category-shared Prompt Training (CPT) scheme to further boost PPM's performance. Extensive experiments demonstrate the efficacy of our approach.
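To make the general recipe described above concrete, the sketch below illustrates the prompt-learning idea in a minimal form: a small set of prompt embeddings is optimized so that their attention maps peak at the annotated support keypoints, and the peaks of the same attention maps are then read off on a query image. This is not the authors' implementation; the frozen diffusion model is replaced by a tiny stand-in feature extractor, and all module names, hyperparameters, and the loss are illustrative assumptions.

```python
# Conceptual sketch only (not the authors' code): learn "pseudo prompt" embeddings
# whose attention maps peak at annotated support keypoints, then reuse them to
# localize the same keypoints in a query image. A tiny frozen conv layer stands in
# for the frozen text-to-image diffusion model's image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneStandIn(nn.Module):
    """Placeholder for frozen diffusion-model features (illustrative assumption)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=3, stride=4, padding=1)
        for p in self.parameters():
            p.requires_grad_(False)  # the backbone stays frozen; only prompts are learned

    def forward(self, img):                       # img: (B, 3, H, W)
        return self.conv(img)                     # (B, C, h, w) spatial features

def attention_maps(feats, prompts):
    """Dot-product attention between K learned prompt embeddings and spatial features."""
    B, C, h, w = feats.shape
    keys = feats.flatten(2).transpose(1, 2)       # (B, h*w, C)
    logits = keys @ prompts.t() / C ** 0.5        # (B, h*w, K)
    attn = logits.softmax(dim=1)                  # normalize over spatial locations
    return attn.transpose(1, 2).reshape(B, -1, h, w)   # (B, K, h, w)

def keypoints_to_heatmaps(kpts, h, w, sigma=1.5):
    """Gaussian target heatmaps from keypoint coordinates normalized to [0, 1]."""
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    maps = torch.stack([torch.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * (sigma / h) ** 2))
                        for x, y in kpts])        # (K, h, w)
    return maps / maps.sum(dim=(1, 2), keepdim=True)

# --- fit K pseudo prompts on one toy few-shot support example ---
K, dim = 5, 64
backbone = FrozenBackboneStandIn(dim)
prompts = nn.Parameter(torch.randn(K, dim) * 0.02)     # the only trainable parameters
optimizer = torch.optim.Adam([prompts], lr=1e-2)

support_img = torch.rand(1, 3, 128, 128)               # stand-in support image
support_kpts = torch.rand(K, 2)                        # stand-in (x, y) labels in [0, 1]

for step in range(200):
    feats = backbone(support_img)
    attn = attention_maps(feats, prompts)              # (1, K, h, w)
    target = keypoints_to_heatmaps(support_kpts, *feats.shape[-2:]).unsqueeze(0)
    loss = F.mse_loss(attn, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# --- inference: each prompt's attention peak gives the matching keypoint location ---
query_img = torch.rand(1, 3, 128, 128)
with torch.no_grad():
    attn = attention_maps(backbone(query_img), prompts)[0]    # (K, h, w)
    flat_idx = attn.flatten(1).argmax(dim=1)
    pred_xy = torch.stack([flat_idx % attn.shape[-1], flat_idx // attn.shape[-1]], dim=1)
print("predicted keypoint grid coordinates:", pred_xy.tolist())
```

In the paper's actual setting, the stand-in backbone would be the frozen UNet of a pretrained text-to-image diffusion model and the attention maps would be its cross-attention responses to the learned pseudo prompts; the abstract's Category-shared Prompt Training (CPT) scheme is not reflected in this sketch.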
Acknowledgements
This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2022-01-027[T]) and the Ministry of Education, Singapore, under the AcRF Tier 2 Projects (MOE-T2EP20222-0009 and MOE-T2EP20123-0014).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Peng, D., Zhang, Z., Hu, P., Ke, Q., Yau, D.K.Y., Liu, J. (2025). Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15071. Springer, Cham. https://doi.org/10.1007/978-3-031-72624-8_20
DOI: https://doi.org/10.1007/978-3-031-72624-8_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72623-1
Online ISBN: 978-3-031-72624-8