Abstract
3D human pose estimation from monocular images has achieved great success thanks to sophisticated deep network architectures and large 3D human pose datasets. However, it remains an open problem when such datasets are unavailable. Estimating 3D human poses from monocular images is an ill-posed inverse problem. In this work, we propose a novel self-supervised method that effectively trains a 3D human pose estimation network without any extra 3D pose annotations. Unlike the commonly used GAN-based techniques, our method overcomes the projection ambiguity problem by fully disentangling the camera viewpoint information from the 3D human shape. Specifically, we design a factorization network that predicts the coefficients of the canonical 3D human pose and the camera viewpoint in two separate channels. Here, we represent the canonical 3D human pose as a combination of pose bases from a dictionary. To guarantee consistent factorization, we design a simple yet effective loss function that takes advantage of multi-view information. In addition, to generate robust canonical reconstructions from the 3D pose coefficients, we exploit the underlying 3D geometry of human poses to learn a novel hierarchical dictionary from 2D poses. The hierarchical dictionary has stronger 3D pose expressiveness than the traditional single-level dictionary. We comprehensively evaluate the proposed method on two public 3D human pose datasets, Human3.6M and MPI-INF-3DHP. The experimental results show that our method effectively disentangles 3D human shapes from camera viewpoints and reconstructs 3D human poses accurately. Moreover, our method achieves state-of-the-art results compared with recent weakly/self-supervised methods.
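The factorization idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the flat (single-level) dictionary, the orthographic camera, the rotation about a single axis, and the coefficient-matching loss are all assumptions made for illustration, and every function name below is hypothetical. The abstract does not specify the actual network, the hierarchical dictionary structure, or the loss.

```python
import numpy as np

def canonical_pose(coeffs, dictionary):
    """Combine K basis poses linearly; dictionary has shape (K, J, 3)."""
    return np.tensordot(coeffs, dictionary, axes=1)  # result shape (J, 3)

def rotation_about_y(angle):
    """Camera viewpoint modeled as a rotation about the vertical axis (assumption)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project(pose_3d, rotation):
    """Rotate the canonical pose into the camera frame and drop depth
    (orthographic projection, an assumption)."""
    return (pose_3d @ rotation.T)[:, :2]

def multiview_consistency(coeffs_a, coeffs_b):
    """Illustrative loss: two views of the same instant should yield
    the same canonical pose coefficients."""
    return float(np.mean((coeffs_a - coeffs_b) ** 2))

rng = np.random.default_rng(0)
K, J = 4, 17                              # basis size and joint count (illustrative)
dictionary = rng.normal(size=(K, J, 3))   # random stand-in for a learned dictionary
coeffs = rng.normal(size=K)               # stand-in for predicted pose coefficients

pose = canonical_pose(coeffs, dictionary)
view_a = project(pose, rotation_about_y(0.0))
view_b = project(pose, rotation_about_y(np.pi / 2))
loss = multiview_consistency(coeffs, coeffs)
```

In this sketch the same canonical pose produces different 2D observations under different viewpoints, while the coefficients stay shared across views; keeping those two factors separate is the property that a consistency loss of this kind is meant to encourage.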




Acknowledgements
This work is supported by the Beijing Natural Science Foundation (Nos. 4222037, L181010) and the National Natural Science Foundation of China (No. 61972035).
Ethics declarations
Conflict of Interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Kan Li and Yang Li contributed equally to this work.
Cite this article
Ma, Z., Li, K. & Li, Y. Self-supervised method for 3D human pose estimation with consistent shape and viewpoint factorization. Appl Intell 53, 3864–3876 (2023). https://doi.org/10.1007/s10489-022-03714-x
DOI: https://doi.org/10.1007/s10489-022-03714-x