Abstract
Accurate full-body pose estimation is key to an immersive virtual reality interaction experience. Existing single-source solutions face inherent limitations: sensor-based methods suffer from poor accuracy due to sensor sparsity, while vision-based methods suffer from pose distortion caused by camera perspective. This motivates hybrid solutions that combine both sources. However, the accuracy and robustness of existing hybrid methods still fall short because they neglect structural constraints and global temporal redundancy. To address these problems, we present PoseVR, a novel architecture that fuses 2D vision and 3D sensor information. To compensate for the shortcomings of single-source solutions, we propose a dual-branch fusion structure that suppresses redundancy in global temporal information by integrating the continuity of local temporal information. Motivated by prior knowledge of human physiological structure and joint locations, we introduce a novel coarse-to-fine endpoint space strategy that formulates the edge points of the body as priors for accurately predicting the full-body pose. Furthermore, a spatial loss function is employed for hierarchical prediction to further improve accuracy. We evaluate PoseVR both qualitatively and quantitatively; experimental results show that it achieves state-of-the-art performance. Code is available at https://github.com/wwwpkol/PoseVR.
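The dual-branch idea in the abstract, combining the continuity of local temporal information with global temporal context, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the authors' implementation (see the repository above for that): the function names `local_branch`, `global_branch`, and `fuse`, the similarity-weighted average standing in for attention, and the fixed blend weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def local_branch(x, k=3):
    # Local temporal continuity: average each frame with its k-frame neighborhood.
    T, _ = x.shape
    out = np.empty_like(x)
    for t in range(T):
        lo, hi = max(0, t - k // 2), min(T, t + k // 2 + 1)
        out[t] = x[lo:hi].mean(axis=0)
    return out

def global_branch(x):
    # Global temporal context: similarity-weighted average over all frames,
    # a crude stand-in for self-attention over the whole sequence.
    sim = x @ x.T
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def fuse(x, alpha=0.5):
    # Dual-branch fusion: blend local continuity with global context.
    return alpha * local_branch(x) + (1 - alpha) * global_branch(x)

# Toy sequence: 8 frames of 6 joint coordinates.
seq = np.random.default_rng(0).normal(size=(8, 6))
fused = fuse(seq)
print(fused.shape)  # (8, 6)
```

In a trained model, the blend would typically be learned rather than fixed; the point here is only that the local branch smooths over adjacent frames while the global branch aggregates over the full window.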
Acknowledgement
This work was supported by the National Key R&D Program of China under Grant No. 2021YFF0900501, the National Natural Science Foundation of China under Grant Nos. 62202461 and 61971383, and the Horizontal Research Project under Grant No. HG23002.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Yang, Y., Zhang, S., Ye, L., Rao, N., Luo, X. (2025). PoseVR: Structure-Aware Hybrid Full-Body Pose Estimation in Virtual Reality. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_36
DOI: https://doi.org/10.1007/978-981-97-8795-1_36
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1
eBook Packages: Computer Science; Computer Science (R0)