PoseVR: Structure-Aware Hybrid Full-Body Pose Estimation in Virtual Reality

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15041)

Abstract

Accurate full-body pose estimation plays a key role in enhancing the virtual reality interaction experience. Existing single-source solutions have limitations: sensor-based methods suffer from limited accuracy due to sensor sparsity, and vision-based methods suffer from pose distortion caused by camera perspective. This motivates us to turn to hybrid solutions to address these shortcomings. However, the accuracy and robustness of hybrid methods still need improvement because they do not consider structure constraints and global temporal redundancy. To solve these problems, we present PoseVR, a novel architecture that fuses 2D vision and 3D sensor information. To compensate for the shortcomings of single-source solutions, we propose a dual-branch fusion structure that eliminates the redundancy of global temporal information by integrating the continuity of local temporal information. Motivated by prior knowledge of human physiological structure and joint location, we introduce a novel coarse-to-fine endpoint space strategy that formulates the edge points of the body as prior information for accurately predicting full-body pose. Furthermore, a spatial loss function is employed for hierarchical prediction to achieve better accuracy. We evaluate PoseVR qualitatively and quantitatively. Experimental results show that PoseVR achieves state-of-the-art performance. Code is available at https://github.com/wwwpkol/PoseVR.
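
The abstract does not specify the form of the spatial loss, so the following is only a minimal sketch of one way a hierarchical (coarse-to-fine) joint-position loss could be written. It is not taken from the PoseVR repository. The joint layout, the endpoint indices in `ENDPOINT_IDS`, the function names, and the `endpoint_weight` parameter are all assumptions made for illustration. The sketch sums the mean per-joint Euclidean error over a small set of endpoint joints (coarse term) and over all joints (fine term).

```python
import math

# Hypothetical endpoint indices (assumption: not the skeleton used in the paper).
ENDPOINT_IDS = [0, 3]


def mean_joint_error(pred, target, joint_ids=None):
    """Mean Euclidean distance between predicted and target 3D joints.

    pred and target are sequences of (x, y, z) tuples of equal length.
    If joint_ids is None, every joint is included.
    """
    ids = range(len(pred)) if joint_ids is None else joint_ids
    dists = [math.dist(pred[i], target[i]) for i in ids]
    return sum(dists) / len(dists)


def hierarchical_spatial_loss(pred, target, endpoint_ids=ENDPOINT_IDS,
                              endpoint_weight=1.0):
    """Coarse term on endpoint joints plus fine term on the full body."""
    coarse = mean_joint_error(pred, target, endpoint_ids)
    fine = mean_joint_error(pred, target)
    return endpoint_weight * coarse + fine


if __name__ == "__main__":
    # Toy 4-joint pose: only joint 0 (an assumed endpoint) is mispredicted.
    target = [(0.0, 0.0, 0.0)] * 4
    pred = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(hierarchical_spatial_loss(pred, target))
```

In this toy form, an error on an endpoint joint is counted in both terms, so it is penalized more heavily than the same error on an interior joint. That is one plausible reading of using body endpoints as a prior, stated here as an assumption rather than as the authors' formulation.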

Acknowledgement

This work was supported by the National Key R&D Program of China under Grant No. 2021YFF0900501, the National Natural Science Foundation of China under Grant Nos. 62202461 and 61971383, and the Horizontal Research Project under Grant No. HG23002.

Author information

Corresponding author

Correspondence to Long Ye.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yang, Y., Zhang, S., Ye, L., Rao, N., Luo, X. (2025). PoseVR: Structure-Aware Hybrid Full-Body Pose Estimation in Virtual Reality. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_36

  • DOI: https://doi.org/10.1007/978-981-97-8795-1_36

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8794-4

  • Online ISBN: 978-981-97-8795-1

  • eBook Packages: Computer Science, Computer Science (R0)
