Abstract
Accurate full-body pose estimation is key to an immersive virtual reality interaction experience. Existing single-source solutions face inherent limitations: sensor-based methods suffer from poor accuracy due to sensor sparsity, while vision-based methods suffer from pose distortion caused by camera perspective. This motivates hybrid solutions that combine both sources. However, the accuracy and robustness of existing hybrid methods still fall short because they neglect structural constraints and global temporal redundancy. To address these problems, we present PoseVR, a novel architecture that fuses 2D vision and 3D sensor information. To compensate for the shortcomings of single-source solutions, we propose a dual-branch fusion structure that suppresses redundancy in global temporal information by integrating the continuity of local temporal information. Motivated by prior knowledge of human physiological structure and joint locations, we introduce a novel coarse-to-fine endpoint space strategy that formulates the edge points of the body as priors for accurately predicting the full-body pose. Furthermore, a spatial loss function is employed for hierarchical prediction to further improve accuracy. We evaluate PoseVR both qualitatively and quantitatively; experimental results show that it achieves state-of-the-art performance. Code is available at https://github.com/wwwpkol/PoseVR.
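The dual-branch idea in the abstract, combining the continuity of local temporal information with global temporal context, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the authors' implementation (see the repository above for that): the function names `local_branch`, `global_branch`, and `fuse`, the similarity-weighted average standing in for attention, and the fixed blend weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def local_branch(x, k=3):
    # Local temporal continuity: average each frame with its k-frame neighborhood.
    T, _ = x.shape
    out = np.empty_like(x)
    for t in range(T):
        lo, hi = max(0, t - k // 2), min(T, t + k // 2 + 1)
        out[t] = x[lo:hi].mean(axis=0)
    return out

def global_branch(x):
    # Global temporal context: similarity-weighted average over all frames,
    # a crude stand-in for self-attention over the whole sequence.
    sim = x @ x.T
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def fuse(x, alpha=0.5):
    # Dual-branch fusion: blend local continuity with global context.
    return alpha * local_branch(x) + (1 - alpha) * global_branch(x)

# Toy sequence: 8 frames of 6 joint coordinates.
seq = np.random.default_rng(0).normal(size=(8, 6))
fused = fuse(seq)
print(fused.shape)  # (8, 6)
```

In a trained model, the blend would typically be learned rather than fixed; the point here is only that the local branch smooths over adjacent frames while the global branch aggregates over the full window.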
Acknowledgement
This work was supported by the National Key R&D Program of China under Grant No. 2021YFF0900501, the National Natural Science Foundation of China under Grant Nos. 62202461 and 61971383, and the Horizontal Research Project under Grant No. HG23002.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Yang, Y., Zhang, S., Ye, L., Rao, N., Luo, X. (2025). PoseVR: Structure-Aware Hybrid Full-Body Pose Estimation in Virtual Reality. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_36
DOI: https://doi.org/10.1007/978-981-97-8795-1_36
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1
eBook Packages: Computer Science; Computer Science (R0)