SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos

Zeng, Ailing; Yang, Lei; Ju, Xuan; Li, Jiefeng; Wang, Jianyi; Xu, Qiang

doi:10.1007/978-3-031-20065-6_36

Ailing Zeng¹²,
Lei Yang¹³,
Xuan Ju¹²,
Jiefeng Li¹⁴,
Jianyi Wang¹⁵ &
…
Qiang Xu¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13665))

Included in the following conference series:

European Conference on Computer Vision

3149 Accesses
45 Citations

Abstract

When analyzing human motion videos, the output jitters from existing pose estimators are highly-unbalanced with varied estimation errors across frames. Most frames in a video are relatively easy to estimate and only suffer from slight jitters. In contrast, for rarely seen or occluded actions, the estimated positions of multiple joints largely deviate from the ground truth values for a consecutive sequence of frames, rendering significant jitters on them.

To tackle this problem, we propose to attach a dedicated temporal-only refinement network to existing pose estimators for jitter mitigation, named SmoothNet. Unlike existing learning-based solutions that employ spatio-temporal models to co-optimize per-frame precision and temporal smoothness at all the joints, SmoothNet models the natural smoothness characteristics in body movements by learning the long-range temporal relations of every joint without considering the noisy correlations among joints. With a simple yet effective motion-aware fully-connected network, SmoothNet improves the temporal smoothness of existing pose estimators significantly and enhances the estimation accuracy of those challenging frames as a side-effect. Moreover, as a temporal-only model, a unique advantage of SmoothNet is its strong transferability across various types of estimators, modalities, and datasets. Comprehensive experiments on five datasets with eleven popular backbone networks across 2D and 3D pose estimation and body recovery tasks demonstrate the efficacy of the proposed solution. Code is available at https://github.com/cure-lab/SmoothNet.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: JPY 3498; Price includes VAT (Japan)

eBook: JPY 12583; Price includes VAT (Japan)

Softcover Book: JPY 15729; Price includes VAT (Japan)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

DANet: dual association network for human pose estimation in video

Article 06 October 2023

VH3D-LSFM: Video-Based Human 3D Pose Estimation with Long-Term and Short-Term Pose Fusion Mechanism

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Article 22 January 2024

References

Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, pp. 3686–3693 (2014)
Google Scholar
Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv arXiv:abs/1803.01271 (2018)
Brownrigg, D.R.: The weighted median filter. Commun. ACM 27(8), 807–818 (1984)
Article Google Scholar
Casiez, G., Roussel, N., Vogel, D.: 1€ filter: a simple speed-based low-pass filter for noisy input in interactive systems. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2527–2530 (2012)
Google Scholar
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112 (2018)
Google Scholar
Choi, H., Moon, G., Chang, J.Y., Lee, K.M.: Beyond static features for temporally consistent 3D human pose and shape from a video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1964–1973 (2021)
Google Scholar
Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M.J.: Monocular expressive body regression through body-driven attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 20–40. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_2
Chapter Google Scholar
Coskun, H., Achilles, F., DiPietro, R.S., Navab, N., Tombari, F.: Long short-term memory kalman filters: recurrent neural estimators for pose regularization. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5525–5533 (2017)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Fischman, M.G.: Programming time as a function of number of movement parts and changes in movement direction. J. Mot. Behav. 16(4), 405–423 (1984)
Article Google Scholar
Gauss, J.F., Brandin, C., Heberle, A., Löwe, W.: Smoothing skeleton avatar visualizations using signal processing technology. SN Comput. Sci. 2(6), 1–17 (2021)
Article Google Scholar
Hunter, J.S.: The exponentially weighted moving average. J. Qual. Technol. 18(4), 203–210 (1986)
Article Google Scholar
Hyndman, R.J.: Moving averages (2011)
Google Scholar
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
Google Scholar
Jiang, T., Camgoz, N.C., Bowden, R.: Skeletor: skeletal transformers for robust body-pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3394–3402 (2021)
Google Scholar
Joo, H., Neverova, N., Vedaldi, A.: Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3D human pose estimation. In: 2021 International Conference on 3D Vision (3DV), pp. 42–52. IEEE (2021)
Google Scholar
Kalman, R.E.: A new approach to linear filtering and prediction problems (1960)
Google Scholar
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122–7131 (2018)
Google Scholar
Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5614–5623 (2019)
Google Scholar
Kim, D.Y., Chang, J.Y.: Attention-based 3D human pose sequence refinement network. Sensors 21(13), 4572 (2021)
Article Google Scholar
Kocabas, M., Athanasiou, N., Black, M.J.: Vibe: video inference for human body pose and shape estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5253–5263 (2020)
Google Scholar
Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2252–2261 (2019)
Google Scholar
Lee, C.H., Lin, C.R., Chen, M.S.: Sliding-window filtering: an efficient algorithm for incremental mining. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 263–270 (2001)
Google Scholar
Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: ICCV (2021)
Google Scholar
Li, R., Yang, S., Ross, D.A., Kanazawa, A.: AI choreographer: music conditioned 3D dance generation with AIST++. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13401–13412, October 2021
Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Luo, Z., Golestaneh, S.A., Kitani, K.M.: 3D human motion estimation via motion compression and refinement. In: Proceedings of the Asian Conference on Computer Vision (2020)
Google Scholar
von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 601–617 (2018)
Google Scholar
Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649 (2017)
Google Scholar
Mehta, D., et al.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 2017 International Conference on 3D Vision (3DV), pp. 506–516. IEEE (2017)
Google Scholar
Mehta, D., et al.: XNect: real-time multi-person 3D motion capture with a single RGB camera. ACM Trans. Graph. (TOG) 39(4), 82-1 (2020)
Google Scholar
Mehta, D., et al.: Single-shot multi-person 3D pose estimation from monocular RGB. In: 2018 International Conference on 3D Vision (3DV), pp. 120–130 (2018)
Google Scholar
Mehta, D., et al.: VNect: real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. (TOG) 36(4), 1–14 (2017)
Article Google Scholar
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Chapter Google Scholar
Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753–7762 (2019)
Google Scholar
Press, W.H., Teukolsky, S.A.: Savitzky-Golay smoothing filters. Comput. Phys. 4(6), 669–672 (1990)
Article Google Scholar
So, D., Le, Q., Liang, C.: The evolved transformer. In: International Conference on Machine Learning, pp. 5877–5886. PMLR (2019)
Google Scholar
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
Google Scholar
Tripathi, S., Ranade, S., Tyagi, A., Agrawal, A.: Posenet3d: learning temporally consistent 3D human pose via knowledge distillation. In: 2020 International Conference on 3D Vision (3DV), pp. 311–321. IEEE (2020)
Google Scholar
Tsuchida, S., Fukayama, S., Hamasaki, M., Goto, M.: AIST dance video database: multi-genre, multi-dancer, and multi-camera database for dance information processing. In: ISMIR, pp. 501–510 (2019)
Google Scholar
Van Loan, C.: Computational frameworks for the fast Fourier transform. SIAM (1992)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Véges, M., Lőrincz, A.: Temporal smoothing for 3D human pose estimation and localization for occluded people. In: Yang, H., Pasupa, K., Leung, A.C.-S., Kwok, J.T., Chan, J.H., King, I. (eds.) ICONIP 2020. LNCS, vol. 12532, pp. 557–568. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-63830-6_47
Chapter Google Scholar
Wan, Z., Li, Z., Tian, M., Liu, J., Yi, S., Li, H.: Encoder-decoder with multi-level attention for 3D human shape and pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13033–13042 (2021)
Google Scholar
Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3D pose estimation from videos. arXiv abs/2004.13985 (2020)
Google Scholar
Young, I.T., Van Vliet, L.J.: Recursive implementation of the gaussian filter. Signal Process. 44(2), 139–151 (1995)
Article Google Scholar
Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., Lin, S.: SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 507–523. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_30
Chapter Google Scholar
Zeng, A., Sun, X., Yang, L., Zhao, N., Liu, M., Xu, Q.: Learning skeletal graph neural networks for hard 3D pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (2021)
Google Scholar
Zhang, S., Zhang, Y., Bogo, F., Pollefeys, M., Tang, S.: Learning motion priors for 4D human body capture in 3D scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11343–11353 (2021)
Google Scholar
Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convolutional networks for 3D human pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3425–3435 (2019)
Google Scholar
Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3D human pose estimation with spatial and temporal transformers. arXiv preprint arXiv:2103.10455 (2021)
Zhou, H., et al.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of AAAI (2021)
Google Scholar
Zhou, K., Bhatnagar, B.L., Lenssen, J.E., Pons-Moll, G.: TOCH: spatio-temporal object correspondence to hand for motion refinement. arXiv, May 2022
Google Scholar
Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5738–5746 (2019)
Google Scholar

Download references

Acknowledgement

This work is supported in part by Shenzhen-Hong Kong-Macau Science and Technology Program (Category C) of Shenzhen Science Technology and Innovation Commission under Grant No. SGDX2020110309500101.

Author information

Authors and Affiliations

The Chinese University of Hong Kong, Shatin, Hong Kong
Ailing Zeng, Xuan Ju & Qiang Xu
Sensetime Group Ltd., Shatin, Hong Kong
Lei Yang
Shanghai Jiao Tong University, Shanghai, China
Jiefeng Li
Nanyang Technological University, Singapore, Singapore
Jianyi Wang

Authors

Ailing Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Lei Yang
View author publications
You can also search for this author in PubMed Google Scholar
Xuan Ju
View author publications
You can also search for this author in PubMed Google Scholar
Jiefeng Li
View author publications
You can also search for this author in PubMed Google Scholar
Jianyi Wang
View author publications
You can also search for this author in PubMed Google Scholar
Qiang Xu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qiang Xu .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 520 KB)

Supplementary material 2 (pdf 8014 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zeng, A., Yang, L., Ju, X., Li, J., Wang, J., Xu, Q. (2022). SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13665. Springer, Cham. https://doi.org/10.1007/978-3-031-20065-6_36

Download citation

DOI: https://doi.org/10.1007/978-3-031-20065-6_36
Published: 03 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20064-9
Online ISBN: 978-3-031-20065-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos