Abstract
Estimating global human motion from moving cameras is challenging because human and camera motions are entangled. To mitigate this ambiguity, existing methods leverage learned human motion priors, which, however, often yield oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss that alleviates scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms state-of-the-art methods in both global human motion estimation and camera motion estimation. As an illustrative example, COIN achieves a 33% lower world joint position error (W-MPJPE) than the state-of-the-art method on the RICH dataset.
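To make the reported metric concrete, the following is a minimal sketch of a mean per-joint position error computed directly in world coordinates, together with the relative improvement that a figure such as "33%" expresses. The function names `world_mpjpe` and `relative_improvement`, the data layout, and all numbers are illustrative assumptions, not the authors' evaluation code. Benchmark protocols may additionally align the predicted and ground-truth trajectories before measuring; this sketch omits any such step.

```python
import math
from typing import Sequence

Joint = Sequence[float]   # (x, y, z) position in world coordinates
Frame = Sequence[Joint]   # all joints of one person in one frame


def world_mpjpe(pred: Sequence[Frame], gt: Sequence[Frame]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints.

    Hypothetical helper: no trajectory alignment is applied, so errors in
    global translation count fully toward the result.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        if len(pred_frame) != len(gt_frame):
            raise ValueError("frames must have the same number of joints")
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # per-joint Euclidean error
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Toy two-frame, two-joint sequence (made-up values, not data from the paper).
gt = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
      [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]]
pred = [[(0.0, 0.0, 0.3), (0.0, 1.0, 0.3)],
        [(1.4, 0.0, 0.3), (1.4, 1.0, 0.3)]]

error = world_mpjpe(pred, gt)
# Toy error values chosen only to illustrate the percentage arithmetic.
gain = relative_improvement(150.0, 100.5)
print(f"world MPJPE: {error:.3f}, relative improvement: {gain:.0%}")
```

Because the error is measured in the world frame, a drift in the estimated global trajectory (the second toy frame) raises the score even when the body pose itself is correct, which is why the metric is sensitive to how well human and camera motion are disentangled.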
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, J. et al. (2025). COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15074. Springer, Cham. https://doi.org/10.1007/978-3-031-72640-8_24
DOI: https://doi.org/10.1007/978-3-031-72640-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72639-2
Online ISBN: 978-3-031-72640-8
eBook Packages: Computer Science, Computer Science (R0)