Abstract
Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but they are typically hindered by the need for large datasets of either pose-labelled real images or carefully tuned photorealistic simulators. This dependence can be avoided by using only geometric inputs such as depth images, which reduces the domain gap, but such approaches lack the semantic information that can be vital for pose estimation. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D semantic features onto object models to form 3D semantic point clouds. Building on this novel 3D representation, we further propose a self-supervision pipeline that matches the fused semantic point clouds against partial observations rendered from synthetic object models. The knowledge learned from synthetic data generalizes to observations of unseen objects in real scenes without any fine-tuning. We demonstrate this with a rich evaluation on the NOCS, Wild6D and SUN RGB-D benchmarks, showing superior performance over geometry-only and semantics-only baselines with significantly fewer training objects.
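The 3D semantic point cloud described above can be sketched as follows: a depth map is back-projected through the camera intrinsics, and per-pixel features from a foundation model are concatenated onto the resulting 3D points. This is a minimal illustrative sketch, not the paper's implementation; the function name, array shapes, and use of NumPy are assumptions.

```python
import numpy as np

def lift_semantic_point_cloud(depth, features, K):
    """Back-project a depth map into a 3D point cloud and attach
    per-pixel semantic features, yielding an (N, 3 + C) array.

    depth:    (H, W) depth in metres; zeros mark invalid pixels.
    features: (H, W, C) per-pixel semantic features (e.g. from a
              pre-trained foundation model, upsampled to image size).
    K:        (3, 3) camera intrinsic matrix.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    # Inverse pinhole projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=-1)            # (N, 3) geometry
    feats = features[valid]                          # (N, C) semantics
    return np.concatenate([points, feats], axis=-1)  # (N, 3 + C) fused
```

The same fusion applies on both sides of the pipeline: to rendered views of synthetic object models and to real depth observations, so that matching operates on a shared geometric-plus-semantic representation.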
References
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv abs/1512.03012 (2015). https://api.semanticscholar.org/CorpusID:2554264
Chen, D., Li, J., Xu, K.: Learning canonical shape space for category-level 6D object pose and size estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11970–11979 (2020). https://api.semanticscholar.org/CorpusID:210919925
Chen, K., Dou, Q.: SGPA: structure-guided prior adaptation for category-level 6D object pose estimation. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2773–2782 (2021)
Chen, W., Jia, X., Chang, H.J., Duan, J., Shen, L., Leonardis, A.: FS-Net: fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1581–1590 (2021)
Chen, X., Dong, Z., Song, J., Geiger, A., Hilliges, O.: Category level object pose estimation via neural analysis-by-synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 139–156. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_9
Deng, H., Birdal, T., Ilic, S.: PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 620–638. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_37
Deng, H., Birdal, T., Ilic, S.: PPFNet: global context aware local features for robust 3D point matching. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 195–205 (2018)
Di, Y., et al.: GPV-Pose: category-level object pose estimation via geometry-guided point-wise voting. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6781–6791 (2022)
Fan, Z., et al.: ACR-Pose: adversarial canonical representation reconstruction network for category level 6D object pose estimation. arXiv preprint arXiv:2111.10524 (2021)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 381–395 (1981). https://api.semanticscholar.org/CorpusID:972888
Gao, D., et al.: Polarimetric pose prediction. In: Avidan, S., Brostow, G., Cisse, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13669, pp. 735–752. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_43
Gao, G., Lauri, M., Wang, Y., Hu, X., Zhang, J., Frintrop, S.: 6D object pose regression via supervised learning on point clouds. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3643–3649. IEEE (2020)
Goodwin, W., Havoutis, I., Posner, I.: You only look at one: category-level object representations for pose estimation from a single example. arXiv preprint arXiv:2305.12626 (2023)
Goodwin, W., Vaze, S., Havoutis, I., Posner, I.: Zero-shot category-level object pose estimation. In: Avidan, S., Brostow, G., Cisse, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13699, pp. 516–532. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19842-7_30
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2961–2969 (2017)
He, Y., Huang, H., Fan, H., Chen, Q., Sun, J.: FFB6D: a full flow bidirectional fusion network for 6D pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3003–3013 (2021)
He, Y., Sun, W., Huang, H., Liu, J., Fan, H., Sun, J.: PVN3D: a deep point-wise 3D keypoints voting network for 6DoF pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11632–11641 (2020)
Hodan, T., Barath, D., Matas, J.: EPOS: estimating 6D pose of objects with symmetries. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11703–11712. IEEE, June 2020
Irshad, M.Z., Kollar, T., Laskey, M., Stone, K., Kira, Z.: CenterSnap: single-shot multi-object 3D shape reconstruction and categorical 6D pose and size estimation. In: International Conference on Robotics and Automation (ICRA), pp. 10632–10640. IEEE (2022)
Karnati, M., Seal, A., Yazidi, A., Krejcar, O.: LieNet: a deep convolution neural network framework for detecting deception. IEEE Trans. Cogn. Dev. Syst. 14(3), 971–984 (2021)
Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In: IEEE International Conference on Computer Vision (ICCV), pp. 1530–1538 (2017). https://api.semanticscholar.org/CorpusID:10655945
Labbé, Y., Carpentier, J., Aubry, M., Sivic, J.: CosyPose: consistent multi-view multi-object 6D pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 574–591. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_34
Lee, T., et al.: UDA-COPE: unsupervised domain adaptation for category-level object pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14891–14900 (2022)
Lee, T., et al.: TTA-COPE: test-time adaptation for category-level object pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21285–21295 (2023)
Li, X., et al.: Leveraging SE(3) equivariance for self-supervised category-level object pose estimation from point clouds. Adv. Neural. Inf. Process. Syst. 34, 15370–15381 (2021)
Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: DeepIM: deep iterative matching for 6D pose estimation. In: European Conference on Computer Vision (ECCV), pp. 683–698 (2018)
Li, Z., Wang, G., Ji, X.: CDPN: coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7678–7687 (2019)
Lin, J., Li, H., Chen, K., Lu, J., Jia, K.: Sparse steerable convolutions: an efficient learning of SE(3)-equivariant features for estimation and tracking of object poses in 3D space. Adv. Neural. Inf. Process. Syst. 34, 16779–16790 (2021)
Lin, J., Wei, Z., Ding, C., Jia, K.: Category-level 6D object pose and size estimation using self-supervised deep prior deformation networks. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision (ECCV). LNCS, vol. 13669, pp. 19–34. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_2
Lin, J., Wei, Z., Li, Z., Xu, S., Jia, K., Li, Y.: DualPoseNet: category-level 6D object pose and size estimation using dual pose network with refined learning of pose consistency. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3560–3569 (2021)
Lin, J., Wei, Z., Zhang, Y., Jia, K.: VI-Net: boosting category-level 6D object pose estimation via learning decoupled rotations on the spherical representations. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14001–14011 (2023)
Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2999–3007 (2017). https://api.semanticscholar.org/CorpusID:47252984
Lindenberger, P., Sarlin, P.E., Pollefeys, M.: LightGlue: local feature matching at light speed. arXiv preprint arXiv:2306.13643 (2023)
Liu, X., Wang, G., Li, Y., Ji, X.: CATRE: iterative point clouds alignment for category-level object pose refinement. In: Avidan, S., Brostow, G., Cisse, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision (ECCV), pp. 499–516. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20086-1_29
Manhardt, F., et al.: CPS++: improving class-level 6D pose and shape estimation from monocular images with self-supervised learning. arXiv preprint arXiv:2003.05848 (2020)
Brown, T., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. arXiv abs/2304.07193 (2023). https://api.semanticscholar.org/CorpusID:258170077
Pan, P., Fan, Z., Feng, B.Y., Wang, P., Li, C., Wang, Z.: Learning to estimate 6DoF pose from limited data: a few-shot, generalizable approach using RGB images. arXiv preprint arXiv:2306.07598 (2023)
Park, K., Patten, T., Vincze, M.: Pix2Pose: pixel-wise coordinate regression of objects for 6D pose estimation. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7668–7677 (2019)
Peng, S., Liu, Y., Huang, Q., Zhou, X., Bao, H.: PVNet: pixel-wise voting network for 6DoF pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4561–4570 (2019)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 652–660 (2017)
Qin, Z., Yu, H., Wang, C., Guo, Y., Peng, Y., Xu, K.: Geometric transformer for fast and robust point cloud registration. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11143–11152 (2022)
Rad, M., Lepetit, V.: BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: IEEE International Conference on Computer Vision (ICCV), pp. 3828–3836 (2017)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: learning feature matching with graph neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4938–4947 (2020)
Song, C., Song, J., Huang, Q.: HybridPose: 6D object pose estimation under hybrid representations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 431–440 (2020)
Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 567–576 (2015). https://api.semanticscholar.org/CorpusID:6242669
Sundermeyer, M., Marton, Z.C., Durner, M., Brucker, M., Triebel, R.: Implicit 3D orientation learning for 6D object detection from RGB images. In: European Conference on Computer Vision (ECCV), pp. 699–715 (2018)
Tian, M., Ang, M.H., Lee, G.H.: Shape prior deformation for categorical 6D object pose and size estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 530–546. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58589-1_32
Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 13, 376–380 (1991). https://api.semanticscholar.org/CorpusID:206421766
Wang, C., et al.: DenseFusion: 6D object pose estimation by iterative dense fusion. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3343–3352 (2019)
Wang, G., Manhardt, F., Shao, J., Ji, X., Navab, N., Tombari, F.: Self6D: self-supervised monocular 6D object pose estimation. In: European Conference on Computer Vision (ECCV) (2020). https://api.semanticscholar.org/CorpusID:215754192
Wang, G., Manhardt, F., Tombari, F., Ji, X.: GDR-Net: geometry-guided direct regression network for monocular 6D object pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16611–16621 (2021)
Wang, H., Sridhar, S., Huang, J., Valentin, J.P.C., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2637–2646 (2019). https://api.semanticscholar.org/CorpusID:57761160
Wang, J., Chen, K., Dou, Q.: Category-level 6D object pose estimation via cascaded relation and recurrent reconstruction networks. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4807–4814. IEEE (2021)
Wang, P., Garattoni, L., Meier, S., Navab, N., Busam, B.: CroCPS: addressing photometric challenges in self-supervised category-level 6D object poses with cross-modal learning. In: British Machine Vision Conference (2022). https://api.semanticscholar.org/CorpusID:256903232
Wang, P., et al.: PhoCaL: a multi-modal dataset for category-level object pose estimation with photometrically challenging objects. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21222–21231 (2022)
Wang, P., et al.: DemoGrasp: few-shot learning for robotic grasping with human demonstration. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5733–5740. IEEE (2021)
Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
Xu, Y., Lin, K.Y., Zhang, G., Wang, X., Li, H.: RNNPose: recurrent 6-DoF object pose refinement with robust correspondence field estimation and pose optimization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14880–14890. IEEE, June 2022
You, Y., He, W., Liu, J., Xiong, H., Wang, W., Lu, C.: CPPF++: uncertainty-aware Sim2Real object pose estimation by vote aggregation. IEEE Trans. Pattern Anal. Mach. Intell. (2024)
You, Y., Shi, R., Wang, W., Lu, C.: CPPF: towards robust category-level 9D pose estimation in the wild. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6856–6865 (2022). https://api.semanticscholar.org/CorpusID:247291938
Zakharov, S., Shugurov, I., Ilic, S.: DPOD: 6D pose object detector and refiner. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1941–1950 (2019)
Ze, Y., Wang, X.: Category-level 6D object pose estimation in the wild: a semi-supervised learning approach and a new dataset. Adv. Neural. Inf. Process. Syst. 35, 27469–27483 (2022)
Zhang, R., et al.: RBP-Pose: residual bounding box projection for category-level pose estimation. arXiv abs/2208.00237 (2022). https://api.semanticscholar.org/CorpusID:251223949
Zhao, C., Hu, Y., Salzmann, M.: Fusing local similarities for retrieval-based 3D orientation estimation of unseen objects. arXiv abs/2203.08472 (2022). https://api.semanticscholar.org/CorpusID:247475898
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, P., Ikeda, T., Lee, R., Nishiwaki, K. (2025). GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15085. Springer, Cham. https://doi.org/10.1007/978-3-031-73383-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73382-6
Online ISBN: 978-3-031-73383-3