GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15085)

Included in the conference series: ECCV – European Conference on Computer Vision

Abstract

Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Deep-learning-based approaches have recently made great progress, but they are typically hindered by the need for large datasets of either pose-labelled real images or carefully tuned photorealistic simulators. Relying only on geometric inputs such as depth images reduces this domain gap, but such approaches suffer from a lack of semantic information, which can be vital for pose estimation. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D semantic features onto object models as 3D semantic point clouds. Based on this novel 3D representation, we further propose a self-supervision pipeline that matches the fused semantic point clouds against partial observations rendered from synthetic object models. The knowledge learned from synthetic data generalizes to observations of unseen objects in real scenes without any fine-tuning. We demonstrate this with a rich evaluation on the NOCS, Wild6D and SUN RGB-D benchmarks, showing superior performance over geometry-only and semantics-only baselines with significantly fewer training objects.
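To make the pipeline described above concrete, the sketch below illustrates its three ingredients: back-projecting per-pixel semantic features (e.g. from a frozen foundation-model backbone) through a depth map into a 3D semantic point cloud, matching two such clouds by feature similarity, and recovering a similarity transform (rotation, translation, scale) from the matched points. This is a minimal NumPy sketch under assumed names and shapes, not the authors' implementation; the brute-force matcher and the Umeyama closed-form fit are standard stand-ins, and a robust estimator such as RANSAC would normally filter the correspondences before the fit.

```python
import numpy as np


def backproject_semantic_cloud(depth, feats, K, mask):
    """Lift masked pixels with valid depth into camera space, attaching
    the per-pixel feature to each 3D point.

    depth: (H, W) metric depth map; feats: (H, W, C) per-pixel features;
    K: (3, 3) camera intrinsics; mask: (H, W) boolean object mask.
    Returns points (N, 3) and their features (N, C).
    """
    vs, us = np.nonzero(mask & (depth > 0))
    z = depth[vs, us]
    x = (us - K[0, 2]) * z / K[0, 0]  # standard pinhole back-projection
    y = (vs - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1), feats[vs, us]


def match_by_features(feat_src, feat_tgt):
    """Brute-force nearest neighbours in feature space: for each source
    point, the index of the most cosine-similar target point."""
    fs = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    return np.argmax(fs @ ft.T, axis=1)


def umeyama_similarity(src, tgt):
    """Closed-form similarity transform (Umeyama, 1991): returns s, R, t
    with tgt_i ~ s * R @ src_i + t, i.e. rotation + translation + scale."""
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    xs, xt = src - mu_s, tgt - mu_t
    cov = xt.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (xs ** 2).sum(axis=1).mean()
    t = mu_t - s * R @ mu_s
    return s, R, t


# Hypothetical usage: obs_* come from a real RGB-D view, mdl_* from a
# rendered view of a synthetic object model.
# pts_o, f_o = backproject_semantic_cloud(obs_depth, obs_feats, K, obs_mask)
# pts_m, f_m = backproject_semantic_cloud(mdl_depth, mdl_feats, K, mdl_mask)
# idx = match_by_features(f_o, f_m)
# s, R, t = umeyama_similarity(pts_o, pts_m[idx])
```

The similarity fit recovers scale alongside rotation and translation, which is why correspondence-based category-level methods can report object size as well as 6D pose; the sketch omits outlier rejection and any learned refinement for brevity.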



Author information

Corresponding author: Pengyuan Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9973 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, P., Ikeda, T., Lee, R., Nishiwaki, K. (2025). GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15085. Springer, Cham. https://doi.org/10.1007/978-3-031-73383-3_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73383-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73382-6

  • Online ISBN: 978-3-031-73383-3

  • eBook Packages: Computer Science, Computer Science (R0)
