Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but they are typically hindered by the need for large datasets of pose-labelled real images or for carefully tuned photorealistic simulators. Using only geometric inputs such as depth images avoids this requirement by reducing the domain gap, but such approaches lack semantic information, which can be vital for pose estimation. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D semantic features onto object models to form 3D semantic point clouds. Based on this novel 3D representation, we further propose a self-supervision pipeline that matches the fused semantic point clouds against partial observations rendered from synthetic object models. The knowledge learned from synthetic data generalizes to observations of unseen objects in real scenes without any fine-tuning. We demonstrate this with an extensive evaluation on the NOCS, Wild6D and SUN RGB-D benchmarks, showing superior performance over geometry-only and semantics-only baselines with significantly fewer training objects.
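As a rough illustration of what lifting 2D semantic features into a 3D semantic point cloud can look like, the sketch below back-projects the pixels of a depth image with a pinhole camera model and attaches the feature vector found at each pixel. This is a generic construction under stated assumptions, not necessarily the procedure used in the paper: the function name `lift_features_to_points`, the per-pixel feature map, and the intrinsics layout are all hypothetical choices made for the example.

```python
import numpy as np

def lift_features_to_points(depth, features, K, mask=None):
    """Back-project valid pixels to 3D and attach their feature vectors.

    Assumptions (not taken from the paper):
      depth    -- (H, W) array, depth along the camera z-axis, 0 where invalid
      features -- (H, W, C) per-pixel semantic features, e.g. patch features
                  from a pre-trained model upsampled to image resolution
      K        -- (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
      mask     -- optional (H, W) object mask
    Returns (N, 3) points in the camera frame and the matching (N, C) features.
    """
    depth = np.asarray(depth, dtype=float)
    features = np.asarray(features)
    K = np.asarray(K, dtype=float)

    valid = depth > 0
    if mask is not None:
        valid = valid & np.asarray(mask, dtype=bool)

    v, u = np.nonzero(valid)              # v: row index, u: column index
    z = depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Standard pinhole back-projection.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    return points, features[v, u]
```

Each returned point carries both its 3D position and a semantic descriptor, so a downstream matcher can compare points on geometry and semantics together.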
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, P., Ikeda, T., Lee, R., Nishiwaki, K. (2025). GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15085. Springer, Cham. https://doi.org/10.1007/978-3-031-73383-3_7
DOI: https://doi.org/10.1007/978-3-031-73383-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73382-6
Online ISBN: 978-3-031-73383-3
eBook Packages: Computer Science, Computer Science (R0)