
Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Current self-supervised learning methods for 3D scenes face a data-desert issue caused by the time-consuming and expensive process of collecting 3D scene data. In contrast, 3D shape datasets are much easier to collect. Even so, existing pre-training strategies on shape data offer limited potential for 3D scene understanding because of the large disparity in point quantities. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multi-scale, high-resolution backbones for shape- and scene-level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establish direct paths to high-resolution features that capture deep semantic information across multiple scales, which makes them suitable for a wide range of 3D downstream tasks that rely heavily on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) that amalgamates points from various shapes into random pseudo scenes (each comprising multiple objects) as training data, mitigating the disparity between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied to pre-train MH-P/V; in PPC, the point-pair correspondence is obtained naturally from S2SS. Extensive experiments demonstrate the transferability of the 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart), and MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection.
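The two mechanisms named in the abstract, mixing shape-level point clouds into a pseudo scene (S2SS) and contrasting corresponding points across two views (PPC), can be illustrated with a short sketch. The code below is a minimal PyTorch illustration and not the authors' implementation: the function names, the random yaw-plus-translation placement, the jitter augmentation, the stand-in linear "backbone", and the temperature value are all illustrative assumptions; the actual S2SS and PPC details are defined in the paper.

```python
# Minimal sketch (assumptions throughout, not the authors' code) of:
# (1) Shape-to-Scene mixing: place several shape point clouds into one pseudo scene,
#     so point correspondences across two augmented views come for free by construction;
# (2) a point-point contrastive (InfoNCE-style) loss over those corresponding points.
import math
import torch
import torch.nn.functional as F


def shapes_to_pseudo_scene(shapes, scene_extent=8.0):
    """Place shape point clouds at random poses inside one pseudo scene.

    shapes: list of (N_i, 3) tensors, each a normalized 3D shape.
    Returns two jittered views of the scene and the index correspondence between them.
    """
    placed = []
    for pts in shapes:
        # Assumed augmentation: random yaw rotation plus a translation in the ground plane.
        theta = torch.rand(1).item() * 2 * math.pi
        rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                            [math.sin(theta),  math.cos(theta), 0.0],
                            [0.0,              0.0,              1.0]])
        shift = torch.cat([(torch.rand(2) - 0.5) * scene_extent, torch.zeros(1)])
        placed.append(pts @ rot.T + shift)
    scene = torch.cat(placed, dim=0)                      # (N, 3) pseudo scene
    # Two views of the same pseudo scene; jitter stands in for the real augmentations.
    view_a = scene + 0.01 * torch.randn_like(scene)
    view_b = scene + 0.01 * torch.randn_like(scene)
    pairs = torch.arange(scene.shape[0])                  # point i in view A matches point i in view B
    return view_a, view_b, pairs


def point_point_contrastive(feat_a, feat_b, temperature=0.07):
    """InfoNCE over corresponding points: row i of feat_a should match row i of feat_b."""
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    logits = feat_a @ feat_b.T / temperature              # (N, N) similarity matrix
    targets = torch.arange(feat_a.shape[0])               # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    shapes = [torch.randn(512, 3) * 0.5 for _ in range(4)]    # four toy "shapes"
    view_a, view_b, pairs = shapes_to_pseudo_scene(shapes)
    # A real setup would extract per-point features with MH-P or MH-V here;
    # a random linear projection stands in for the backbone in this sketch.
    proj = torch.nn.Linear(3, 32)
    loss = point_point_contrastive(proj(view_a), proj(view_b))
    print(f"toy PPC loss: {loss.item():.3f}")
```

Because both views are built from the same assembled pseudo scene, the positive point pairs needed by the contrastive loss require no registration or nearest-neighbor search, which is the "naturally obtained correspondence" the abstract refers to.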



Acknowledgements

This work was supported by a China Scholarship Council (CSC) scholarship.

Author information


Corresponding author

Correspondence to Yi Yang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 262 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Feng, T., Wang, W., Quan, R., Yang, Y. (2025). Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15113. Springer, Cham. https://doi.org/10.1007/978-3-031-73001-6_5


  • DOI: https://doi.org/10.1007/978-3-031-73001-6_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73000-9

  • Online ISBN: 978-3-031-73001-6

  • eBook Packages: Computer Science, Computer Science (R0)
