Abstract
Current self-supervised learning methods for 3D scenes face a data desert issue, resulting from the time-consuming and expensive process of collecting 3D scene data. In contrast, 3D shape datasets are much easier to collect. Even so, existing pre-training strategies on shape data offer limited potential for 3D scene understanding, owing to the significant disparity in point quantities between shapes and scenes. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multi-scale, high-resolution backbones for shape-level and scene-level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establish direct paths to high-resolution features that capture deep semantic information across multiple scales, making them suitable for a wide range of 3D downstream tasks that rely heavily on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) that amalgamates points from various shapes into random pseudo scenes (each comprising multiple objects) used as training data, mitigating the disparity between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied to pre-train MH-P/V; in PPC, the inherent correspondence (i.e., point pairs) is obtained naturally from S2SS. Extensive experiments demonstrate the transferability of the 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart), and MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection.
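To make the Shape-to-Scene idea concrete, here is a minimal sketch of composing a pseudo scene: several independently normalized shape point clouds receive random rigid placements and are merged into one scene-like cloud. The function name shape_to_scene, the ground-plane constraint, and the placement range are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def shape_to_scene(shapes, scene_range=4.0, rng=None):
    """Compose a pseudo scene from several shape point clouds.

    shapes: list of (N_i, 3) arrays, each a normalized 3D shape.
    Returns (M, 3) scene points and (M,) per-point instance ids.
    """
    rng = np.random.default_rng() if rng is None else rng
    points, ids = [], []
    for i, pts in enumerate(shapes):
        # Random rigid placement: rotation about the up axis plus a
        # translation inside the pseudo-scene extent.
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        offset = rng.uniform(-scene_range, scene_range, size=3)
        offset[2] = 0.0  # assumption: objects rest on a ground plane
        points.append(pts @ rot.T + offset)
        ids.append(np.full(len(pts), i))
    return np.concatenate(points), np.concatenate(ids)
```

Given such a pseudo scene, point pairs for the point-point contrastive loss come for free: the same composed points, encoded under two different augmentations, are positives by construction. The sketch below assumes an InfoNCE-style formulation with in-batch negatives; the paper's precise loss may differ in details such as temperature and negative sampling.

```python
import torch
import torch.nn.functional as F

def ppc_loss(feat_a, feat_b, temperature=0.07):
    """InfoNCE-style point-point contrastive loss.

    feat_a, feat_b: (M, C) features of the SAME M pseudo-scene points
    under two augmented views; row i of each tensor forms a positive
    pair, and all other rows act as negatives.
    """
    feat_a = F.normalize(feat_a, dim=1)
    feat_b = F.normalize(feat_b, dim=1)
    logits = feat_a @ feat_b.t() / temperature  # (M, M) similarities
    targets = torch.arange(feat_a.size(0), device=feat_a.device)
    # Symmetric cross-entropy: each point must retrieve its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```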
Acknowledgements.
This work was supported by a China Scholarship Council (CSC) scholarship.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Feng, T., Wang, W., Quan, R., Yang, Y. (2025). Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15113. Springer, Cham. https://doi.org/10.1007/978-3-031-73001-6_5
DOI: https://doi.org/10.1007/978-3-031-73001-6_5
Print ISBN: 978-3-031-73000-9
Online ISBN: 978-3-031-73001-6