Abstract
Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction: training perception and prediction models to understand open-set moving objects with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline that provides the supervision automatically. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive with a counterpart that uses supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.
M. Najibi and J. Ji—Equal contribution.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Najibi, M. et al. (2022). Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_25
DOI: https://doi.org/10.1007/978-3-031-19839-7_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19838-0
Online ISBN: 978-3-031-19839-7
eBook Packages: Computer Science, Computer Science (R0)