Abstract
Lift-Splat-Shoot-based 3D object detection systems predict the targets' bounding boxes from images by leveraging an explicit depth distribution that facilitates coherence between the depth and detection modules. Whereas conventional end-to-end models prioritize minimizing the disparity between estimated and ground-truth depth maps, our study underscores the intrinsic value of the depth distribution itself. To exploit this perspective, we introduce a novel two-stage training paradigm that optimizes the depth and detection modules separately, taking a targeted approach to refining the depth distribution for 3D object detection. In the first stage, we train the depth module for precise depth estimation, supplemented by an auxiliary detection module that provides additional supervisory feedback on detection accuracy; this auxiliary component is discarded once it has served its purpose of improving the depth distribution. In the second stage, with the depth module's parameters fixed, we train a fresh detection module from scratch under direct detection supervision. Additionally, a trainable, lightweight depth adapter is inserted after the depth module to further adapt and polish the depth distribution, aligning it more closely with the detection objective. Experiments on the nuScenes dataset show that our approach significantly surpasses baseline models, achieving a notable 1.13% improvement on the NDS metric.
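To make the training procedure concrete, below is a minimal PyTorch sketch of the two stages. It is an illustrative reconstruction from the abstract alone: the module and function names (depth_net, aux_det_head, DepthAdapter, the loss callables) and the adapter's residual two-convolution design are assumptions, not the paper's implementation.

```python
# Minimal sketch of the two-stage paradigm described above, assuming
# hypothetical module/loss names; the paper's actual architecture and
# losses may differ.
import torch
import torch.nn as nn


class DepthAdapter(nn.Module):
    """Lightweight adapter that refines the frozen depth module's
    per-pixel depth-distribution logits for the detection objective."""

    def __init__(self, num_depth_bins: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(num_depth_bins, num_depth_bins, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(num_depth_bins, num_depth_bins, 1),
        )

    def forward(self, depth_logits: torch.Tensor) -> torch.Tensor:
        # Residual form keeps the adapter a small correction on top of
        # the fixed depth distribution rather than a replacement of it.
        return depth_logits + self.refine(depth_logits)


def train_stage1(depth_net, aux_det_head, depth_loss, det_loss,
                 loader, opt, w_det=1.0):
    """Stage 1: train the depth module under depth supervision; an
    auxiliary detection head adds detection gradient and is then discarded."""
    for imgs, depth_gt, boxes_gt in loader:
        depth_logits = depth_net(imgs)
        loss = depth_loss(depth_logits, depth_gt) \
             + w_det * det_loss(aux_det_head(depth_logits, imgs), boxes_gt)
        opt.zero_grad()
        loss.backward()
        opt.step()


def train_stage2(depth_net, adapter, det_head, det_loss, loader, opt):
    """Stage 2: freeze the depth module; train the adapter and a fresh
    detection head from scratch under detection supervision only."""
    depth_net.requires_grad_(False)
    for imgs, _, boxes_gt in loader:
        with torch.no_grad():
            depth_logits = depth_net(imgs)
        loss = det_loss(det_head(adapter(depth_logits), imgs), boxes_gt)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Passing the losses in as callables keeps the sketch agnostic to the specific depth supervision (e.g., per-bin classification) and detection losses of the underlying LSS-style detector; the essential point is that in stage 2 the depth module receives no gradient, so only the adapter and the fresh detection head align the depth distribution with the detection objective.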
Acknowledgments
This work was supported in part by the Anhui Provincial Major Science and Technology Project (No. 202203a05020016), the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (No. 2023C01143), and the National Key R&D Program of China (No. 2022YFB3303400).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Luo, Y., Huang, Z., Bao, Z. (2024). Adapting Depth Distribution for 3D Object Detection with a Two-Stage Training Paradigm. In: Huang, DS., Pan, Y., Zhang, Q. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14872. Springer, Singapore. https://doi.org/10.1007/978-981-97-5612-4_6
Print ISBN: 978-981-97-5611-7
Online ISBN: 978-981-97-5612-4