DAttNet: monocular depth estimation network based on attention mechanisms | Neural Computing and Applications

DAttNet: monocular depth estimation network based on attention mechanisms

  • Review
Neural Computing and Applications

Abstract

As autonomous vehicles move closer to everyday use, the need for architectures that function as redundant pipelines is becoming increasingly critical. To address this need without straining the budget, researchers aim to avoid duplicating high-cost sensors such as LiDARs. In this work, we propose using monocular cameras, which are already essential for some modules of the autonomous platform, for 3D scene understanding. While many methods for depth estimation from single images have been proposed in the literature, they usually rely on complex neural network ensembles that extract dense feature maps, resulting in a high computational cost. Instead, we propose a novel and inherently efficient method for obtaining depth images that replaces tangled neural architectures with attention mechanisms applied to basic encoder–decoder models. We evaluate our method on the KITTI public dataset and in real-world experiments on our automated vehicle. The results demonstrate the viability of our approach, which can compete with intricate state-of-the-art methods while outperforming most alternatives based on attention mechanisms.
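To make the idea of attention applied to encoder features more concrete, the following is a minimal sketch of scaled dot-product self-attention over a flattened feature map. It is an illustration only and not the DAttNet module: the abstract does not describe the attention design, so the function name `self_attention`, the identity query/key/value projections, and the residual connection are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with a residual connection.

    features: list of N position vectors, each a list of C floats
              (a feature map of shape H x W x C flattened to N = H * W).
    Assumption: queries, keys and values are the features themselves;
    a trained network would use learned projections instead.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    output = []
    for query in features:
        # Similarity of this position to every position in the map.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        weights = softmax(scores)
        # Weighted sum of all position vectors (the attended context).
        context = [sum(w * value[c] for w, value in zip(weights, features))
                   for c in range(dim)]
        # Residual connection: add the context to the original feature.
        output.append([x + ctx for x, ctx in zip(query, context)])
    return output
```

Each output position mixes information from every other position, which is what lets a plain encoder–decoder capture long-range context without additional dense branches. The cost grows quadratically with the number of positions, so in practice such a block would typically sit on a low-resolution feature map.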

Data availability

The authors declare that the dataset used for training and validating the results presented in this study is openly accessible and available at: https://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction [26].
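As a rough illustration of how predictions are commonly scored against KITTI depth ground truth, the sketch below computes three error measures that are widely reported in the monocular depth literature: absolute relative error, RMSE, and the δ < 1.25 accuracy. This page does not state which metrics the authors report, so the choice of metrics, the function name `depth_metrics`, and the `min_depth` threshold are assumptions. Pixels with non-positive ground truth are skipped because LiDAR-derived depth maps are sparse.

```python
import math


def depth_metrics(pred, gt, min_depth=1e-3):
    """Compare predicted and ground-truth depths given as flat lists of metres.

    Assumptions: predictions are strictly positive, and ground-truth
    values at or below min_depth mark pixels without a valid measurement.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > min_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # Mean absolute error relative to the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error in metres.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose ratio to the truth is below 1.25.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

For example, `depth_metrics([10.0, 20.0, 5.0], [10.0, 25.0, 0.0])` ignores the third pixel, since its ground truth is missing, and evaluates only the first two.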

Notes

  1. Additional results: https://www.youtube.com/watch?v=pQDc_AimYiU.

References

  1. Beltrán J, Guindel C, Cortés I, Barrera A, Astudillo A, Urdiales J, Álvarez M, Bekka F, Milanés V, García F (2020) Towards autonomous driving: a multi-modal 360° perception proposal. In: 2020 IEEE 23rd international conference on intelligent transportation systems (ITSC), pp. 3295–3300. https://doi.org/10.1109/ITSC45102.2020.9294494

  2. Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) PnPNet: End-to-end perception and prediction with tracking in the loop. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 11550–11559. https://doi.org/10.1109/CVPR42600.2020.01157

  3. Astudillo A, Molina N, Cortés I, Mahtout I, González D, Beltrán J, Guindel C, Barrera A, Álvarez M, Zinoune C, Milanés V, García F (2021) Visibility-aware adaptative speed planner for human-like navigation in roundabouts. In: 2021 IEEE International intelligent transportation systems conference (ITSC), pp. 885–890. https://doi.org/10.1109/ITSC48978.2021.9564451

  4. Pei L, Rui Z (2015) The analysis of stereo vision 3D point cloud data of autonomous vehicle obstacle recognition. In: 2015 7th International conference on intelligent human-machine systems and cybernetics, vol. 2, pp. 207–210. https://doi.org/10.1109/IHMSC.2015.192

  5. Doval GN, Al-Kaff A, Beltrán J, Fernández FG, Fernández López G (2019) Traffic sign detection and 3D localization via deep convolutional neural networks and stereo vision. In: 2019 IEEE intelligent transportation systems conference (ITSC), pp. 1411–1416. https://doi.org/10.1109/ITSC.2019.8916958

  6. Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2020) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 12472–12482. https://doi.org/10.1109/CVPR42600.2020.01249

  7. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018. Springer, Cham, pp 833–851

  8. Miguel MA, Moreno FM, Marín-Plaza P, Al-Kaff A, Palos M, Martín Gómez D, Encinar-Martín R, Garcia F (2020) A research platform for autonomous vehicles technologies research in the insurance sector. Appl Sci 10:5655. https://doi.org/10.3390/app10165655

  9. Scharstein D, Szeliski R, Zabih R (2001) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proceedings IEEE workshop on stereo and multi-baseline vision (SMBV 2001), pp. 131–140. https://doi.org/10.1109/SMBV.2001.988771

  10. Khamis S, Fanello S, Rhemann C, Kowdle A, Valentin J, Izadi S (2018) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018. Springer, Cham, pp 596–613

  11. Xu H, Zhang J (2020) AANet: Adaptive aggregation network for efficient stereo matching. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 1956–1965. https://doi.org/10.1109/CVPR42600.2020.00203

  12. Godard C, Aodha OM, Firman M, Brostow G (2019) Digging into self-supervised monocular depth estimation. In: 2019 IEEE/CVF International conference on computer vision (ICCV), pp. 3827–3837. https://doi.org/10.1109/ICCV.2019.00393

  13. Lee JH, Han MK, Ko DW, Suh IH (2021) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv. arXiv:1907.10326 [cs]. https://doi.org/10.48550/arXiv.1907.10326

  14. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Patt Anal Mach Intell 40(4):834–848. https://doi.org/10.1109/TPAMI.2017.2699184

  15. Galassi A, Lippi M, Torroni P (2021) Attention in natural language processing. IEEE Trans Neural Netw Learn Syst 32(10):4291–4308. https://doi.org/10.1109/TNNLS.2020.3019893

  16. Lu Y, Hao X, Li Y, Chai W, Sun S, Velipasalar S (2022) Range-aware attention network for lidar-based 3d object detection with auxiliary point density level estimation. arXiv. arXiv:2111.09515 [cs]. https://doi.org/10.48550/arXiv.2111.09515

  17. Chen Y, Zhao H, Hu Z, Peng J (2021) Attention-based context aggregation network for monocular depth estimation. Int J Mach Learn Cybern 12:1583–1596. https://doi.org/10.1007/s13042-020-01251-y

  18. Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular depth estimation. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp. 2002–2011. https://doi.org/10.1109/CVPR.2018.00214

  19. Song X, Li W, Zhou D, Dai Y, Fang J, Li H, Zhang L (2021) MLDA-Net: multi-level dual attention-based network for self-supervised monocular depth estimation. IEEE Trans Image Process 30:4691–4705. https://doi.org/10.1109/TIP.2021.3074306

  20. Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp. 257–265. https://doi.org/10.1109/CVPR.2017.35

  21. Wang Y, Ying X, Wang L, Yang J, An W, Guo Y (2021) Symmetric parallax attention for stereo image super-resolution. In: 2021 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), pp. 766–775. https://doi.org/10.1109/CVPRW53098.2021.00086

  22. Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: International conference on machine learning, PMLR, pp. 7354–7363

  23. Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27

  24. Chiang TH, Chiang MH, Tsai MH, Chang CC (2022) Attention-based background/foreground monocular depth prediction model using image segmentation. Appl Sci 12(21):11186. https://doi.org/10.3390/app122111186

  25. Yan J, Zhao H, Bu P, Jin Y (2021) Channel-wise attention-based network for self-supervised monocular depth estimation. In: 2021 International conference on 3d vision (3DV), pp. 464–473. https://doi.org/10.1109/3DV53792.2021.00056

  26. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In: 2012 IEEE Conference on computer vision and pattern recognition, pp. 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074

  27. Xiao P, Shao Z, Hao S, Zhang Z, Chai X, Jiao J, Li Z, Wu J, Sun K, Jiang K, Wang Y, Yang D (2021) Pandaset: Advanced sensor suite dataset for autonomous driving. In: 2021 IEEE International intelligent transportation systems conference (ITSC), pp. 3095–3101. https://doi.org/10.1109/ITSC48978.2021.9565009

  28. Hormann K (2014) Barycentric interpolation. In: Fasshauer GE, Schumaker LL (eds) Approximation Theory XIV: San Antonio 2013. Springer, Cham, pp 197–218

Acknowledgements

This work has been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M (“Fostering Young Doctors Research”, APBI-CM-UC3M), and in the context of the V PRICIT (Research and Technological Innovation Regional Programme). Carlos Guindel acknowledges the support of the Ministry of Universities and the Universidad Carlos III de Madrid’s Call for Grants for the requalification of the Spanish University System for 2021-2023, based on Royal Decree 289/2021 of April 20, 2021, which regulates the direct granting of subsidies to public universities for the requalification of the Spanish university system. This work has also been supported by the Spanish Government through the projects ID2021-128327OA-I00, PID2021-124335OB-C21 and TED2021-129374A-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

Author information

Corresponding author

Correspondence to Armando Astudillo.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Astudillo, A., Barrera, A., Guindel, C. et al. DAttNet: monocular depth estimation network based on attention mechanisms. Neural Comput & Applic 36, 3347–3356 (2024). https://doi.org/10.1007/s00521-023-09210-8

  • DOI: https://doi.org/10.1007/s00521-023-09210-8
