PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation

  • Conference paper
  • Published in: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction. Recent methods predict sub-goals on a constructed topological map at each step to enable long-term action planning. However, they incur high computational cost when supporting such high-level predictions with GCN-like models. In this work, we propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories, where a directed fidelity trajectory is a detour-free path from the initial node to a candidate location on a directed graph. This planning strategy yields an efficient model with strong performance. Specifically, we introduce a directed graph to represent the explored area of the environment, emphasizing directionality. We then define the trajectory representation as a sequence of directed edge features, which are extracted from the panorama according to the corresponding orientation. Finally, we compare the alignment between the instruction and the candidate trajectories at each step of navigation to determine the next navigation target. Our method outperforms the previous state-of-the-art method BEVBert on the RxR dataset and is comparable on the R2R dataset, while greatly reducing computational cost. Code is available at: https://github.com/iSEE-Laboratory/VLN-PRET.
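
As a rough illustration of the planning strategy described above, the sketch below builds a directed graph of the explored area, represents each candidate's directed fidelity trajectory as the sequence of directed-edge features on the detour-free path from the start node to that candidate, and scores each trajectory against the instruction. This is a minimal sketch, not the released VLN-PRET code: the `networkx` graph, the function and node names, the heading-indexed panorama lookup, and the mean-pooled cosine-similarity scorer are all illustrative assumptions; the actual method uses learned cross-modal alignment between the instruction and the directed edge features.

```python
# Minimal sketch of trajectory-alignment planning (illustrative assumptions,
# not the authors' implementation).
import numpy as np
import networkx as nx


def edge_feature(panorama_feats: np.ndarray, heading_idx: int) -> np.ndarray:
    """Pick the panorama view feature facing the edge's direction.
    Assumes panorama_feats is [num_views, dim] and heading_idx indexes the view."""
    return panorama_feats[heading_idx]


def trajectory_features(graph: nx.DiGraph, start, candidate) -> np.ndarray:
    """Directed fidelity trajectory: edge features along the detour-free
    (shortest) directed path from the start node to the candidate node."""
    path = nx.shortest_path(graph, start, candidate)
    feats = [graph.edges[u, v]["feat"] for u, v in zip(path[:-1], path[1:])]
    return np.stack(feats)  # [path_len, dim]


def alignment_score(instr_emb: np.ndarray, traj_feats: np.ndarray) -> float:
    """Toy alignment: cosine similarity between the instruction embedding and the
    mean-pooled trajectory features (stand-in for the paper's learned scorer)."""
    traj = traj_feats.mean(axis=0)
    denom = np.linalg.norm(instr_emb) * np.linalg.norm(traj) + 1e-8
    return float(instr_emb @ traj / denom)


def plan_next_target(graph, start, candidates, instr_emb):
    """Compare instruction/trajectory alignment for every candidate node
    and return the one whose trajectory aligns best."""
    scores = {c: alignment_score(instr_emb, trajectory_features(graph, start, c))
              for c in candidates}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy usage: a 12-view panorama with 4-d features per view.
    rng = np.random.default_rng(0)
    pano = lambda: rng.normal(size=(12, 4))
    g = nx.DiGraph()
    g.add_edge("start", "a", feat=edge_feature(pano(), heading_idx=3))
    g.add_edge("a", "b", feat=edge_feature(pano(), heading_idx=7))
    g.add_edge("start", "c", feat=edge_feature(pano(), heading_idx=0))
    instr_emb = rng.normal(size=4)
    print(plan_next_target(g, "start", ["b", "c"], instr_emb))
```

Because each candidate is scored by re-reading its full trajectory rather than by message passing over the whole map, the per-step cost stays proportional to path length, which is the efficiency argument the abstract makes against GCN-like planners.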


References

  1. An, D., Qi, Y., Huang, Y., Wu, Q., Wang, L., Tan, T.: Neighbor-view enhanced model for vision and language navigation. In: ACM MM (2021)
  2. An, D., et al.: BEVBert: multimodal map pre-training for language-guided navigation. In: ICCV (2023)
  3. Anderson, P., et al.: On evaluation of embodied navigation agents. CoRR (2018)
  4. Anderson, P., Wu, Q., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)
  5. Chaplot, D.S., Gandhi, D., Gupta, A., Salakhutdinov, R.: Object goal navigation using goal-oriented semantic exploration. In: NeurIPS (2020)
  6. Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2020)
  7. Chen, S., Guhur, P.L., Schmid, C., Laptev, I.: History aware multimodal transformer for vision-and-language navigation. In: NeurIPS (2021)
  8. Chen, S., Guhur, P., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: dual-scale graph transformer for vision-and-language navigation. In: CVPR (2022)
  9. Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Yu., Liu, J.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
  10. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: ACL (2020)
  11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
  12. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)
  13. Dou, Z., Gao, F., Peng, N.: Masked path modeling for vision-and-language navigation. CoRR (2023)
  14. Fried, D., et al.: Speaker-follower models for vision-and-language navigation. In: NeurIPS (2018)
  15. Fuentes-Pacheco, J., Ascencio, J.R., Rendón-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. (2015)
  16. Gao, C., et al.: Adaptive zone-aware hierarchical planner for vision-language navigation. In: CVPR (2023)
  17. Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: CVPR (2020)
  18. Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: VLN BERT: a recurrent vision-and-language BERT for navigation. In: CVPR, pp. 1643–1653 (2021)
  19. Huo, J., Sun, Q., Jiang, B., Lin, H., Fu, Y.: GeoVLN: learning geometry-enhanced visual representation with slot attention for vision-and-language navigation. In: CVPR (2023)
  20. Hwang, M., Jeong, J., Kim, M., Oh, Y., Oh, S.: Meta-explore: exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In: CVPR (2023)
  21. Ilharco, G., Jain, V., Ku, A., Ie, E., Baldridge, J.: General evaluation for instruction conditioned navigation using dynamic time warping. In: NeurIPS Workshop (2019)
  22. Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: ICML (2021)
  23. Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: EMNLP (2020)
  24. Li, J., Tan, H., Bansal, M.: EnvEdit: environment editing for vision-and-language navigation. In: CVPR (2022)
  25. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)
  26. Li, X., et al.: Robust navigation with language pretraining and stochastic sampling. In: EMNLP-IJCNLP (2019)
  27. Lin, C., Jiang, Y., Cai, J., Qu, L., Haffari, G., Yuan, Z.: Multimodal transformer with variable-length memory for vision-and-language navigation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13696, pp. 380–397. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20059-5_22
  28. Liu, C., Zhu, F., Chang, X., Liang, X., Ge, Z., Shen, Y.: Vision-language navigation with random environmental mixup. In: ICCV (2021)
  29. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  30. Ma, C.Y., et al.: Self-monitoring navigation agent via auxiliary progress estimation. In: ICLR (2019)
  31. Ma, C., Wu, Z., AlRegib, G., Xiong, C., Kira, Z.: The regretful agent: heuristic-aided navigation through progress estimation. In: CVPR (2019)
  32. Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 259–274. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_16
  33. Oquab, M., et al.: DINOv2: learning robust visual features without supervision. CoRR (2023)
  34. Pope, R., et al.: Efficiently scaling transformer inference. CoRR (2022)
  35. Qi, Y., et al.: REVERIE: remote embodied visual referring expression in real indoor environments. In: CVPR (2020)
  36. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  37. Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: ACL (2016)
  38. Shen, S., et al.: How much can CLIP benefit vision-and-language tasks? In: ICLR (2022)
  39. Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: NAACL (2019)
  40. Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: CoRL (2020)
  41. Thrun, S.: Learning metric-topological maps for indoor mobile robot navigation. Artif. Intell. (1998)
  42. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
  43. Wang, H., Wang, W., Liang, W., Xiong, C., Shen, J.: Structured scene memory for vision-language navigation. In: CVPR (2021)
  44. Wang, S., et al.: Less is more: generating grounded navigation instructions from landmarks. In: CVPR (2022)
  45. Wang, X., et al.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: CVPR (2019)
  46. Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: GridMM: grid memory map for vision-and-language navigation. In: ICCV (2023)
  47. Zhu, F., Zhu, Y., Chang, X., Liang, X.: Vision-language navigation with self-supervised auxiliary reasoning tasks. In: CVPR (2020)

Acknowledgments

This work was partially supported by the National Key Research and Development Program of China (2023YFA1008503), NSFC (U21A20471, 62206315), Guangdong NSF Project (No. 2023B1515040025, 2020B1515120085, 2024A1515-010101), and the Guangzhou Basic and Applied Basic Research Scheme (2024A04J4067).

Author information

Correspondence to Jingke Meng.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lu, R., Meng, J., Zheng, WS. (2025). PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72848-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72847-1

  • Online ISBN: 978-3-031-72848-8

  • eBook Packages: Computer Science, Computer Science (R0)
