Abstract
Vision-and-language navigation requires an agent to navigate according to a natural language instruction. Recent methods predict sub-goals on a constructed topological map at each step to enable long-term action planning. However, they suffer from high computational cost when supporting such high-level predictions with GCN-like models. In this work, we propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories, where a directed fidelity trajectory refers to a detour-free path on a directed graph from the initial node to a candidate location. This planning strategy leads to an efficient model while achieving strong performance. Specifically, we introduce a directed graph to represent the explored area of the environment, with directionality made explicit. We then define the trajectory representation as a sequence of directed-edge features, each extracted from the panorama according to the corresponding orientation. Finally, during navigation we assess and compare the alignment between the instruction and the candidate trajectories to determine the next navigation target. Our method outperforms the previous state-of-the-art method BEVBert on the RxR dataset and is comparable on the R2R dataset, while substantially reducing computational cost. Code is available at https://github.com/iSEE-Laboratory/VLN-PRET.
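To make the planning strategy concrete, the sketch below illustrates the general idea of scoring candidate trajectories (each a sequence of directed-edge features) against an encoded instruction and selecting the best-aligned target. This is a minimal illustrative example, not the authors' implementation: the `TrajectoryScorer` module, the cross-attention scoring, and all dimensions and names are hypothetical stand-ins for the components described in the abstract.

```python
# Minimal sketch (hypothetical, not the paper's code): score instruction-trajectory
# alignment and pick the next navigation target from a set of candidate
# detour-free trajectories on a directed graph.
import torch
import torch.nn as nn


class TrajectoryScorer(nn.Module):
    """Scores how well one trajectory, given as a sequence of directed-edge
    features, aligns with an encoded instruction (assumed cross-attention design)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, instr_tokens, edge_feats):
        # instr_tokens: (1, L, dim) encoded instruction tokens
        # edge_feats:   (1, T, dim) directed-edge features along one trajectory
        fused, _ = self.cross_attn(edge_feats, instr_tokens, instr_tokens)
        return self.score_head(fused.mean(dim=1)).squeeze()  # scalar alignment score


def plan_next_target(instr_tokens, trajectories, scorer):
    """Pick the candidate node whose trajectory from the start node best aligns
    with the instruction."""
    scores = torch.stack([scorer(instr_tokens, t["edge_feats"]) for t in trajectories])
    best = int(scores.argmax())
    return trajectories[best]["target_node"], scores


if __name__ == "__main__":
    torch.manual_seed(0)
    scorer = TrajectoryScorer()
    instr = torch.randn(1, 20, 256)  # placeholder for an encoded instruction
    # Two candidate trajectories of different lengths; in the paper's setting the
    # edge features would be extracted from panoramas based on edge orientation.
    trajs = [
        {"target_node": "node_A", "edge_feats": torch.randn(1, 3, 256)},
        {"target_node": "node_B", "edge_feats": torch.randn(1, 5, 256)},
    ]
    target, scores = plan_next_target(instr, trajs, scorer)
    print(target, scores.tolist())
```

Because each candidate is scored independently against the instruction, this kind of planner avoids running a GCN-like model over the whole map at every step, which is consistent with the efficiency claim in the abstract.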
References
An, D., Qi, Y., Huang, Y., Wu, Q., Wang, L., Tan, T.: Neighbor-view enhanced model for vision and language navigation. In: ACM MM (2021)
An, D., et al.: BEVBert: multimodal map pre-training for language-guided navigation. In: ICCV (2023)
Anderson, P., et al.: On evaluation of embodied navigation agents. CoRR (2018)
Anderson, P., Wu, Q., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)
Chaplot, D.S., Gandhi, D., Gupta, A., Salakhutdinov, R.: Object goal navigation using goal-oriented semantic exploration. In: NeurIPS (2020)
Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2020)
Chen, S., Guhur, P.L., Schmid, C., Laptev, I.: History aware multimodal transformer for vision-and-language navigation. In: NeurIPS (2021)
Chen, S., Guhur, P., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: dual-scale graph transformer for vision-and-language navigation. In: CVPR (2022)
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: ACL (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)
Dou, Z., Gao, F., Peng, N.: Masked path modeling for vision-and-language navigation. CoRR (2023)
Fried, D., et al.: Speaker-follower models for vision-and-language navigation. In: NeurIPS (2018)
Fuentes-Pacheco, J., Ascencio, J.R., Rendón-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. (2015)
Gao, C., et al.: Adaptive zone-aware hierarchical planner for vision-language navigation. In: CVPR (2023)
Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: CVPR (2020)
Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: VLN BERT: a recurrent vision-and-language BERT for navigation. In: CVPR, pp. 1643–1653 (2021)
Huo, J., Sun, Q., Jiang, B., Lin, H., Fu, Y.: GeoVLN: learning geometry-enhanced visual representation with slot attention for vision-and-language navigation. In: CVPR (2023)
Hwang, M., Jeong, J., Kim, M., Oh, Y., Oh, S.: Meta-explore: exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In: CVPR (2023)
Ilharco, G., Jain, V., Ku, A., Ie, E., Baldridge, J.: General evaluation for instruction conditioned navigation using dynamic time warping. In: NeurIPS Workshop (2019)
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: ICML (2021)
Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: EMNLP (2020)
Li, J., Tan, H., Bansal, M.: EnvEdit: environment editing for vision-and-language navigation. In: CVPR (2022)
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)
Li, X., et al.: Robust navigation with language pretraining and stochastic sampling. In: EMNLP-IJCNLP (2019)
Lin, C., Jiang, Y., Cai, J., Qu, L., Haffari, G., Yuan, Z.: Multimodal transformer with variable-length memory for vision-and-language navigation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13696, pp. 380–397. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20059-5_22
Liu, C., Zhu, F., Chang, X., Liang, X., Ge, Z., Shen, Y.: Vision-language navigation with random environmental mixup. In: ICCV (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Ma, C.Y., et al.: Self-monitoring navigation agent via auxiliary progress estimation. In: ICLR (2019)
Ma, C., Wu, Z., AlRegib, G., Xiong, C., Kira, Z.: The regretful agent: heuristic-aided navigation through progress estimation. In: CVPR (2019)
Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 259–274. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_16
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. CoRR (2023)
Pope, R., et al.: Efficiently scaling transformer inference. CoRR (2022)
Qi, Y., et al.: REVERIE: remote embodied visual referring expression in real indoor environments. In: CVPR (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: ACL (2016)
Shen, S., et al.: How much can CLIP benefit vision-and-language tasks? In: ICLR (2022)
Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: NAACL (2019)
Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: CoRL (2020)
Thrun, S.: Learning metric-topological maps for indoor mobile robot navigation. Artif. Intell. (1998)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, H., Wang, W., Liang, W., Xiong, C., Shen, J.: Structured scene memory for vision-language navigation. In: CVPR (2021)
Wang, S., et al.: Less is more: generating grounded navigation instructions from landmarks. In: CVPR (2022)
Wang, X., et al.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: CVPR (2019)
Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: GridMM: grid memory map for vision-and-language navigation. In: ICCV (2023)
Zhu, F., Zhu, Y., Chang, X., Liang, X.: Vision-language navigation with self-supervised auxiliary reasoning tasks. In: CVPR (2020)
Acknowledgments
This work was supported partially by the National Key Research and Development Program of China (2023YFA1008503), NSFC (U21A20471, 62206315), Guangdong NSF Project (No. 2023B1515040025, 2020B1515120085, 2024A1515-010101), Guangzhou Basic and Applied Basic Research Scheme (2024A04J4067).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lu, R., Meng, J., Zheng, WS. (2025). PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_5