Abstract
Vision-and-language navigation requires an agent to navigate according to a natural language instruction. Recent methods predict sub-goals on a constructed topological map at each step to enable long-term action planning. However, they suffer from high computational cost when supporting such high-level predictions with GCN-like models. In this work, we propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories, where a directed fidelity trajectory refers to a detour-free path on a directed graph from the initial node to a candidate location. This planning strategy leads to an efficient model while achieving strong performance. Specifically, we introduce a directed graph to represent the explored area of the environment, with directionality made explicit. We then define the trajectory representation as a sequence of directed-edge features, each extracted from the panorama according to the corresponding orientation. Finally, during navigation we assess and compare the alignment between the instruction and the candidate trajectories to determine the next navigation target. Our method outperforms the previous state-of-the-art method BEVBert on the RxR dataset and is comparable on the R2R dataset, while substantially reducing computational cost. Code is available at https://github.com/iSEE-Laboratory/VLN-PRET.
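To make the planning strategy concrete, the sketch below illustrates the general idea of scoring candidate trajectories (each a sequence of directed-edge features) against an encoded instruction and selecting the best-aligned target. This is a minimal illustrative example, not the authors' implementation: the `TrajectoryScorer` module, the cross-attention scoring, and all dimensions and names are hypothetical stand-ins for the components described in the abstract.

```python
# Minimal sketch (hypothetical, not the paper's code): score instruction-trajectory
# alignment and pick the next navigation target from a set of candidate
# detour-free trajectories on a directed graph.
import torch
import torch.nn as nn


class TrajectoryScorer(nn.Module):
    """Scores how well one trajectory, given as a sequence of directed-edge
    features, aligns with an encoded instruction (assumed cross-attention design)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, instr_tokens, edge_feats):
        # instr_tokens: (1, L, dim) encoded instruction tokens
        # edge_feats:   (1, T, dim) directed-edge features along one trajectory
        fused, _ = self.cross_attn(edge_feats, instr_tokens, instr_tokens)
        return self.score_head(fused.mean(dim=1)).squeeze()  # scalar alignment score


def plan_next_target(instr_tokens, trajectories, scorer):
    """Pick the candidate node whose trajectory from the start node best aligns
    with the instruction."""
    scores = torch.stack([scorer(instr_tokens, t["edge_feats"]) for t in trajectories])
    best = int(scores.argmax())
    return trajectories[best]["target_node"], scores


if __name__ == "__main__":
    torch.manual_seed(0)
    scorer = TrajectoryScorer()
    instr = torch.randn(1, 20, 256)  # placeholder for an encoded instruction
    # Two candidate trajectories of different lengths; in the paper's setting the
    # edge features would be extracted from panoramas based on edge orientation.
    trajs = [
        {"target_node": "node_A", "edge_feats": torch.randn(1, 3, 256)},
        {"target_node": "node_B", "edge_feats": torch.randn(1, 5, 256)},
    ]
    target, scores = plan_next_target(instr, trajs, scorer)
    print(target, scores.tolist())
```

Because each candidate is scored independently against the instruction, this kind of planner avoids running a GCN-like model over the whole map at every step, which is consistent with the efficiency claim in the abstract.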
References
An, D., Qi, Y., Huang, Y., Wu, Q., Wang, L., Tan, T.: Neighbor-view enhanced model for vision and language navigation. In: ACM MM (2021)
An, D., et al.: BEVBert: multimodal map pre-training for language-guided navigation. In: ICCV (2023)
Anderson, P., et al.: On evaluation of embodied navigation agents. CoRR (2018)
Anderson, P., Wu, Q., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)
Chaplot, D.S., Gandhi, D., Gupta, A., Salakhutdinov, R.: Object goal navigation using goal-oriented semantic exploration. In: NeurIPS (2020)
Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2020)
Chen, S., Guhur, P.L., Schmid, C., Laptev, I.: History aware multimodal transformer for vision-and-language navigation. In: NeurIPS (2021)
Chen, S., Guhur, P., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: dual-scale graph transformer for vision-and-language navigation. In: CVPR (2022)
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: ACL (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)
Dou, Z., Gao, F., Peng, N.: Masked path modeling for vision-and-language navigation. CoRR (2023)
Fried, D., et al.: Speaker-follower models for vision-and-language navigation. In: NeurIPS (2018)
Fuentes-Pacheco, J., Ascencio, J.R., Rendón-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. (2015)
Gao, C., et al.: Adaptive zone-aware hierarchical planner for vision-language navigation. In: CVPR (2023)
Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: CVPR (2020)
Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: VLN BERT: a recurrent vision-and-language BERT for navigation. In: CVPR, pp. 1643–1653 (2021)
Huo, J., Sun, Q., Jiang, B., Lin, H., Fu, Y.: GeoVLN: learning geometry-enhanced visual representation with slot attention for vision-and-language navigation. In: CVPR (2023)
Hwang, M., Jeong, J., Kim, M., Oh, Y., Oh, S.: Meta-explore: exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In: CVPR (2023)
Ilharco, G., Jain, V., Ku, A., Ie, E., Baldridge, J.: General evaluation for instruction conditioned navigation using dynamic time warping. In: NeurIPS Workshop (2019)
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: ICML (2021)
Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: EMNLP (2020)
Li, J., Tan, H., Bansal, M.: EnvEdit: environment editing for vision-and-language navigation. In: CVPR (2022)
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)
Li, X., et al.: Robust navigation with language pretraining and stochastic sampling. In: EMNLP-IJCNLP (2019)
Lin, C., Jiang, Y., Cai, J., Qu, L., Haffari, G., Yuan, Z.: Multimodal transformer with variable-length memory for vision-and-language navigation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13696, pp. 380–397. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20059-5_22
Liu, C., Zhu, F., Chang, X., Liang, X., Ge, Z., Shen, Y.: Vision-language navigation with random environmental mixup. In: ICCV (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Ma, C.Y., et al.: Self-monitoring navigation agent via auxiliary progress estimation. In: ICLR (2019)
Ma, C., Wu, Z., AlRegib, G., Xiong, C., Kira, Z.: The regretful agent: heuristic-aided navigation through progress estimation. In: CVPR (2019)
Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 259–274. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_16
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. CoRR (2023)
Pope, R., et al.: Efficiently scaling transformer inference. CoRR (2022)
Qi, Y., et al.: REVERIE: remote embodied visual referring expression in real indoor environments. In: CVPR (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: ACL (2016)
Shen, S., et al.: How much can CLIP benefit vision-and-language tasks? In: ICLR (2022)
Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: NAACL (2019)
Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. In: CoRL (2020)
Thrun, S.: Learning metric-topological maps for indoor mobile robot navigation. Artif. Intell. (1998)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, H., Wang, W., Liang, W., Xiong, C., Shen, J.: Structured scene memory for vision-language navigation. In: CVPR (2021)
Wang, S., et al.: Less is more: generating grounded navigation instructions from landmarks. In: CVPR (2022)
Wang, X., et al.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: CVPR (2019)
Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: GridMM: grid memory map for vision-and-language navigation. In: ICCV (2023)
Zhu, F., Zhu, Y., Chang, X., Liang, X.: Vision-language navigation with self-supervised auxiliary reasoning tasks. In: CVPR (2020)
Acknowledgments
This work was supported partially by the National Key Research and Development Program of China (2023YFA1008503), NSFC (U21A20471, 62206315), Guangdong NSF Project (No. 2023B1515040025, 2020B1515120085, 2024A1515-010101), Guangzhou Basic and Applied Basic Research Scheme (2024A04J4067).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lu, R., Meng, J., Zheng, WS. (2025). PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_5