LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15132)

Included in the following conference series: European Conference on Computer Vision


Abstract

The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align (LHRS stands for ‘Language Helps Remote Sensing’), and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs’ abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain. Data, code, and model are available at https://github.com/NJU-LHRS/LHRS-Bot.
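The abstract describes building LHRS-Align by pairing volunteered geographic information (notably OpenStreetMap attributes) with globally available RS images. The sketch below is purely illustrative and is not the authors' pipeline: every identifier in it (OSM_TAG_TEMPLATES, make_caption, build_pairs, ImageTextPair) is hypothetical, and the dataset construction described in the paper is considerably richer.

# Illustrative sketch only: turning OSM map-feature tags found inside an RS image
# tile's footprint into a short caption, yielding image-text pairs.
# All identifiers are hypothetical and not part of the LHRS-Bot code base.
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from OSM tags to caption fragments.
OSM_TAG_TEMPLATES: Dict[str, str] = {
    "landuse=residential": "a residential area",
    "landuse=farmland": "farmland",
    "natural=water": "a water body",
    "aeroway=runway": "an airport runway",
}

@dataclass
class ImageTextPair:
    image_path: str  # path to the RS image tile
    caption: str     # caption generated from the VGI tags

def make_caption(tags: List[str]) -> str:
    """Compose a caption from the OSM tags located within a tile."""
    fragments = [OSM_TAG_TEMPLATES[t] for t in tags if t in OSM_TAG_TEMPLATES]
    if not fragments:
        return "A remote sensing scene."
    return "A remote sensing image containing " + ", ".join(fragments) + "."

def build_pairs(tile_tags: Dict[str, List[str]]) -> List[ImageTextPair]:
    """Turn a mapping {tile_path: [osm_tags]} into image-text pairs."""
    return [ImageTextPair(path, make_caption(tags)) for path, tags in tile_tags.items()]

if __name__ == "__main__":
    demo = {"tiles/000123.png": ["landuse=residential", "natural=water"]}
    for pair in build_pairs(demo):
        print(pair.image_path, "->", pair.caption)

Run as a script, the hypothetical demo tile yields the caption "A remote sensing image containing a residential area, a water body."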

D. Muhtar and Z. Li—Equal contribution, listed in random order.


Notes

  1. https://www.openstreetmap.org/.

  2. https://www.google.com/earth/.

  3. https://wiki.openstreetmap.org/wiki/Map_features.

  4. However, we have no means of confirming whether these data appeared in the LHRS-Align dataset, considering that images in LHRS-Align, along with several of the classification datasets (AID, WHU-RS19, and SIRI-WHU), are collected from Google Earth.


Acknowledgement

This research was supported by the National Natural Science Foundation of China under Grant 42071297, and in part by the AI & AI for Science Project of Nanjing University under Grant 02091480605203. We are grateful to the High Performance Computing Center of Nanjing University for its help with GPU resources. We would also like to thank the anonymous reviewers for their constructive comments.

Author information

Corresponding author

Correspondence to Xueliang Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12217 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Muhtar, D., Li, Z., Gu, F., Zhang, X., Xiao, P. (2025). LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_26

  • DOI: https://doi.org/10.1007/978-3-031-72904-1_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72903-4

  • Online ISBN: 978-3-031-72904-1

  • eBook Packages: Computer Science, Computer Science (R0)
