Abstract
The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align (LHRS stands for 'Language Helps Remote Sensing'), and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs' abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain. Data, code, and model are available at https://github.com/NJU-LHRS/LHRS-Bot.
D. Muhtar and Z. Li—Equal contribution, listed in random order.
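The abstract mentions a multi-level vision-language alignment strategy and a curriculum learning method. As a rough illustration of what curriculum-style multimodal training can look like in practice, and not the paper's actual recipe, the Python sketch below stages training from a large, easier caption-alignment corpus to a smaller, harder instruction-following corpus, unfreezing more modules at the later stage. The mllm object, its vision_projector and llm attributes, the dataset names, and all hyperparameters are hypothetical placeholders introduced only for this sketch.

# Illustrative sketch only: a two-stage curriculum for vision-language training.
# Stage 1 (assumed): align image features with captions from an LHRS-Align-style corpus.
# Stage 2 (assumed): fine-tune on LHRS-Instruct-style instruction dialogues.
import torch
from torch.utils.data import DataLoader

def run_stage(model, dataset, trainable_modules, lr, epochs):
    """Train only the listed sub-modules; keep everything else frozen."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True
            params.append(p)
    optimizer = torch.optim.AdamW(params, lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    for _ in range(epochs):
        for images, text_tokens in loader:
            # model is assumed to return the autoregressive language-modeling loss
            loss = model(images, text_tokens)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Curriculum: large-scale, easy caption alignment first (train only the projector),
# then smaller-scale, harder instruction data with more components unfrozen.
# run_stage(mllm, lhrs_align_captions, [mllm.vision_projector], lr=1e-3, epochs=1)
# run_stage(mllm, lhrs_instruct_dialogs, [mllm.vision_projector, mllm.llm], lr=2e-5, epochs=2)

The staged unfreezing shown here is one common way to realize a curriculum for MLLMs; the paper's exact stages, losses, and schedules are described in the full text.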
Notes
However, we have no means of confirming whether these data appeared in the LHRS-Align dataset, considering that images in LHRS-Align, along with several of the classification datasets (AID, WHU-RS19, and SIRI-WHU), are collected from Google Earth.
References
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems 35, pp. 23716–23736 (2022)
Bai, J., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
Bai, J., et al.: Qwen-VL: a frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)
Bai, S., et al.: TouchStone: evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890 (2023)
Bashmal, L., Bazi, Y., Melgani, F., Al Rahhal, M.M., Al Zuair, M.A.: Language integration in remote sensing: tasks, datasets, and future directions. IEEE Geosci. Remote Sens. Mag. 11(4), 63–93 (2023)
Bitton, Y., et al.: Visit-bench: a benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595 (2023)
Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: your ViT but faster. arXiv preprint arXiv:2210.09461 (2022)
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33, pp. 1877–1901 (2020)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105(10), 1865–1883 (2017)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023). https://vicuna.lmsys.org. Accessed 14 Apr 2023
Chowdhery, A., et al.: PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24(240), 1–113 (2023)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Dai, D., Yang, W.: Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 8(1), 173–176 (2011)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 (2023)
Driess, D., et al.: PaLM-E: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
Fu, C., et al.: MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
Ghiasi, A., et al.: What do vision transformers learn? A visual exploration. arXiv preprint arXiv:2212.06727 (2022)
Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 12(7), 2217–2226 (2019)
Hossain, M.D., Chen, D.: Segmentation for object-based image analysis (OBIA): a review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote. Sens. 150, 115–134 (2019)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Hu, Y., Yuan, J., Wen, C., Lu, X., Li, X.: RSGPT: a remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266 (2023)
Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)
Jiang, A.Q., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
Ju, C., et al.: Turbo: informativity-driven acceleration plug-in for vision-language models. arXiv preprint arXiv:2312.07408 (2023)
Kuckreja, K., Danish, M.S., Naseer, M., Das, A., Khan, S., Khan, F.S.: GeoChat: grounded large vision-language model for remote sensing. arXiv preprint arXiv:2311.15826 (2023)
Li, B., et al.: SEED-Bench-2: benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092 (2023)
Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)
Li, C., et al.: Multimodal foundation models: from specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 (2023)
Li, C., et al.: LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Li, M., Wang, S., Zhang, Q.: Visualizing the emergence of intermediate visual patterns in DNNs. In: Advances in Neural Information Processing Systems 34, pp. 6594–6607 (2021)
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
Lobry, S., Marcos, D., Murray, J., Tuia, D.: RSVQA: visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 58(12), 8555–8566 (2020)
Ma, Y., Cao, Y., Sun, J., Pavone, M., Xiao, C.: Dolphins: multimodal language model for driving. arXiv preprint arXiv:2312.00438 (2023)
Muhtar, D., Zhang, X., Xiao, P., Li, Z., Gu, F.: CMID: a unified self-supervised learning framework for remote sensing image understanding. IEEE Trans. Geosci. Remote Sens. 61, 1–17 (2023)
OpenAI: GPT-4 technical report (2023)
Park, N., Kim, W., Heo, B., Kim, T., Yun, S.: What do self-supervised vision transformers learn? arXiv preprint arXiv:2305.00729 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ratledge, N., Cadamuro, G., de la Cuesta, B., Stigler, M., Burke, M.: Using machine learning to assess the livelihood impact of electricity access. Nature 611(7936), 491–495 (2022)
Reed, C.J., et al.: Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Saygin Seyfioglu, M., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt-LLaVA: visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv e-prints, arXiv:2312 (2023)
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: Advances in Neural Information Processing Systems 35, pp. 25278–25294 (2022)
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.: Visual grounding in remote sensing images. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 404–412 (2022)
Team, M.N.: Introducing MPT-7B: a new standard for open-source, commercially usable LLMs (2023). www.mosaicml.com/blog/mpt-7b. Accessed 05 May 2023
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Touvron, H., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Wang, D., et al.: SAMRS: scaling-up remote sensing segmentation dataset with segment anything model. In: Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023)
Wang, Z., Prabha, R., Huang, T., Wu, J., Rajagopal, R.: SkyScript: a large and semantically diverse vision-language dataset for remote sensing. arXiv preprint arXiv:2312.12856 (2023)
Wen, C., Hu, Y., Li, X., Yuan, Z., Zhu, X.X.: Vision-language models in remote sensing: current progress and future trends. arXiv preprint arXiv:2305.05726 (2023)
Xia, G.S., et al.: DOTA: a large-scale dataset for object detection in aerial images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974–3983 (2018)
Xia, G.S., et al.: AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 55(7), 3965–3981 (2017)
Xu, H., et al.: Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023)
Yang, J., et al.: The role of satellite remote sensing in climate change studies. Nat. Clim. Change 3(10), 875–883 (2013)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
Yu, W., et al.: MM-Vet: evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)
Yuan, Z., et al.: Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv preprint arXiv:2204.09868 (2022)
Zhan, Y., Xiong, Z., Yuan, Y.: RSVG: exploring data and models for visual grounding on remote sensing data. IEEE Trans. Geosci. Remote Sens. 61, 1–13 (2023)
Zhan, Y., Xiong, Z., Yuan, Y.: SkyEyeGPT: unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv preprint arXiv:2401.09712 (2024)
Zhang, P., et al.: InternLM-XComposer: a vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)
Zhang, Z., Zhao, T., Guo, Y., Yin, J.: RS5M: a large scale vision-language dataset for remote sensing vision-language foundation model. arXiv preprint arXiv:2306.11300 (2023)
Zhu, B., et al.: METER-ML: a multi-sensor earth observation benchmark for automated methane source mapping. arXiv preprint arXiv:2207.11166 (2022)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Zhu, Q., Zhong, Y., Zhao, B., Xia, G.S., Zhang, L.: Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 13(6), 747–751 (2016)
Acknowledgement
This research was supported by the National Natural Science Foundation of China under Grant 42071297, and in part by the AI & AI for Science Project of Nanjing University under Grant 02091480605203. We are grateful to the High Performance Computing Center of Nanjing University for its support with GPU resources. We would also like to thank the anonymous reviewers for their constructive comments.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Muhtar, D., Li, Z., Gu, F., Zhang, X., Xiao, P. (2025). LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_26
DOI: https://doi.org/10.1007/978-3-031-72904-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1