Abstract
The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align (LHRS stands for 'Language Helps Remote Sensing'), and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs' abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain. Data, code, and model are available at https://github.com/NJU-LHRS/LHRS-Bot.
D. Muhtar and Z. Li—Equal contribution, listed in random order.
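The abstract mentions a multi-level vision-language alignment strategy and a curriculum learning method. As a rough illustration of what curriculum-style multimodal training can look like in practice, and not the paper's actual recipe, the Python sketch below stages training from a large, easier caption-alignment corpus to a smaller, harder instruction-following corpus, unfreezing more modules at the later stage. The mllm object, its vision_projector and llm attributes, the dataset names, and all hyperparameters are hypothetical placeholders introduced only for this sketch.

# Illustrative sketch only: a two-stage curriculum for vision-language training.
# Stage 1 (assumed): align image features with captions from an LHRS-Align-style corpus.
# Stage 2 (assumed): fine-tune on LHRS-Instruct-style instruction dialogues.
import torch
from torch.utils.data import DataLoader

def run_stage(model, dataset, trainable_modules, lr, epochs):
    """Train only the listed sub-modules; keep everything else frozen."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True
            params.append(p)
    optimizer = torch.optim.AdamW(params, lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    for _ in range(epochs):
        for images, text_tokens in loader:
            # model is assumed to return the autoregressive language-modeling loss
            loss = model(images, text_tokens)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Curriculum: large-scale, easy caption alignment first (train only the projector),
# then smaller-scale, harder instruction data with more components unfrozen.
# run_stage(mllm, lhrs_align_captions, [mllm.vision_projector], lr=1e-3, epochs=1)
# run_stage(mllm, lhrs_instruct_dialogs, [mllm.vision_projector, mllm.llm], lr=2e-5, epochs=2)

The staged unfreezing shown here is one common way to realize a curriculum for MLLMs; the paper's exact stages, losses, and schedules are described in the full text.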
Notes
However, we have no means of confirming whether these data appeared in the LHRS-Align dataset, considering that images in LHRS-Align, along with several of the classification datasets (AID, WHU-RS19, and SIRI-WHU), are collected from Google Earth.
References
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems 35, pp. 23716–23736 (2022)
Bai, J., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
Bai, J., et al.: Qwen-VL: a frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)
Bai, S., et al.: TouchStone: evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890 (2023)
Bashmal, L., Bazi, Y., Melgani, F., Al Rahhal, M.M., Al Zuair, M.A.: Language integration in remote sensing: tasks, datasets, and future directions. IEEE Geosci. Remote Sens. Mag. 11(4), 63–93 (2023)
Bitton, Y., et al.: Visit-bench: a benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595 (2023)
Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: your ViT but faster. arXiv preprint arXiv:2210.09461 (2022)
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33, pp. 1877–1901 (2020)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105(10), 1865–1883 (2017)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023). https://vicuna.lmsys.org. Accessed 14 Apr 2023
Chowdhery, A., et al.: PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24(240), 1–113 (2023)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Dai, D., Yang, W.: Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 8(1), 173–176 (2011)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 (2023)
Driess, D., et al.: PaLM-E: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
Fu, C., et al.: MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
Ghiasi, A., et al.: What do vision transformers learn? A visual exploration. arXiv preprint arXiv:2212.06727 (2022)
Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 12(7), 2217–2226 (2019)
Hossain, M.D., Chen, D.: Segmentation for object-based image analysis (OBIA): a review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote. Sens. 150, 115–134 (2019)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Hu, Y., Yuan, J., Wen, C., Lu, X., Li, X.: RSGPT: a remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266 (2023)
Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)
Jiang, A.Q., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
Ju, C., et al.: Turbo: informativity-driven acceleration plug-in for vision-language models. arXiv preprint arXiv:2312.07408 (2023)
Kuckreja, K., Danish, M.S., Naseer, M., Das, A., Khan, S., Khan, F.S.: GeoChat: grounded large vision-language model for remote sensing. arXiv preprint arXiv:2311.15826 (2023)
Li, B., et al.: SEED-Bench-2: benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092 (2023)
Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)
Li, C., et al.: Multimodal foundation models: from specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 (2023)
Li, C., et al.: LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Li, M., Wang, S., Zhang, Q.: Visualizing the emergence of intermediate visual patterns in DNNs. In: Advances in Neural Information Processing Systems 34, pp. 6594–6607 (2021)
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
Lobry, S., Marcos, D., Murray, J., Tuia, D.: RSVQA: visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 58(12), 8555–8566 (2020)
Ma, Y., Cao, Y., Sun, J., Pavone, M., Xiao, C.: Dolphins: multimodal language model for driving. arXiv preprint arXiv:2312.00438 (2023)
Muhtar, D., Zhang, X., Xiao, P., Li, Z., Gu, F.: CMID: a unified self-supervised learning framework for remote sensing image understanding. IEEE Trans. Geosci. Remote Sens. 61, 1–17 (2023)
OpenAI: GPT-4 technical report (2023)
Park, N., Kim, W., Heo, B., Kim, T., Yun, S.: What do self-supervised vision transformers learn? arXiv preprint arXiv:2305.00729 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ratledge, N., Cadamuro, G., de la Cuesta, B., Stigler, M., Burke, M.: Using machine learning to assess the livelihood impact of electricity access. Nature 611(7936), 491–495 (2022)
Reed, C.J., et al.: Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Saygin Seyfioglu, M., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt-LLaVA: visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv e-prints, arXiv:2312 (2023)
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: Advances in Neural Information Processing Systems 35, pp. 25278–25294 (2022)
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.: Visual grounding in remote sensing images. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 404–412 (2022)
Team, M.N.: Introducing MPT-7B: a new standard for open-source, commercially usable LLMs (2023). www.mosaicml.com/blog/mpt-7b. Accessed 05 May 2023
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Touvron, H., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Wang, D., et al.: SAMRS: scaling-up remote sensing segmentation dataset with segment anything model. In: Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023)
Wang, Z., Prabha, R., Huang, T., Wu, J., Rajagopal, R.: SkyScript: a large and semantically diverse vision-language dataset for remote sensing. arXiv preprint arXiv:2312.12856 (2023)
Wen, C., Hu, Y., Li, X., Yuan, Z., Zhu, X.X.: Vision-language models in remote sensing: current progress and future trends. arXiv preprint arXiv:2305.05726 (2023)
Xia, G.S., et al.: DOTA: a large-scale dataset for object detection in aerial images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974–3983 (2018)
Xia, G.S., et al.: AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 55(7), 3965–3981 (2017)
Xu, H., et al.: Demystifying CLIP data. arXiv preprint arXiv:2309.16671 (2023)
Yang, J., et al.: The role of satellite remote sensing in climate change studies. Nat. Clim. Change 3(10), 875–883 (2013)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
Yu, W., et al.: MM-Vet: evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)
Yuan, Z., et al.: Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv preprint arXiv:2204.09868 (2022)
Zhan, Y., Xiong, Z., Yuan, Y.: RSVG: exploring data and models for visual grounding on remote sensing data. IEEE Trans. Geosci. Remote Sens. 61, 1–13 (2023)
Zhan, Y., Xiong, Z., Yuan, Y.: SkyEyeGPT: unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv preprint arXiv:2401.09712 (2024)
Zhang, P., et al.: InternLM-XComposer: a vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)
Zhang, Z., Zhao, T., Guo, Y., Yin, J.: RS5M: a large scale vision-language dataset for remote sensing vision-language foundation model. arXiv preprint arXiv:2306.11300 (2023)
Zhu, B., et al.: METER-ML: a multi-sensor earth observation benchmark for automated methane source mapping. arXiv preprint arXiv:2207.11166 (2022)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Zhu, Q., Zhong, Y., Zhao, B., Xia, G.S., Zhang, L.: Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 13(6), 747–751 (2016)
Acknowledgement
This research was supported by the National Natural Science Foundation of China under Grant 42071297, and in part by the AI & AI for Science Project of Nanjing University under Grant 02091480605203. We are grateful to the High Performance Computing Center of Nanjing University for its support with GPU resources. We would also like to thank the anonymous reviewers for their constructive comments.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Muhtar, D., Li, Z., Gu, F., Zhang, X., Xiao, P. (2025). LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_26
DOI: https://doi.org/10.1007/978-3-031-72904-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1