Abstract
The emergence of Large Multimodal Models (LMMs) has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmarks has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for LMMs. It comprises 33,428 multimodal multiple-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU leverages GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich the descriptive quality of captions and generate corresponding Q&As in a cascading process. To maximize PathMMU's authority, we invite seven pathologists to scrutinize each question in PathMMU's validation and test sets under strict standards, thereby also establishing an expert-level performance baseline for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-source and 4 closed-source LMMs and of their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves only 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, substantially smaller open-source LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.
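To illustrate the zero-shot multiple-choice evaluation protocol the abstract describes, the following is a minimal sketch in Python. The dataset fields, the `model.generate(image, prompt)` interface, and the answer-parsing rule are assumptions for illustration only and do not reflect the authors' actual evaluation harness.

```python
# Minimal sketch of zero-shot multiple-choice evaluation on a PathMMU-style item.
# Field names, the model wrapper, and the answer-parsing rule are hypothetical.
import re
import string


def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its candidate options as one multiple-choice prompt."""
    letters = string.ascii_uppercase
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)


def parse_choice(response: str, num_options: int) -> str | None:
    """Extract the first valid option letter (A, B, C, ...) from the model response."""
    valid = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.strip().upper())
    return match.group(1) if match else None


def evaluate(model, items: list[dict]) -> float:
    """Compute zero-shot accuracy; `model.generate` is a hypothetical LMM call."""
    correct = 0
    for item in items:
        prompt = build_prompt(item["question"], item["options"])
        response = model.generate(item["image"], prompt)  # hypothetical API
        pred = parse_choice(response, len(item["options"]))
        correct += int(pred == item["answer"])
    return correct / len(items)
```

In practice, each evaluated LMM would need its own thin wrapper exposing such a `generate` method, and accuracy would be reported per subset (e.g., validation and test splits) rather than over a single pooled list.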
Acknowledgements
This study was partially supported by the National Natural Science Foundation of China (Grant No. 92270108), the Zhejiang Provincial Natural Science Foundation of China (Grant No. XHD23F0201), the Research Center for Industries of the Future (RCIF) at Westlake University, and the Westlake Education Foundation.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, Y. et al. (2025). PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_4
DOI: https://doi.org/10.1007/978-3-031-73033-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73032-0
Online ISBN: 978-3-031-73033-7
eBook Packages: Computer Science (R0)