PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The emergence of Large Multimodal Models (LMMs) has unlocked remarkable potential in AI, particularly in pathology. However, the lack of a specialized, high-quality benchmark has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for LMMs. It comprises 33,428 multimodal multiple-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU leverages GPT-4V’s advanced capabilities, utilizing over 30,000 image-caption pairs to enrich the descriptive quality of captions and to generate corresponding Q&As in a cascading process. To maximize PathMMU’s authority, we invite seven pathologists to scrutinize each question in PathMMU’s validation and test sets under strict standards, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-source and 4 closed-source LMMs and of their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves only 49.8% zero-shot accuracy, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, substantially smaller open-source LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.
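
To make the evaluation protocol concrete, below is a minimal sketch (not the authors' released code) of zero-shot multiple-choice scoring in the style the abstract describes: the model sees a pathology image plus a question with lettered options and must answer with a single letter, and accuracy is averaged over the benchmark. The `query_model` callable, the sample fields, and the letter-parsing heuristic are illustrative assumptions, not part of PathMMU itself.

```python
# Hypothetical zero-shot MCQ evaluation harness (illustration only).
import re
from typing import Callable, Dict, List

def build_prompt(question: str, options: List[str]) -> str:
    """Format a multiple-choice prompt with lettered options."""
    letters = "ABCDEFGH"
    lines = [question] + [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)

def parse_choice(response: str) -> str:
    """Extract the first standalone option letter from a free-form response."""
    match = re.search(r"\b([A-H])\b", response.upper())
    return match.group(1) if match else ""

def evaluate(samples: List[Dict], query_model: Callable[[str, str], str]) -> float:
    """Zero-shot accuracy over samples with image_path/question/options/answer."""
    correct = 0
    for s in samples:
        prompt = build_prompt(s["question"], s["options"])
        response = query_model(s["image_path"], prompt)  # LMM call (stubbed here)
        correct += parse_choice(response) == s["answer"]
    return correct / len(samples)

if __name__ == "__main__":
    # Toy run with a stub model that always picks option A.
    samples = [
        {"image_path": "patch_001.png",
         "question": "Which tissue type is shown?",
         "options": ["Colon adenocarcinoma", "Normal lung", "Skin melanoma"],
         "answer": "A"},
    ]
    stub = lambda image_path, prompt: "(A) Colon adenocarcinoma"
    print(f"accuracy = {evaluate(samples, stub):.1%}")
```

A real evaluation would swap the stub for an actual LMM API call and iterate over the benchmark's test split; the parsing heuristic matters in practice because LMMs often reply in free-form text rather than a bare letter.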

Acknowledgements

This study was partially supported by the National Natural Science Foundation of China (Grant No. 92270108), the Zhejiang Provincial Natural Science Foundation of China (Grant No. XHD23F0201), the Research Center for Industries of the Future (RCIF) at Westlake University, and the Westlake Education Foundation.

Author information

Corresponding authors

Correspondence to Tao Lin or Lin Yang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2873 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, Y. et al. (2025). PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_4

  • DOI: https://doi.org/10.1007/978-3-031-73033-7_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73032-0

  • Online ISBN: 978-3-031-73033-7

  • eBook Packages: Computer Science, Computer Science (R0)
