Pseudo-RIS: Distinctive Pseudo-Supervision Generation for Referring Image Segmentation

  • Conference paper
  • Part of the proceedings: Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15126)

Abstract

We propose a new framework that automatically generates high-quality segmentation masks paired with referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS method without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, naïvely combining these models may produce expressions that do not distinctively refer to the target masks. To address this challenge, we propose a two-fold strategy for generating distinctive captions: 1) ‘distinctive caption sampling’, a new decoding method for the captioning model that generates multiple expression candidates with detailed words focusing on the target, and 2) ‘distinctiveness-based text filtering’ that further validates the candidates and filters out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them suitable as RIS annotations. Our method significantly outperforms both weakly supervised and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, demonstrating its capability to tackle the open-world challenge in RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
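The abstract describes the second strategy only at a high level, so the sketch below illustrates one plausible reading of ‘distinctiveness-based text filtering’: a candidate caption is kept for a target mask only if it matches that mask’s crop more strongly than any other object in the image. CLIP is used here purely as an assumed off-the-shelf image-text scorer, and `mask_crop`, `clip_scores`, `filter_distinctive`, and the `margin` threshold are hypothetical names introduced for illustration; the paper’s actual scoring model and filtering rule may differ.

```python
# Minimal sketch of distinctiveness-based text filtering (not the authors' code).
# Assumption: CLIP similarity between an object crop and a caption is a usable
# proxy for how well the caption refers to that object.
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def mask_crop(image: Image.Image, mask: np.ndarray) -> Image.Image:
    """Crop the image to the bounding box of a binary mask (illustrative helper)."""
    ys, xs = np.nonzero(mask)
    return image.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))


@torch.no_grad()
def clip_scores(crops, captions):
    """Cosine similarity between every object crop and every candidate caption."""
    imgs = torch.stack([preprocess(c) for c in crops]).to(device)
    txts = clip.tokenize(captions, truncate=True).to(device)
    img_f = model.encode_image(imgs).float()
    txt_f = model.encode_text(txts).float()
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return img_f @ txt_f.T  # shape: (num_crops, num_captions)


def filter_distinctive(image, masks, candidates, margin=0.0):
    """For each mask, keep only captions that score higher on that mask than on
    every competing mask by at least `margin` (a hypothetical threshold)."""
    crops = [mask_crop(image, m) for m in masks]
    kept = {}
    for i, captions in enumerate(candidates):
        if not captions or len(masks) < 2:
            kept[i] = list(captions)
            continue
        s = clip_scores(crops, captions)               # (num_masks, num_candidates)
        others = torch.cat([s[:i], s[i + 1:]], dim=0)  # scores on the other objects
        distinct = s[i] > others.max(dim=0).values + margin
        kept[i] = [c for c, ok in zip(captions, distinct.tolist()) if ok]
    return kept
```

Under this reading, a generic caption such as “a dog” would be rejected when two dogs appear in the image, while a more specific candidate such as “the dog on the left wearing a red collar” could survive the filter.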

Acknowledgements

This work was supported by IITP grants (No. 2019-0-01842 (3%), No. 2021-0-02068 (2%), No. 2022-0-00926 (50%), and No. 2024-2020-0-01819 (5%)) funded by MSIT, a Culture, Sports and Tourism R&D Program grant of KOCCA (RS-2024-00345025 (5%)) funded by MCST, and the GIST-MIT Research Collaboration grant (35%) funded by GIST, Korea.

Author information

Corresponding authors

Correspondence to Paul Hongsuck Seo or Jeany Son.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2641 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yu, S., Seo, P.H., Son, J. (2025). Pseudo-RIS: Distinctive Pseudo-Supervision Generation for Referring Image Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15126. Springer, Cham. https://doi.org/10.1007/978-3-031-73113-6_2

  • DOI: https://doi.org/10.1007/978-3-031-73113-6_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73112-9

  • Online ISBN: 978-3-031-73113-6

  • eBook Packages: Computer Science, Computer Science (R0)
