Abstract
We propose a new framework that automatically generates high-quality segmentation masks paired with referring expressions as pseudo supervision for referring image segmentation (RIS). This pseudo supervision allows any supervised RIS method to be trained without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, naïvely combining these models may produce expressions that do not distinctively refer to the target masks. To address this challenge, we propose two strategies for generating distinctive captions: 1) ‘distinctive caption sampling’, a new decoding method for the captioning model that generates multiple expression candidates with detailed words focusing on the target, and 2) ‘distinctiveness-based text filtering’, which further validates the candidates and filters out those with a low level of distinctiveness. These two strategies ensure that the generated text supervision distinguishes the target from other objects, making it suitable as RIS annotation. Our method significantly outperforms both weakly supervised and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, demonstrating its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
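To make the second strategy concrete, below is a minimal, hypothetical sketch of distinctiveness-based text filtering: each candidate caption is scored by how much more similar it is (e.g., under CLIP-style image-text similarity) to the target region than to any other region in the image, and candidates whose margin falls below a threshold are discarded. The function names, the margin threshold, and the toy similarity values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def distinctiveness_score(sims: np.ndarray, target_idx: int) -> float:
    """Margin between a caption's similarity to the target region and
    its highest similarity to any other region in the same image."""
    others = np.delete(sims, target_idx)
    return float(sims[target_idx] - others.max())

def filter_captions(candidates, sims_per_caption, target_idx, margin=0.05):
    """Keep only candidate captions whose distinctiveness margin is large enough."""
    return [
        cap
        for cap, sims in zip(candidates, sims_per_caption)
        if distinctiveness_score(np.asarray(sims), target_idx) >= margin
    ]

# Toy usage: three candidate captions for region 0, each with (hypothetical)
# CLIP-style similarities to the three regions detected in the image.
captions = ["man in red jacket", "a person", "left man holding a cup"]
similarities = [[0.31, 0.18, 0.20],   # distinctive: margin 0.11
                [0.25, 0.24, 0.26],   # ambiguous: negative margin, filtered out
                [0.33, 0.17, 0.15]]   # distinctive: margin 0.16
print(filter_captions(captions, similarities, target_idx=0))
# -> ['man in red jacket', 'left man holding a cup']
```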
Acknowledgements
This work was supported by the IITP grants (No. 2019-0-01842 (3%), No. 2021-0-02068 (2%), No. 2022-0-00926 (50%), No. 2024-2020-0-01819 (5%)) funded by MSIT, the Culture, Sports and Tourism R&D Program grant of KOCCA (RS-2024-00345025 (5%)) funded by MCST, and the GIST-MIT Research Collaboration grant (35%) funded by GIST, Korea.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yu, S., Seo, P.H., Son, J. (2025). Pseudo-RIS: Distinctive Pseudo-Supervision Generation for Referring Image Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15126. Springer, Cham. https://doi.org/10.1007/978-3-031-73113-6_2
DOI: https://doi.org/10.1007/978-3-031-73113-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73112-9
Online ISBN: 978-3-031-73113-6
eBook Packages: Computer Science, Computer Science (R0)