Abstract
In this paper, we introduce a method for Domain Generalized Semantic Segmentation (DGSS) that leverages domain-invariant semantic knowledge from the text embeddings of vision-language models. We employ these text embeddings as object queries within a transformer-based segmentation framework (textual object queries) and treat them as a domain-invariant basis for pixel grouping in DGSS. To fully exploit textual object queries, we propose a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. In addition, we suggest three regularization losses that improve the efficacy of tqdm by aligning visual and textual features. With our method, the model can comprehend the inherent semantic information of the classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5\(\rightarrow \)Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
B. Pak, B. Woo and S. Kim—Equal contribution.
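To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how textual object queries could be realized: frozen class-name embeddings from a CLIP-like text encoder initialize the object queries of a mask-transformer-style decoder, which cross-attend to dense visual features and then score every pixel to group it by class. All names here (`TextualQueryDecoder`, `text_embeds`, `embed_dim`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TextualQueryDecoder(nn.Module):
    """Mask-transformer-style decoder whose object queries are
    initialized from frozen text embeddings of class names (a sketch)."""

    def __init__(self, text_embeds: torch.Tensor, embed_dim: int = 256):
        super().__init__()
        # text_embeds: (num_classes, text_dim), e.g. CLIP embeddings of
        # prompts such as "a photo of a {road}". Kept frozen as a buffer.
        self.register_buffer("text_embeds", text_embeds)
        self.proj = nn.Linear(text_embeds.shape[-1], embed_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, pixel_feats: torch.Tensor) -> torch.Tensor:
        # pixel_feats: (B, H*W, embed_dim) dense visual features.
        b = pixel_feats.shape[0]
        queries = self.proj(self.text_embeds).unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to pixel features; each refined query then
        # scores every pixel, yielding one mask logit map per class.
        queries = self.decoder(queries, pixel_feats)
        return torch.einsum("bqc,bnc->bqn", queries, pixel_feats)


# Usage with dummy tensors: 19 Cityscapes-style classes, 512-dim text space.
text_embeds = torch.randn(19, 512)
model = TextualQueryDecoder(text_embeds)
masks = model(torch.randn(2, 64 * 64, 256))  # (2, 19, 4096) mask logits
```

Because the queries derive from text rather than learned free parameters, they carry semantics that do not depend on the visual style of the training domain, which is the property the paper exploits for generalization.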
Acknowledgements
We sincerely thank Chanyong Lee and Eunjin Koh for their constructive discussions and support. We also appreciate Junyoung Kim, Chaehyeon Lim and Minkyu Song for providing insightful feedback. This work was supported by the Agency for Defense Development (ADD) grant funded by the Korea government (279002001).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pak, B., Woo, B., Kim, S., Kim, D.H., Kim, H. (2025). Textual Query-Driven Mask Transformer for Domain Generalized Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15115. Springer, Cham. https://doi.org/10.1007/978-3-031-72998-0_3
DOI: https://doi.org/10.1007/978-3-031-72998-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72997-3
Online ISBN: 978-3-031-72998-0