Abstract
In this paper, we introduce a method for Domain Generalized Semantic Segmentation (DGSS) that leverages domain-invariant semantic knowledge from the text embeddings of vision-language models. We employ these text embeddings as object queries within a transformer-based segmentation framework (textual object queries) and treat them as a domain-invariant basis for pixel grouping in DGSS. To fully exploit textual object queries, we propose a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. In addition, we suggest three regularization losses that improve the efficacy of tqdm by aligning visual and textual features. With our method, the model can comprehend the inherent semantic information of the classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5\(\rightarrow \)Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
B. Pak, B. Woo and S. Kim—Equal contribution.
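To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how textual object queries could be realized: frozen class-name embeddings from a CLIP-like text encoder initialize the object queries of a mask-transformer-style decoder, which cross-attend to dense visual features and then score every pixel to group it by class. All names here (`TextualQueryDecoder`, `text_embeds`, `embed_dim`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TextualQueryDecoder(nn.Module):
    """Mask-transformer-style decoder whose object queries are
    initialized from frozen text embeddings of class names (a sketch)."""

    def __init__(self, text_embeds: torch.Tensor, embed_dim: int = 256):
        super().__init__()
        # text_embeds: (num_classes, text_dim), e.g. CLIP embeddings of
        # prompts such as "a photo of a {road}". Kept frozen as a buffer.
        self.register_buffer("text_embeds", text_embeds)
        self.proj = nn.Linear(text_embeds.shape[-1], embed_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, pixel_feats: torch.Tensor) -> torch.Tensor:
        # pixel_feats: (B, H*W, embed_dim) dense visual features.
        b = pixel_feats.shape[0]
        queries = self.proj(self.text_embeds).unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to pixel features; each refined query then
        # scores every pixel, yielding one mask logit map per class.
        queries = self.decoder(queries, pixel_feats)
        return torch.einsum("bqc,bnc->bqn", queries, pixel_feats)


# Usage with dummy tensors: 19 Cityscapes-style classes, 512-dim text space.
text_embeds = torch.randn(19, 512)
model = TextualQueryDecoder(text_embeds)
masks = model(torch.randn(2, 64 * 64, 256))  # (2, 19, 4096) mask logits
```

Because the queries derive from text rather than learned free parameters, they carry semantics that do not depend on the visual style of the training domain, which is the property the paper exploits for generalization.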
Acknowledgements
We sincerely thank Chanyong Lee and Eunjin Koh for their constructive discussions and support. We also appreciate Junyoung Kim, Chaehyeon Lim and Minkyu Song for providing insightful feedback. This work was supported by the Agency for Defense Development (ADD) grant funded by the Korea government (279002001).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pak, B., Woo, B., Kim, S., Kim, D.H., Kim, H. (2025). Textual Query-Driven Mask Transformer for Domain Generalized Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15115. Springer, Cham. https://doi.org/10.1007/978-3-031-72998-0_3
DOI: https://doi.org/10.1007/978-3-031-72998-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72997-3
Online ISBN: 978-3-031-72998-0