
Expanding Scene Graph Boundaries: Fully Open-Vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Scene Graph Generation (SGG) offers a structured representation critical to many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting recognition to predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on node and edge categories: Closed-set SGG, Open-Vocabulary (object) Detection-based SGG (OvD-SGG), Open-Vocabulary Relation-based SGG (OvR-SGG), and Open-Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open-vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR toward fully open-vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture that learns visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging relation-involved settings, the approach integrates relation-aware pre-training on image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at https://github.com/gpt4vision/OvSGTR/.
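To make the two key mechanisms concrete, below is a minimal sketch in PyTorch of how visual-concept alignment and distillation-based retention can be realized. The function names (alignment_logits, retention_loss), tensor shapes, and the choices of cosine similarity and an L1 distillation term are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

    # Minimal sketch of visual-concept alignment and retention (PyTorch).
    # Names and loss choices are illustrative, not the paper's exact code.
    import torch
    import torch.nn.functional as F

    def alignment_logits(visual_feats: torch.Tensor,
                         concept_embeds: torch.Tensor) -> torch.Tensor:
        """Score each node/edge visual feature against every concept
        (category name) text embedding via cosine similarity, so
        classification is a nearest-concept lookup that extends
        naturally to unseen categories."""
        v = F.normalize(visual_feats, dim=-1)    # (N, d)
        t = F.normalize(concept_embeds, dim=-1)  # (C, d)
        return v @ t.T                           # (N, C) similarity logits

    def retention_loss(student_feats: torch.Tensor,
                       teacher_feats: torch.Tensor) -> torch.Tensor:
        """Knowledge-distillation term: keep fine-tuned (student) edge
        features close to those of the frozen, relation-aware pre-trained
        teacher, so alignment for unseen relations is not forgotten
        while fine-tuning on base-category annotations."""
        return F.l1_loss(student_feats, teacher_feats.detach())

    # Toy usage with random tensors standing in for model outputs.
    nodes = torch.randn(5, 256)       # visual features of 5 detected objects
    concepts = torch.randn(120, 256)  # text embeddings of 120 category names
    pred = alignment_logits(nodes, concepts).argmax(dim=-1)  # open-vocab labels

Because categories enter only through their text embeddings, growing the vocabulary at inference time amounts to appending rows to concepts; no classifier weights are tied to a fixed label set.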



Acknowledgements

This research was supported in part by the Hong Kong Research Grants Council (GRF-15229423), the Chinese National Natural Science Foundation Projects (U23B2054, 62306313), and the InnoHK program.

Author information

Correspondence to Chang Wen Chen.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, Z., Wu, J., Lei, Z., Zhang, Z., Chen, C.W. (2025). Expanding Scene Graph Boundaries: Fully Open-Vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72848-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72847-1

  • Online ISBN: 978-3-031-72848-8

  • eBook Packages: Computer Science, Computer Science (R0)
