Abstract
Scene Graph Generation (SGG) offers a structured representation critical to many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption and can recognize only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings according to whether the node (object) and edge (relation) categories are open: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open-vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR that approaches fully open-vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture that learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging relation-involved settings, the proposed approach integrates relation-aware pre-training on image-caption data and retains the visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at https://github.com/gpt4vision/OvSGTR/.
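The abstract's central mechanism is visual-concept alignment: node and edge features are scored against text embeddings of category names rather than a fixed classifier head, so new categories can be added at inference time by embedding their names. Below is a minimal, illustrative sketch of this scoring step; it is not the authors' implementation, and all function names, dimensions, and the temperature value are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of visual-concept alignment:
# node or edge features are matched to text embeddings of category
# names via cosine similarity, so unseen categories can be recognized
# by simply extending the concept list. Names and sizes are illustrative.
import torch
import torch.nn.functional as F

def align_scores(visual_feats: torch.Tensor,
                 concept_embeds: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Cosine-similarity logits between visual features and concepts.

    visual_feats:   (N, D) features for N nodes (or edges).
    concept_embeds: (C, D) text embeddings for C category names.
    Returns (N, C) logits; argmax over C gives the predicted category.
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(concept_embeds, dim=-1)
    return v @ t.t() / temperature

# Toy usage: 4 node features, 3 object concepts, feature dim 256.
nodes = torch.randn(4, 256)
concepts = torch.randn(3, 256)  # e.g. embeddings of "dog", "kite", "person"
logits = align_scores(nodes, concepts)
print(logits.argmax(dim=-1))  # predicted concept index per node
```

Because the concept list is an input rather than a fixed output layer, the same scoring applies unchanged across the four settings above; the relation-involved settings additionally rely on relation-aware pre-training and knowledge distillation to keep this alignment from drifting during fine-tuning.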
Acknowledgements
This research was supported in part by the Hong Kong Research Grants Council (GRF-15229423), the Chinese National Natural Science Foundation Projects (U23B2054, 62306313), and the InnoHK program.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, Z., Wu, J., Lei, Z., Zhang, Z., Chen, C.W. (2025). Expanding Scene Graph Boundaries: Fully Open-Vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_7
DOI: https://doi.org/10.1007/978-3-031-72848-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72847-1
Online ISBN: 978-3-031-72848-8
eBook Packages: Computer Science; Computer Science (R0)