
Expanding Scene Graph Boundaries: Fully Open-Vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Scene Graph Generation (SGG) offers a structured representation critical to many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting recognition to predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on node and edge categories: Closed-set SGG, Open-Vocabulary (object) Detection-based SGG (OvD-SGG), Open-Vocabulary Relation-based SGG (OvR-SGG), and Open-Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open-vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR toward fully open-vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture that learns visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging relation-involved settings, the approach integrates relation-aware pre-training on image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at https://github.com/gpt4vision/OvSGTR/.
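To make the two key mechanisms concrete, below is a minimal sketch in PyTorch of how visual-concept alignment and distillation-based retention can be realized. The function names (alignment_logits, retention_loss), tensor shapes, and the choices of cosine similarity and an L1 distillation term are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

    # Minimal sketch of visual-concept alignment and retention (PyTorch).
    # Names and loss choices are illustrative, not the paper's exact code.
    import torch
    import torch.nn.functional as F

    def alignment_logits(visual_feats: torch.Tensor,
                         concept_embeds: torch.Tensor) -> torch.Tensor:
        """Score each node/edge visual feature against every concept
        (category name) text embedding via cosine similarity, so
        classification is a nearest-concept lookup that extends
        naturally to unseen categories."""
        v = F.normalize(visual_feats, dim=-1)    # (N, d)
        t = F.normalize(concept_embeds, dim=-1)  # (C, d)
        return v @ t.T                           # (N, C) similarity logits

    def retention_loss(student_feats: torch.Tensor,
                       teacher_feats: torch.Tensor) -> torch.Tensor:
        """Knowledge-distillation term: keep fine-tuned (student) edge
        features close to those of the frozen, relation-aware pre-trained
        teacher, so alignment for unseen relations is not forgotten
        while fine-tuning on base-category annotations."""
        return F.l1_loss(student_feats, teacher_feats.detach())

    # Toy usage with random tensors standing in for model outputs.
    nodes = torch.randn(5, 256)       # visual features of 5 detected objects
    concepts = torch.randn(120, 256)  # text embeddings of 120 category names
    pred = alignment_logits(nodes, concepts).argmax(dim=-1)  # open-vocab labels

Because categories enter only through their text embeddings, growing the vocabulary at inference time amounts to appending rows to concepts; no classifier weights are tied to a fixed label set.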



Acknowledgements

This research was supported in part by the Hong Kong Research Grants Council (GRF-15229423), the Chinese National Natural Science Foundation Projects (U23B2054, 62306313), and the InnoHK program.

Author information

Correspondence to Chang Wen Chen.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, Z., Wu, J., Lei, Z., Zhang, Z., Chen, C.W. (2025). Expanding Scene Graph Boundaries: Fully Open-Vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72848-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72847-1

  • Online ISBN: 978-3-031-72848-8

  • eBook Packages: Computer Science, Computer Science (R0)
