Zero-Shot Video Moment Retrieval Using BLIP-Based Models

  • Conference paper
Advances in Visual Computing (ISVC 2023)

Abstract

Video Moment Retrieval (VMR) is a challenging task at the intersection of vision and language, whose goal is to retrieve the moments in a video that correspond to a natural language query. State-of-the-art approaches for VMR often rely on large amounts of training data, including frame-level saliency annotations, weakly supervised pre-training on speech captions, and signals from additional modalities such as audio, which can be limiting in practical scenarios. Moreover, most of these approaches use pre-trained spatio-temporal backbones to aggregate temporal features across multiple frames, which incurs significant training and inference costs. To address these limitations, we propose a zero-shot approach with sparse frame-sampling strategies that does not rely on additional modalities and performs well with features extracted from individual frames only. Our approach uses Bootstrapped Language-Image Pre-training based models (BLIP/BLIP-2), which have been shown to be effective for various downstream vision-language tasks, even in zero-shot settings. We show that such models can be easily repurposed as effective, off-the-shelf feature extractors for VMR. On the QVHighlights benchmark for VMR, our approach outperforms both zero-shot approaches and supervised approaches (without saliency score annotations) by at least 25% and 21%, respectively, on all metrics. We also show that our approach is comparable to state-of-the-art supervised approaches trained on saliency score annotations and additional modalities, with a gap of at most 7% across all metrics.
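The abstract does not spell out the retrieval pipeline, so the following is only a minimal sketch of the general idea it describes: sample frames sparsely, score each sampled frame against the query using per-frame embeddings, and turn the scores into moments. The function names, the cosine-similarity scoring, and the thresholding-and-grouping step are all assumptions made for illustration, not the authors' method. Toy NumPy arrays stand in for the embeddings that a BLIP/BLIP-2 image encoder and text encoder would produce.

```python
import numpy as np


def sample_frame_indices(num_frames, stride):
    """Sparse sampling: keep every `stride`-th frame (hypothetical strategy)."""
    return np.arange(0, num_frames, stride)


def cosine_scores(frame_embs, query_emb):
    """Cosine similarity between each frame embedding and the query embedding."""
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    return frames @ query


def scores_to_moments(scores, timestamps, threshold, frame_gap):
    """Group consecutive sampled frames scoring >= threshold into moments.

    Returns (start_sec, end_sec, mean_score) tuples, best score first.
    The thresholding rule is an assumption, not taken from the paper.
    """
    runs, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # a run of relevant frames begins
        elif score < threshold and start is not None:
            runs.append((start, i))        # run covers frames start..i-1
            start = None
    if start is not None:
        runs.append((start, len(scores)))  # run extends to the last frame

    moments = [
        (
            float(timestamps[a]),
            float(timestamps[b - 1] + frame_gap),  # extend to the next sample
            float(np.mean(scores[a:b])),
        )
        for a, b in runs
    ]
    return sorted(moments, key=lambda m: m[2], reverse=True)


if __name__ == "__main__":
    fps, stride = 1.0, 2                   # toy video: 12 frames at 1 fps
    indices = sample_frame_indices(12, stride)
    timestamps = indices / fps

    # Toy 2-D embeddings standing in for real per-frame and query features.
    frame_embs = np.array(
        [[0, 1], [0, 1], [1, 0], [1, 0.1], [1, 0], [0, 1]], dtype=float
    )
    query_emb = np.array([1.0, 0.0])

    scores = cosine_scores(frame_embs, query_emb)
    print(scores_to_moments(scores, timestamps, threshold=0.5, frame_gap=stride / fps))
```

Because each frame is scored independently, a sketch like this needs no spatio-temporal backbone and no training. The quality of the result depends entirely on the embeddings and on how the threshold is chosen.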

Acknowledgments

This work was partially funded by the Research School on “Service-Oriented Systems Engineering” of the Hasso Plattner Institute.

Author information

Corresponding author

Correspondence to Jobin Idiculla Wattasseril.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wattasseril, J.I., Shekhar, S., Döllner, J., Trapp, M. (2023). Zero-Shot Video Moment Retrieval Using BLIP-Based Models. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2023. Lecture Notes in Computer Science, vol 14361. Springer, Cham. https://doi.org/10.1007/978-3-031-47969-4_13

  • DOI: https://doi.org/10.1007/978-3-031-47969-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47968-7

  • Online ISBN: 978-3-031-47969-4

  • eBook Packages: Computer Science, Computer Science (R0)
