Abstract
Video Moment Retrieval (VMR) is a challenging task at the intersection of vision and language, with the goal of retrieving relevant moments from videos that correspond to natural language queries. State-of-the-art approaches to VMR often rely on large amounts of training data, including frame-level saliency annotations, weakly supervised pre-training on speech captions, and signals from additional modalities such as audio, which can be limiting in practical scenarios. Moreover, most of these approaches use pre-trained spatio-temporal backbones to aggregate temporal features across multiple frames, incurring significant training and inference costs. To address these limitations, we propose a zero-shot approach with sparse frame-sampling strategies that does not rely on additional modalities and performs well with features extracted from individual frames alone. Our approach builds on Bootstrapped Language-Image Pre-training models (BLIP/BLIP-2), which have been shown to be effective for various downstream vision-language tasks, even in zero-shot settings. We show that such models can be easily repurposed as effective, off-the-shelf feature extractors for VMR. On the QVHighlights benchmark for VMR, our approach outperforms both zero-shot approaches and supervised approaches (without saliency score annotations) by at least \(25\%\) and \(21\%\), respectively, on all metrics. Furthermore, our approach is comparable to state-of-the-art supervised approaches trained on saliency score annotations and additional modalities, with a gap of at most \(7\%\) across all metrics.
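To make the retrieval pipeline concrete, the sketch below scores sparsely sampled frames against a text query with an off-the-shelf BLIP image-text matching model and merges consecutive high-scoring frames into moment spans. This is a minimal illustration of the idea, not the authors' implementation: the Hugging Face checkpoint `Salesforce/blip-itm-base-coco`, the 2-second sampling stride, the 0.5 score threshold, and the span-merging rule are all illustrative assumptions.

```python
# Minimal zero-shot VMR sketch: BLIP as an off-the-shelf per-frame scorer.
# Checkpoint, stride, and threshold are illustrative choices, not the
# paper's exact configuration.
import cv2
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained(
    "Salesforce/blip-itm-base-coco").to(device).eval()

def sample_frames(video_path: str, stride_sec: float = 2.0):
    """Sparsely sample one frame every `stride_sec` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(fps * stride_sec), 1)
    idx, frames = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append((idx / fps, Image.fromarray(rgb)))
        idx += 1
    cap.release()
    return frames  # list of (timestamp_sec, PIL image)

@torch.no_grad()
def score_frame(frame: Image.Image, query: str) -> float:
    """Query-frame matching probability from BLIP's ITM head."""
    inputs = processor(images=frame, text=query,
                       return_tensors="pt").to(device)
    logits = model(**inputs).itm_score  # shape (1, 2): [no-match, match]
    return torch.softmax(logits, dim=-1)[0, 1].item()

def retrieve_moments(video_path: str, query: str,
                     stride_sec: float = 2.0, threshold: float = 0.5):
    """Merge runs of above-threshold frames into (start, end) spans."""
    spans, start = [], None
    for t, frame in sample_frames(video_path, stride_sec):
        if score_frame(frame, query) >= threshold:
            start = t if start is None else start
        elif start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, t))
    return spans  # [(start_sec, end_sec), ...]
```

Since each frame is scored independently, no spatio-temporal backbone is needed; the stride and threshold trade recall against the number of forward passes per video.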
References
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR 2017, pp. 4724–4733. IEEE Computer Society (2017). https://doi.org/10.1109/CVPR.2017.502
Chen, J., Luo, W., Zhang, W., Ma, L.: Explore inter-contrast between videos via composition for weakly supervised temporal sentence grounding. In: AAAI 2022, pp. 267–275. AAAI Press (2022). https://ojs.aaai.org/index.php/AAAI/article/view/19902
Choe, T.E., Lee, M.W., Guo, F., Taylor, G., Yu, L., Haering, N.: Semantic video event search for surveillance video. In: ICCV Workshops 2011, pp. 1963–1970. IEEE (2011)
Diwan, A., Peng, P., Mooney, R.J.: Zero-shot video moment retrieval with off-the-shelf models. CoRR abs/2211.02178 (2022). https://doi.org/10.48550/arXiv.2211.02178
Du, Y., Liu, Z., Li, J., Zhao, W.X.: A survey of vision-language pre-trained models. In: IJCAI 2022, pp. 5436–5443. ijcai.org (2022). https://doi.org/10.24963/ijcai.2022/762
Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV 2019, pp. 6201–6210. IEEE (2019). https://doi.org/10.1109/ICCV.2019.00630
Hao, J., Sun, H., Ren, P., Wang, J., Qi, Q., Liao, J.: Can shuffling video benefit temporal bias problem: a novel training framework for temporal grounding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13696, pp. 130–147. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20059-5_8
Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.C.: Localizing moments in video with natural language. In: ICCV 2017, pp. 5804–5813. IEEE Computer Society (2017). https://doi.org/10.1109/ICCV.2017.618
Huang, J., Worring, M.: Query-controllable video summarization. In: ICMR 2020, pp. 242–250. ACM (2020). https://doi.org/10.1145/3372278.3390695
Leake, M., Davis, A., Truong, A., Agrawala, M.: Computational video editing for dialogue-driven scenes. ACM Trans. Graph. 36(4), 130:1–130:14 (2017). https://doi.org/10.1145/3072959.3073653
Lei, J., Berg, T.L., Bansal, M.: QVHighlights: detecting moments and highlights in videos via natural language queries. CoRR abs/2107.09609 (2021). https://arxiv.org/abs/2107.09609
Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVR: a large-scale dataset for video-subtitle moment retrieval. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 447–463. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58589-1_27
Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. CoRR abs/2301.12597 (2023). https://doi.org/10.48550/arXiv.2301.12597
Li, J., Li, D., Xiong, C., Hoi, S.C.H.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML 2022, vol. 162, pp. 12888–12900. PMLR (2022). https://proceedings.mlr.press/v162/li22n.html
Liu, D., et al.: Context-aware biaffine localizing network for temporal sentence grounding. In: CVPR 2021, pp. 11235–11244. Computer Vision Foundation/IEEE (2021). https://doi.org/10.1109/CVPR46437.2021.01108
Liu, Y., Li, S., Wu, Y., Chen, C.W., Shan, Y., Qie, X.: UMT: unified multi-modal transformers for joint video moment retrieval and highlight detection. In: CVPR 2022, pp. 3032–3041. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.00305
Moon, W., Hyun, S., Park, S., Park, D., Heo, J.P.: Query-dependent video representation for moment retrieval and highlight detection. In: CVPR 2023, pp. 23023–23033 (2023)
Nam, J., Ahn, D., Kang, D., Ha, S.J., Choi, J.: Zero-shot natural language video localization. In: ICCV 2021, pp. 1450–1459. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00150
Song, Y., Wang, J., Ma, L., Yu, Z., Yu, J.: Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. CoRR abs/2003.07048 (2020). https://arxiv.org/abs/2003.07048
Tang, K., Bao, Y., Zhao, Z., Zhu, L., Lin, Y., Peng, Y.: AutoHighlight: automatic highlights detection and segmentation in soccer matches. In: IEEE BigData 2018, pp. 4619–4624. IEEE (2018). https://doi.org/10.1109/BigData.2018.8621906
Tellex, S., Kollar, T., Shaw, G., Roy, N., Roy, D.: Grounding spatial language for video search. In: ICMI-MLMI 2010, pp. 31:1–31:8. ACM (2010). https://doi.org/10.1145/1891903.1891944
Wang, G., Wu, X., Liu, Z., Yan, J.: Prompt-based zero-shot video moment retrieval. In: MM 2022, pp. 413–421. ACM (2022). https://doi.org/10.1145/3503161.3548004
Wang, M., Yang, G.W., Hu, S.M., Yau, S.T., Shamir, A.: Write-a-video: computational video montage from themed text. ACM Trans. Graph. 38(6), 177:1–177:13 (2019)
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19
Xu, M., et al.: Boundary-sensitive pre-training for temporal localization in videos. In: ICCV 2021, pp. 7200–7210. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00713
Zeng, Y., Cao, D., Lu, S., Zhang, H., Xu, J., Qin, Z.: Moment is important: language-based video moment retrieval via adversarial learning. ACM Trans. Multim. Comput. Commun. Appl. 18(2), 56:1–56:21 (2022). https://doi.org/10.1145/3478025
Zhang, H., Sun, A., Jing, W., Zhou, J.T.: The elements of temporal sentence grounding in videos: a survey and future directions. CoRR abs/2201.08071 (2022). https://arxiv.org/abs/2201.08071
Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: AAAI 2020, pp. 12870–12877. AAAI Press (2020). https://ojs.aaai.org/index.php/AAAI/article/view/6984
Acknowledgments
This work was partially funded by the Research School on “Service-Oriented Systems Engineering” of the Hasso Plattner Institute.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wattasseril, J.I., Shekhar, S., Döllner, J., Trapp, M. (2023). Zero-Shot Video Moment Retrieval Using BLIP-Based Models. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2023. Lecture Notes in Computer Science, vol 14361. Springer, Cham. https://doi.org/10.1007/978-3-031-47969-4_13
DOI: https://doi.org/10.1007/978-3-031-47969-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47968-7
Online ISBN: 978-3-031-47969-4
eBook Packages: Computer Science, Computer Science (R0)