Abstract
The sign spotting task aims to identify whether and where an isolated sign of interest occurs in a continuous sign language video. It has recently received substantial attention as a promising tool for annotating large-scale sign language data. Previous methods exploit multiple available sources of supervision to localize sign actions in the RGB domain alone. However, they overlook the complementary nature of different modalities, i.e., RGB, optical flow, and pose, all of which benefit the sign spotting task. To this end, we propose a framework that merges multiple modalities for multiple-shot supervised learning. Furthermore, we explore sign spotting under the one-shot setting, which requires fewer annotations and has broader applications. To evaluate our approach, we participated in the Sign Spotting Challenge organized at ECCV 2022. The competition contains two tracks, i.e., multiple-shot supervised learning (MSSL, Track 1) and one-shot learning with weak labels (OSLWL, Track 2). In Track 1, our method achieves an F1-score of around 0.566, ranking 2nd. In Track 2, we rank 1st with an F1-score of 0.6. These results demonstrate the effectiveness of our proposed method, and we hope our solution provides insight for future research in the community.
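The abstract does not specify how the modalities are combined, so the following is only a minimal sketch assuming a simple late-fusion scheme: per-frame class probabilities from hypothetical RGB, optical-flow, and pose streams are averaged with weights, and intervals where the target sign's fused score stays above a threshold are reported as spottings. The helper names (`fuse_modalities`, `spot_sign`) and the thresholding heuristic are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def fuse_modalities(rgb_scores, flow_scores, pose_scores, weights=(1.0, 1.0, 1.0)):
    """Late-fuse per-frame sign probabilities from three modality streams.

    Each input is a (T, C) array of per-frame class probabilities for the
    same video. A weighted average is one simple way to exploit the
    complementary modalities; the paper's actual fusion may differ.
    """
    w_rgb, w_flow, w_pose = weights
    fused = w_rgb * rgb_scores + w_flow * flow_scores + w_pose * pose_scores
    return fused / (w_rgb + w_flow + w_pose)

def spot_sign(fused_scores, target_class, threshold=0.5):
    """Group consecutive frames where the target sign's fused probability
    exceeds the threshold into (start, end) intervals."""
    active = fused_scores[:, target_class] > threshold
    intervals, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t                      # interval opens
        elif not on and start is not None:
            intervals.append((start, t - 1))  # interval closes
            start = None
    if start is not None:                  # interval runs to the last frame
        intervals.append((start, len(active) - 1))
    return intervals
```

Under the challenge protocol, predicted intervals of this form would then be matched against ground-truth annotations (typically via a temporal-IoU criterion) to compute the F1-scores reported above.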
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Contract U20A20183. It was also supported by the GPU cluster built by the MCC Lab of the Information Science and Technology Institution, USTC.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, L., Zhou, W., Zhao, W., Hu, H., Li, H. (2023). Multi-modal Sign Language Spotting by Multi/One-Shot Learning. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_15
DOI: https://doi.org/10.1007/978-3-031-25085-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25084-2
Online ISBN: 978-3-031-25085-9
eBook Packages: Computer Science, Computer Science (R0)