
Computer Science, 2022, Vol. 49, Issue (9): 155-161. doi: 10.11896/jsjkx.210800026

Computer Graphics & Multimedia


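ZHOU Le-yuan1, ZHANG Jian-hua1, YUAN Tian-tian2, CHEN Sheng-yong1

  1. 1 School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300382
    2 Technical College for the Deaf, Tianjin University of Technology, Tianjin 300382
  • Received: 2021-08-03 Revised: 2021-12-10 Online: 2022-09-15 Published: 2022-09-09
  • Corresponding author: ZHANG Jian-hua (zjh@email.tjut.edu.cn)
  • About author: (870185811@qq.com)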

Sequence-to-Sequence Chinese Continuous Sign Language Recognition and Translation with Multi-layer Attention Mechanism Fusion

ZHOU Le-yuan1, ZHANG Jian-hua1, YUAN Tian-tian2, CHEN Sheng-yong1   

  1. 1 School of Computer Science and Technology, Tianjin University of Technology, Tianjin 300382, China
    2 Technical College for the Deaf, Tianjin University of Technology, Tianjin 300382, China
  • Received: 2021-08-03 Revised: 2021-12-10 Online: 2022-09-15 Published: 2022-09-09
  • About author: ZHOU Le-yuan, born in 1996, postgraduate. His main research interests include deep learning and computer vision.
    ZHANG Jian-hua, born in 1981, Ph.D., professor, Ph.D. supervisor. His main research interests include computer vision, digital image processing and intelligent robot technology.
  • Supported by:
    National Natural Science Foundation of China (61876167), Natural Science Foundation of Zhejiang Province (LY20F030017) and Tianjin Intelligent Manufacturing Special Foundation (20201169).

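Abstract: Enabling computers to understand what signers express has long been a highly challenging task: it requires modeling not only the temporal and spatial information in sign language videos but also the complexity of sign language grammar. In continuous sign language recognition, sign language words follow the same order as the signed actions; in continuous sign language translation, the generated natural-language sentence should read as spoken language, so its word order may differ from the order of the actions. To learn signers' expressions more accurately, this paper proposes a novel deep neural network that performs sign language recognition and translation together. The scheme examines how different classical pre-trained convolutional neural networks and different multi-layer temporal attention score functions affect continuous sign language recognition. The network combines the high-level abstract features and the low-level temporal semantics of the sign language video in a multi-layer temporal attention fusion module, forming more comprehensive sequence attention fusion features from which gloss sentences are generated more accurately from continuous sign language video. A Transformer language model then converts the recognized gloss sentences into continuous natural-language sentences suitable for sign language translation. The method is first evaluated on Tslrt, the first large-scale Chinese continuous sign language recognition and translation dataset with complex backgrounds. The proposed model is trained on the complex background environments and rich action expressions of the signers in Tslrt, and a series of benchmark results is obtained through different comparative experiments. The best word error rates on continuous sign language recognition and translation are 4.8% and 5.1%, respectively. To further demonstrate its effectiveness, the method is also validated on another public Chinese continuous sign language recognition dataset, Chinese-CSL, and compared with 13 other published methods; it achieves the best recognition result, a word error rate of 1.8%.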

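Key words: Continuous sign language recognition and translation, Video understanding, Sequence model, Attention mechanism fusion, Convolutional neural network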

Abstract: Enabling computers to understand the expressions of signers has long been a challenging task that requires considering not only the temporal and spatial information of sign language videos, but also the complexity of sign language grammar. In the continuous sign language recognition task, sign language words and sign language actions share a consistent order. In contrast, in the continuous sign language translation task, the generated natural language sentences must read as spoken language, and the word order may not coincide with the action order. To learn signers' expressions more accurately, this paper proposes a novel deep neural network for simultaneous sign language recognition and translation. The scheme explores the effectiveness of different classical pre-trained convolutional neural networks and different multi-layer temporal attention score functions on continuous sign language recognition. Combined with a Transformer language model, the recognition output is then converted into a continuous sign language translation that conforms to spoken language. The method is first assessed on Tslrt, the first large-scale Chinese continuous sign language recognition and translation dataset with complex backgrounds. The complex background environments and rich action expressions of the signers in Tslrt are used to train the neural network model, and different comparison experiments yield a series of benchmark results. The best word error rates (WER) are 4.8% and 5.1% on the continuous sign language recognition and translation tasks, respectively. To further demonstrate the effectiveness of the method, experiments are conducted on another Chinese continuous sign language recognition dataset, Chinese-CSL, and the method is compared with 13 other methods. The results show that the WER of the method reaches 1.8%, demonstrating its effectiveness.

Key words: Continuous sign language recognition and translation, Video understanding, Sequence model, Attention mechanism fusion, Convolutional neural network
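
The results above are reported as word error rate (WER). As a reading aid, the following is a minimal sketch of the standard WER definition: the token-level edit distance (substitutions, deletions and insertions) between the recognized sentence and the reference, divided by the reference length. The function name `word_error_rate` and the example gloss sequences are illustrative assumptions; the abstract does not describe the evaluation script actually used.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of tokens (for example, sign glosses).
    This is the textbook definition, not the paper's own scoring code.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[m] / n


# Hypothetical gloss sequences: one reference gloss is missing
# from the recognized output, so WER = 1 / 4.
reference = ["I", "GO", "SCHOOL", "TOMORROW"]
hypothesis = ["I", "GO", "TOMORROW"]
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.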


  • TP391
