Transformer-Based Automatic Speech Recognition with Auxiliary Input of Source Language Text Toward Transcribing Simultaneous Interpretation

Taniguchi, Shuta; Kato, Tsuneo; Tamura, Akihiro; Yasuda, Keiji

doi:10.21437/Interspeech.2022-448

ISCA Archive Interspeech 2022

In training programs for human simultaneous interpreters, trainee speech is transcribed into text for quality assessment. Although interpreter speech contains irregular speech events such as hesitations, filled pauses, and self-repairs, automatic speech recognition (ASR) is expected to be introduced to reduce the labor of transcription. In these training programs, the source language text can be used for ASR because the training materials are prepared in advance. We therefore propose a Transformer-based end-to-end ASR model with an auxiliary input of source language text, aimed at transcribing simultaneous interpretation. Because a sufficient amount of human interpreter speech paired with source language text is not available for training the model, we conducted an initial evaluation by simulating such data: we rearranged the inputs and outputs of large-scale corpora developed for end-to-end speech translation (ST). The proposed model significantly reduced word error rates (WERs) on four ST corpora: MuST-C English speech - Dutch text, MuST-C English speech - German text, CoVoST 2 English speech - Japanese text, and our original TED-based English speech - Japanese text corpus.
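
The data simulation and the evaluation metric described above can be illustrated with a minimal Python sketch. It is not the authors' code. The class and field names (`STExample`, `AuxASRExample`, `to_aux_asr`) are hypothetical, and the mapping reflects one reading of the abstract: the text in the other language of an ST corpus entry stands in for the source language text, and the transcript of the speech becomes the ASR target. The WER function uses the standard word-level edit distance with whitespace tokenization, which is an assumption; the abstract does not state the scoring details.

```python
from dataclasses import dataclass


@dataclass
class STExample:
    """One speech-translation corpus entry (hypothetical field names)."""
    audio_path: str
    transcript: str   # text in the language of the speech
    translation: str  # text in the other language of the pair


@dataclass
class AuxASRExample:
    """One simulated example for ASR with an auxiliary text input."""
    audio_path: str
    aux_text: str     # stands in for the source language text
    target_text: str  # what the ASR model is asked to output


def to_aux_asr(example: STExample) -> AuxASRExample:
    """Rearrange an ST entry: translation becomes auxiliary input,
    transcript becomes the recognition target (assumed mapping)."""
    return AuxASRExample(
        audio_path=example.audio_path,
        aux_text=example.translation,
        target_text=example.transcript,
    )


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    st = STExample("talk_001.wav", "thank you very much", "vielen Dank")
    sim = to_aux_asr(st)
    print(sim.aux_text, "->", sim.target_text)
    print(word_error_rate(sim.target_text, "thank you much"))
```

Keeping the rearrangement as a pure function over corpus entries means the same ST data can serve both the simulated task and an ordinary ASR baseline (by ignoring `aux_text`), which is the comparison a WER reduction would be measured against.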