Abstract
End-to-end speech translation (ST) has attracted substantial attention due to its reduced error accumulation and lower latency. Based on triplet ST data \(\langle\)speech-transcription-translation\(\rangle\), multitask learning (MTL) that uses the machine translation \(\langle\)transcription-translation\(\rangle\) or automatic speech recognition \(\langle\)speech-transcription\(\rangle\) task to assist in training the ST model is widely employed. However, current MTL methods often suffer from subnet role mismatch or semantic inconsistency, or focus only on transferring knowledge from the automatic speech recognition (ASR) or machine translation (MT) task, leading to insufficient transfer of cross-task knowledge. To address these problems, we propose the multitask co-training network (MCTN), which jointly models the ST, MT, and ASR tasks. Specifically, the ASR task enables the acoustic encoder to better capture local information in speech frames, and the MT task enhances the translation capability of the model. MCTN benefits from three key aspects: a well-designed multitask framework that fully exploits the associations between tasks, a model decoupling and parameter sharing method that keeps subnet roles consistent, and a co-training strategy that exploits the task information in triplet ST data. Our experiments show that MCTN achieves state-of-the-art results when using only the MuST-C dataset, and significantly outperforms strong end-to-end ST baselines and cascaded systems when external data are available.
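To make the co-training idea concrete, below is a minimal sketch of one joint training step over a triplet batch, assuming a PyTorch Transformer encoder-decoder. All module names, dimensions, the unweighted loss sum, and the single shared decoder are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch of joint ST/ASR/MT co-training on triplet data
# <speech, transcription, translation>. Hypothetical illustration only;
# module names and the shared-decoder design are assumptions, not the
# authors' published code.
import torch
import torch.nn as nn

class MultitaskCoTrainer(nn.Module):
    def __init__(self, d_model=256, vocab_size=10000):
        super().__init__()
        # Acoustic encoder for speech frames; textual encoder for transcripts.
        self.acoustic_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # A decoder shared by all three tasks, so each subnet keeps one role.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)
        self.loss_fn = nn.CrossEntropyLoss()

    def decode_loss(self, memory, tgt_in, tgt_out):
        # Teacher-forced decoding against the given encoder memory.
        out = self.decoder(self.embed(tgt_in), memory)
        return self.loss_fn(self.proj(out).transpose(1, 2), tgt_out)

    def forward(self, speech, transcript_in, transcript_out,
                translation_in, translation_out):
        speech_mem = self.acoustic_encoder(speech)
        text_mem = self.text_encoder(self.embed(transcript_in))
        # One co-training step sums the three task losses on the same triplet.
        loss_st = self.decode_loss(speech_mem, translation_in, translation_out)
        loss_asr = self.decode_loss(speech_mem, transcript_in, transcript_out)
        loss_mt = self.decode_loss(text_mem, translation_in, translation_out)
        return loss_st + loss_asr + loss_mt
```

In a real system the three losses would typically be weighted, and the decoupling and parameter-sharing scheme the abstract describes would determine exactly which encoder and decoder layers each task reuses.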
Data availability
Details on how to obtain the associated data are provided within the paper.
Acknowledgements
We thank the anonymous reviewers for their helpful comments and suggestions. This work is supported by the National Key R&D Program of China (Grant no. 2020AAA0107904), the Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan (Grant no. WT135-38), and the Key Support Project of the NSFC-Liaoning Joint Foundation (Grant no. U1908216).
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, Y., Yuan, Y. & Shi, X. A multitask co-training framework for improving speech translation by leveraging speech recognition and machine translation tasks. Neural Comput & Applic 36, 8641–8656 (2024). https://doi.org/10.1007/s00521-024-09547-8