A multitask co-training framework for improving speech translation by leveraging speech recognition and machine translation tasks

  • Original Article
  • Neural Computing and Applications

Abstract

End-to-end speech translation (ST) has attracted substantial attention due to its reduced error accumulation and lower latency. Based on triplet ST data \(\langle\)speech-transcription-translation\(\rangle\), multitask learning (MTL) that uses the machine translation (MT) task \(\langle\)transcription-translation\(\rangle\) or the automatic speech recognition (ASR) task \(\langle\)speech-transcription\(\rangle\) to assist in training the ST model is widely employed. However, current MTL methods often suffer from subnet role mismatch or semantic inconsistency, or focus on transferring knowledge from only the ASR or the MT task, leading to insufficient transfer of cross-task knowledge. To address these problems, we propose the multitask co-training network (MCTN), which jointly models the ST, MT, and ASR tasks. Specifically, the ASR task enables the acoustic encoder to better capture local information in speech frames, and the MT task enhances the translation capability of the model. MCTN benefits from three key aspects: a well-designed multitask framework that fully exploits the associations between tasks, a model decoupling and parameter sharing method that keeps subnet roles consistent, and a co-training strategy that exploits the task information in triplet ST data. Our experiments show that MCTN achieves state-of-the-art results when using only the MuST-C dataset, and that it significantly outperforms strong end-to-end ST baselines and cascaded systems when external data are available.
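To make the training setup described in the abstract concrete, the sketch below shows one simple way to co-train the three tasks on a single triplet batch with a weighted multitask loss. It is a minimal PyTorch illustration under our own assumptions: the names (`MultitaskSTModel`, `cotraining_step`), the use of a single shared decoder, the layer sizes, and the loss weights are hypothetical placeholders, not the paper's actual MCTN architecture, decoupling scheme, or co-training schedule.

```python
# Minimal sketch of joint ST/ASR/MT training on triplet data
# <speech, transcription, translation>. All module names, sizes, and loss
# weights are illustrative assumptions, NOT the paper's exact MCTN; it only
# demonstrates the general idea: an acoustic encoder supervised by ASR, a
# text encoder supervised by MT, and a shared decoder trained on all tasks.
import torch
import torch.nn as nn


class MultitaskSTModel(nn.Module):
    def __init__(self, n_mel=80, vocab=8000, d_model=256, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        # Acoustic encoder: maps speech frames into the shared hidden space.
        self.speech_proj = nn.Linear(n_mel, d_model)
        self.acoustic_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Text encoder: encodes transcriptions for the MT task.
        self.embed = nn.Embedding(vocab, d_model, padding_idx=pad_id)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # A single decoder serves ST, MT, and ASR, so its parameters are
        # shared across the tasks (one simple form of parameter sharing).
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def decode(self, memory, tgt_in):
        t = tgt_in.size(1)
        causal = torch.triu(  # standard causal mask for teacher forcing
            torch.full((t, t), float("-inf"), device=tgt_in.device),
            diagonal=1)
        h = self.decoder(self.embed(tgt_in), memory, tgt_mask=causal)
        return self.out(h)

    def forward(self, speech, transcript, translation):
        """Compute logits for all three tasks on one triplet batch."""
        audio_h = self.acoustic_encoder(self.speech_proj(speech))
        text_h = self.text_encoder(self.embed(transcript))
        return {
            "asr": self.decode(audio_h, transcript[:, :-1]),   # speech -> transcription
            "st":  self.decode(audio_h, translation[:, :-1]),  # speech -> translation
            "mt":  self.decode(text_h, translation[:, :-1]),   # transcription -> translation
        }


def cotraining_step(model, optimizer, speech, transcript, translation,
                    weights=(1.0, 0.5, 0.5)):
    """One joint update: ST loss plus weighted ASR and MT auxiliary losses.
    The weights are placeholders, not values from the paper."""
    ce = nn.CrossEntropyLoss(ignore_index=model.pad_id)
    logits = model(speech, transcript, translation)
    loss = (weights[0] * ce(logits["st"].transpose(1, 2), translation[:, 1:])
            + weights[1] * ce(logits["asr"].transpose(1, 2), transcript[:, 1:])
            + weights[2] * ce(logits["mt"].transpose(1, 2), translation[:, 1:]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = MultitaskSTModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    speech = torch.randn(2, 120, 80)              # (batch, frames, mel bins)
    transcript = torch.randint(1, 8000, (2, 20))  # token ids, 0 = padding
    translation = torch.randint(1, 8000, (2, 24))
    print(cotraining_step(model, opt, speech, transcript, translation))
```

Sharing one decoder is only the simplest way to realize the consistency of subnet roles that the abstract mentions; the full text details how MCTN actually decouples the model and schedules the co-training.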

Data availability

Details on how to obtain the associated data are provided within the paper.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. This work is supported by the National Key R&D Program of China (Grant no. 2020AAA0107904), the Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan (Grant no. WT135-38), and the Key Support Project of the NSFC-Liaoning Joint Foundation (Grant no. U1908216).

Author information

Corresponding author

Correspondence to Xiaodong Shi.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, Y., Yuan, Y. & Shi, X. A multitask co-training framework for improving speech translation by leveraging speech recognition and machine translation tasks. Neural Comput & Applic 36, 8641–8656 (2024). https://doi.org/10.1007/s00521-024-09547-8
