TST: Time-Sparse Transducer for Automatic Speech Recognition

  • Conference paper
  • In: Artificial Intelligence (CICAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14474)

Abstract

End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing long decoding sequences. To solve this problem, we propose a model named the time-sparse transducer, which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states. A weighted average algorithm then combines these representations into sparse hidden states, which are fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. Compared with RNN-T, the time-sparse transducer achieves a close character error rate while reducing the real-time factor to 50.00% of the original. By adjusting the time resolution, the time-sparse transducer can further reduce the real-time factor to 16.54% of the original at the expense of a 4.94% loss of precision.
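To make the mechanism concrete, below is a minimal PyTorch sketch of chunk-wise time-resolution reduction followed by a learned weighted average. The module name TimeSparseMechanism, the chunk_size parameter, and the linear scoring layer are illustrative assumptions based on the abstract, not the authors' exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TimeSparseMechanism(nn.Module):
        """Hypothetical sketch: split encoder hidden states into chunks
        along time (reducing the time resolution), then collapse each
        chunk into one sparse hidden state via a learned weighted average."""

        def __init__(self, hidden_dim: int, chunk_size: int):
            super().__init__()
            self.chunk_size = chunk_size           # time-resolution reduction factor
            self.score = nn.Linear(hidden_dim, 1)  # one scalar weight per frame

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, T, hidden_dim) encoder hidden states
            b, t, d = h.shape
            pad = (-t) % self.chunk_size
            if pad:  # right-pad along time so T divides evenly into chunks
                h = F.pad(h, (0, 0, 0, pad))
            # Intermediate representations at reduced time resolution:
            # (batch, T / chunk_size, chunk_size, hidden_dim)
            chunks = h.reshape(b, -1, self.chunk_size, d)
            # Softmax weights within each chunk, then a weighted average
            w = torch.softmax(self.score(chunks), dim=2)
            return (w * chunks).sum(dim=2)  # (batch, T / chunk_size, hidden_dim)

Under this reading, chunk_size trades accuracy for speed: halving the time resolution would roughly match the reported 50.00% real-time factor, while a more aggressive reduction corresponds to the 16.54% real-time factor at a 4.94% precision cost.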

Author information

Corresponding author

Correspondence to Mangui Liang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, X., Liang, M., Tian, Z., Yi, J., Tao, J. (2024). TST: Time-Sparse Transducer for Automatic Speech Recognition. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_7

  • DOI: https://doi.org/10.1007/978-981-99-9119-8_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-9118-1

  • Online ISBN: 978-981-99-9119-8

  • eBook Packages: Computer Science, Computer Science (R0)
