Abstract
End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing long decoding sequences. To address this problem, we propose the time-sparse transducer, a model that introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states; a weighted average algorithm then combines these representations into sparse hidden states, which are fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. The time-sparse transducer achieves a character error rate close to that of RNN-T while reducing the real-time factor to 50.00% of the original. By adjusting the time resolution, it can further reduce the real-time factor to 16.54% of the original at the cost of a 4.94% loss in precision.
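As a rough illustration of the time-sparse mechanism described above, the sketch below reduces the time resolution of encoder hidden states by taking a weighted average over non-overlapping windows of frames. This is not the authors' implementation: the `rate` parameter is a hypothetical name, and the uniform window weights are an assumption standing in for the paper's weighted average algorithm.

```python
import torch
import torch.nn.functional as F

def time_sparse(hidden: torch.Tensor, rate: int = 2) -> torch.Tensor:
    """Collapse encoder hidden states (batch, time, dim) into sparse
    hidden states (batch, ceil(time / rate), dim) by a weighted
    average over non-overlapping windows of `rate` frames."""
    b, t, d = hidden.shape
    # Pad the time axis so it divides evenly into windows.
    pad = (rate - t % rate) % rate
    hidden = F.pad(hidden, (0, 0, 0, pad))
    # Group consecutive frames: (batch, time // rate, rate, dim).
    windows = hidden.view(b, (t + pad) // rate, rate, d)
    # Uniform weights over each window (an assumption; the paper's
    # weighted average algorithm defines its own weighting).
    weights = torch.softmax(windows.new_ones(rate), dim=0)
    # Weighted average over the window axis.
    return torch.einsum('bwrd,r->bwd', windows, weights)

# Toy usage: halving the time resolution halves the number of
# hidden states the decoder must process.
enc = torch.randn(4, 100, 256)          # (batch, frames, dim)
print(time_sparse(enc, rate=2).shape)   # torch.Size([4, 50, 256])
```

Shortening the sequence of hidden states passed downstream is what drives the reported real-time factor reductions: larger rates shrink the decoding sequence further, trading accuracy for speed.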
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, X., Liang, M., Tian, Z., Yi, J., Tao, J. (2024). TST: Time-Sparse Transducer for Automatic Speech Recognition. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_7
DOI: https://doi.org/10.1007/978-981-99-9119-8_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9118-1
Online ISBN: 978-981-99-9119-8