Abstract
End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing long decoding sequences. To address this problem, we propose the time-sparse transducer, a model that introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states; a weighted average algorithm then combines these representations into sparse hidden states, which are fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. The time-sparse transducer achieves a character error rate close to that of RNN-T while reducing the real-time factor to 50.00% of the original. By adjusting the time resolution, it can further reduce the real-time factor to 16.54% of the original at the cost of a 4.94% loss in precision.
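As a rough illustration of the time-sparse mechanism described above, the sketch below reduces the time resolution of encoder hidden states by taking a weighted average over non-overlapping windows of frames. This is not the authors' implementation: the `rate` parameter is a hypothetical name, and the uniform window weights are an assumption standing in for the paper's weighted average algorithm.

```python
import torch
import torch.nn.functional as F

def time_sparse(hidden: torch.Tensor, rate: int = 2) -> torch.Tensor:
    """Collapse encoder hidden states (batch, time, dim) into sparse
    hidden states (batch, ceil(time / rate), dim) by a weighted
    average over non-overlapping windows of `rate` frames."""
    b, t, d = hidden.shape
    # Pad the time axis so it divides evenly into windows.
    pad = (rate - t % rate) % rate
    hidden = F.pad(hidden, (0, 0, 0, pad))
    # Group consecutive frames: (batch, time // rate, rate, dim).
    windows = hidden.view(b, (t + pad) // rate, rate, d)
    # Uniform weights over each window (an assumption; the paper's
    # weighted average algorithm defines its own weighting).
    weights = torch.softmax(windows.new_ones(rate), dim=0)
    # Weighted average over the window axis.
    return torch.einsum('bwrd,r->bwd', windows, weights)

# Toy usage: halving the time resolution halves the number of
# hidden states the decoder must process.
enc = torch.randn(4, 100, 256)          # (batch, frames, dim)
print(time_sparse(enc, rate=2).shape)   # torch.Size([4, 50, 256])
```

Shortening the sequence of hidden states passed downstream is what drives the reported real-time factor reductions: larger rates shrink the decoding sequence further, trading accuracy for speed.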
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, X., Liang, M., Tian, Z., Yi, J., Tao, J. (2024). TST: Time-Sparse Transducer for Automatic Speech Recognition. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_7
DOI: https://doi.org/10.1007/978-981-99-9119-8_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9118-1
Online ISBN: 978-981-99-9119-8