Abstract
This paper presents a novel deep learning architecture for word-level lipreading. Previous work suggests the potential of using pretrained deep 3D Convolutional Neural Networks as a front-end feature extractor. We introduce SpotFast networks, a variant of the state-of-the-art SlowFast networks for action recognition, which use a temporal window as a spot pathway and all frames as a fast pathway. The spot pathway uses word boundary information, while the fast pathway implicitly models other contexts. Both pathways are fused with dual temporal convolutions, which speed up training. We further incorporate memory augmented lateral transformers to learn sequential features for classification. We evaluate the proposed model on the LRW dataset. The experiments show that our proposed model outperforms various state-of-the-art models. Incorporating the memory augmented lateral transformers yields a \(3.7\%\) improvement over the SpotFast networks and a \(16.1\%\) improvement over finetuning the original SlowFast networks. The temporal window utilizing word boundaries improves performance by up to \(12.1\%\) by eliminating visual silences from coarticulations.
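The split between the two pathways can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `split_pathways`, the half-open boundary convention, and the example clip length and boundary values are all assumptions made for illustration. It only shows the idea that the spot pathway sees the frames inside the word boundaries while the fast pathway sees every frame.

```python
from typing import List, Tuple


def split_pathways(
    num_frames: int, word_start: int, word_end: int
) -> Tuple[List[int], List[int]]:
    """Return (spot_indices, fast_indices) for one clip.

    Hypothetical helper: word_start is inclusive and word_end is exclusive.
    """
    if not 0 <= word_start < word_end <= num_frames:
        raise ValueError("word boundaries must lie inside the clip")
    # Spot pathway: only the temporal window covered by the word boundaries.
    spot_indices = list(range(word_start, word_end))
    # Fast pathway: all frames, so surrounding context is still available.
    fast_indices = list(range(num_frames))
    return spot_indices, fast_indices


# Example values are assumptions: a 29-frame clip with the word in frames 10-18.
spot, fast = split_pathways(num_frames=29, word_start=10, word_end=19)
print(len(spot), len(fast))
```

In an actual model, these index lists would select frames from a video tensor before each pathway's 3D convolutions; how the paper samples or pads the window is not stated in the abstract.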
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Wiriyathammabhum, P. (2020). SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1332. Springer, Cham. https://doi.org/10.1007/978-3-030-63820-7_63
DOI: https://doi.org/10.1007/978-3-030-63820-7_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63819-1
Online ISBN: 978-3-030-63820-7
eBook Packages: Computer Science, Computer Science (R0)