SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading

Wiriyathammabhum, Peratham

doi:10.1007/978-3-030-63820-7_63

Peratham Wiriyathammabhum ORCID: orcid.org/0000-0001-5567-3104

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1332)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

This paper presents a novel deep learning architecture for word-level lipreading. Previous works suggest the potential of incorporating pretrained deep 3D Convolutional Neural Networks as a front-end feature extractor. We introduce SpotFast networks, a variant of the state-of-the-art SlowFast networks for action recognition, which use a temporal window as a spot pathway and all frames as a fast pathway. The spot pathway uses word boundary information, while the fast pathway implicitly models other contexts. Both pathways are fused with dual temporal convolutions, which speed up training. We further incorporate memory augmented lateral transformers to learn sequential features for classification. We evaluate the proposed model on the LRW dataset. The experiments show that the proposed model outperforms various state-of-the-art models, and that incorporating the memory augmented lateral transformers yields a \(3.7\%\) improvement over the SpotFast networks and a \(16.1\%\) improvement over finetuning the original SlowFast networks. The temporal window utilizing word boundaries improves performance by up to \(12.1\%\) by eliminating visual silences from coarticulations.
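The two-pathway input described in the abstract can be illustrated with a minimal sketch. It only shows how one clip might be split into a windowed spot input and a full-length fast input; it is not the paper's implementation. The function name `split_pathways`, the inclusive index convention for `word_start` and `word_end`, and the 29-frame clip length are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_pathways(
    frames: Sequence[Frame], word_start: int, word_end: int
) -> Tuple[List[Frame], List[Frame]]:
    """Return (spot_input, fast_input) for one clip.

    word_start and word_end are hypothetical inclusive frame indices
    marking the word boundaries; the paper's exact windowing may differ.
    """
    if not 0 <= word_start <= word_end < len(frames):
        raise ValueError("word boundaries must lie inside the clip")
    # Spot pathway: only the frames inside the word boundaries.
    spot_input = list(frames[word_start : word_end + 1])
    # Fast pathway: every frame, so surrounding context stays visible.
    fast_input = list(frames)
    return spot_input, fast_input


clip = list(range(29))  # stand-in for 29 video frames (assumed clip length)
spot, fast = split_pathways(clip, word_start=10, word_end=18)
print(len(spot), len(fast))  # 9 29
```

In this sketch the spot input drops frames outside the word boundaries, which is where the abstract locates the visual silences, while the fast input keeps the whole clip as context.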

Author information

Authors and Affiliations

University of Maryland, College Park, MD, USA
Peratham Wiriyathammabhum

Authors

Peratham Wiriyathammabhum

Corresponding author

Correspondence to Peratham Wiriyathammabhum.

Editor information

Editors and Affiliations

Department of AI, Ping An Life, Shenzhen, China
Haiqin Yang
Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Kitsuchart Pasupa
City University of Hong Kong, Kowloon, Hong Kong
Andrew Chi-Sing Leung
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong
James T. Kwok
School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
Jonathan H. Chan
The Chinese University of Hong Kong, New Territories, Hong Kong
Irwin King

About this paper

Cite this paper

Wiriyathammabhum, P. (2020). SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1332. Springer, Cham. https://doi.org/10.1007/978-3-030-63820-7_63

DOI: https://doi.org/10.1007/978-3-030-63820-7_63
Published: 17 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63819-1
Online ISBN: 978-3-030-63820-7
eBook Packages: Computer Science, Computer Science (R0)
