SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading

  • Conference paper
Neural Information Processing (ICONIP 2020)

Abstract

This paper presents a novel deep learning architecture for word-level lipreading. Previous works suggest the potential of incorporating pretrained deep 3D Convolutional Neural Networks as a front-end feature extractor. We introduce SpotFast networks, a variant of the state-of-the-art SlowFast networks for action recognition, which use a temporal window as a spot pathway and all frames as a fast pathway. The spot pathway uses word boundary information, while the fast pathway implicitly models other contexts. Both pathways are fused with dual temporal convolutions, which speed up training. We further incorporate memory augmented lateral transformers to learn sequential features for classification. We evaluate the proposed model on the LRW dataset. The experiments show that the proposed model outperforms various state-of-the-art models, and that incorporating the memory augmented lateral transformers yields a \(3.7\%\) improvement over the SpotFast networks without them and a \(16.1\%\) improvement over finetuning the original SlowFast networks. The temporal window based on word boundaries improves performance by up to \(12.1\%\) by eliminating visual silences from coarticulations.
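As a rough illustration of the two inputs described in the abstract, the sketch below slices one clip into a spot input (the frames inside the word boundaries) and a fast input (all frames). It is not the paper's implementation: the function name `split_pathways`, the clip shape, and the boundary indices are hypothetical values chosen for the example, and the networks, fusion, and transformers are not modelled.

```python
import numpy as np


def split_pathways(frames, word_start, word_end):
    """Return (spot, fast) inputs from a clip of shape (T, H, W).

    Hypothetical helper: the spot input keeps only the frames inside the
    word boundaries, the fast input keeps every frame of the clip.
    """
    num_frames = frames.shape[0]
    if not 0 <= word_start < word_end <= num_frames:
        raise ValueError("need 0 <= word_start < word_end <= number of frames")
    spot = frames[word_start:word_end]  # temporal window covering the word
    fast = frames                       # all frames, including surrounding context
    return spot, fast


# Assumed example values: a 29-frame clip of 88x88 mouth crops,
# with the target word spanning frames 9 to 19.
clip = np.zeros((29, 88, 88), dtype=np.float32)
spot, fast = split_pathways(clip, word_start=9, word_end=20)
print(spot.shape, fast.shape)  # (11, 88, 88) (29, 88, 88)
```

Slicing returns a view rather than a copy, so the spot input shares memory with the full clip until a later step modifies it.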

Author information

Corresponding author

Correspondence to Peratham Wiriyathammabhum.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wiriyathammabhum, P. (2020). SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Communications in Computer and Information Science, vol 1332. Springer, Cham. https://doi.org/10.1007/978-3-030-63820-7_63

  • DOI: https://doi.org/10.1007/978-3-030-63820-7_63

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63819-1

  • Online ISBN: 978-3-030-63820-7

  • eBook Packages: Computer Science, Computer Science (R0)
