BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition

  • Conference paper
Neural Information Processing (ICONIP 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13109)


Abstract

Visual speech recognition (VSR) is a challenging task with many applications, such as supporting speech recognition when the acoustic signal is noisy or missing and assisting hearing-impaired people. Modern VSR systems require a large amount of data to achieve good performance. Popular VSR datasets exist mostly for English, and none is available for Bengali. In this paper, we introduce a large-scale Bengali audio-visual dataset, named “BenAV”. To the best of our knowledge, BenAV is the first publicly available large-scale dataset in the Bengali language. BenAV contains a lexicon of 50 words from 128 speakers, with a total of 26,300 utterances. We also apply three existing deep-learning-based VSR models to provide baseline performance on the BenAV dataset. We run extensive experiments on two different configurations of the dataset to study the robustness of those models and achieve 98.70% and 82.5% accuracy, respectively. We believe that this research provides a basis for developing Bengali lip-reading systems and opens the door to further research on this topic.
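
As a rough illustration of the word-level VSR setting described in the abstract (a 50-word lexicon classified from mouth-region video clips), the PyTorch sketch below shows a generic spatio-temporal CNN frontend followed by a BiLSTM classifier. The architecture, layer sizes, input shape, and class count are assumptions for illustration only; the three baseline models evaluated in the paper are not specified in this excerpt.

```python
# Illustrative sketch only: a generic word-level VSR classifier, in the spirit
# of CNN + recurrent lipreading baselines. All layer sizes and the 50-class
# output are assumptions for a 50-word lexicon, NOT the authors' exact models.
import torch
import torch.nn as nn


class WordLevelVSR(nn.Module):
    def __init__(self, num_classes: int = 50, hidden_size: int = 256):
        super().__init__()
        # Spatio-temporal frontend over grayscale mouth-region clips
        # shaped (batch, 1, frames, height, width).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame spatial pooling to a fixed-length feature vector.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Temporal model over the sequence of per-frame features.
        self.rnn = nn.LSTM(64, hidden_size, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, 1, T, H, W) -> frontend features (B, 64, T, H', W')
        feats = self.frontend(clips)
        b, c, t, h, w = feats.shape
        # Collapse the spatial dimensions of each frame: (B, T, 64)
        feats = self.pool(feats.transpose(1, 2).reshape(b * t, c, h, w))
        feats = feats.view(b, t, c)
        out, _ = self.rnn(feats)
        # Classify the word from the last time step's bidirectional state.
        return self.classifier(out[:, -1])


if __name__ == "__main__":
    model = WordLevelVSR()
    dummy = torch.randn(2, 1, 29, 88, 88)  # 2 clips, 29 frames, 88x88 mouth crops
    print(model(dummy).shape)              # torch.Size([2, 50])
```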

Notes

  1. http://www.pcfreetime.com/formatfactory.

  2. https://github.com/jiaaro/pydub.

  3. https://github.com/AnikNicks/BenAV-A-New-Bengali-Audio-Visual-Corpus.

  4. https://github.com/itseez/opencv.
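
For context on the footnoted tools (pydub for audio handling, OpenCV for video processing), the snippet below sketches how a clip's audio track and face crops might be extracted. The file name, crop size, and the Haar-cascade face detector are assumptions; the paper's actual preprocessing pipeline is not described in this excerpt.

```python
# Minimal sketch of the kind of preprocessing the footnoted tools suggest:
# pydub to separate the audio track and OpenCV to grab per-frame face crops.
# File names, crop size, and the Haar-cascade detector are assumptions only.
import cv2
from pydub import AudioSegment

CLIP = "speaker001_word01.mp4"  # hypothetical BenAV clip name

# Audio stream: load the clip's audio and save it as a mono WAV file.
audio = AudioSegment.from_file(CLIP).set_channels(1)
audio.export("speaker001_word01.wav", format="wav")

# Visual stream: detect the face in each frame and keep a resized gray crop.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(CLIP)
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces):
        x, y, w, h = faces[0]
        frames.append(cv2.resize(gray[y:y + h, x:x + w], (96, 96)))
cap.release()
print(f"Extracted {len(frames)} face crops from {CLIP}")
```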


Author information

Correspondence to Anik Das.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Pondit, A., Rukon, M.E.A., Das, A., Kabir, M.A. (2021). BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol 13109. Springer, Cham. https://doi.org/10.1007/978-3-030-92270-2_45


  • DOI: https://doi.org/10.1007/978-3-030-92270-2_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92269-6

  • Online ISBN: 978-3-030-92270-2

  • eBook Packages: Computer Science, Computer Science (R0)
