BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition

Pondit, Ashish; Rukon, Muhammad Eshaque Ali; Das, Anik; Kabir, Muhammad Ashad

doi:10.1007/978-3-030-92270-2_45

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13109)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Visual speech recognition (VSR) is a challenging task with many applications, such as supporting speech recognition when the acoustic data is noisy or missing and assisting hearing-impaired people. Modern VSR systems require a large amount of data to achieve good performance. Popular VSR datasets are mostly available for the English language, and none exist for Bengali. In this paper, we introduce a large-scale Bengali audio-visual dataset named “BenAV”. To the best of our knowledge, BenAV is the first publicly available large-scale dataset in the Bengali language. BenAV contains a lexicon of 50 words from 128 speakers, with a total of 26,300 utterances. We also apply three existing deep learning based VSR models to provide baseline performance on the BenAV dataset. We run extensive experiments in two different configurations of the dataset to study the robustness of those models and achieve 98.70% and 82.5% accuracy, respectively. We believe that this research provides a basis for developing Bengali lip reading systems and opens the door to further research on this topic.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Chittagong University of Engineering and Technology (CUET), Chattogram, 4349, Bangladesh
Ashish Pondit & Muhammad Eshaque Ali Rukon
Department of Computer Science, St. Francis Xavier University, Nova Scotia, B2G 2W5, Canada
Anik Das
Department of Computer Science and Engineering, Bangladesh University, Dhaka, 1207, Bangladesh
Anik Das
School of Computing, Mathematics and Engineering, Charles Sturt University, NSW, 2795, Australia
Muhammad Ashad Kabir

Corresponding author

Correspondence to Anik Das.

Editor information

Editors and Affiliations

Sampoerna University, Jakarta, Indonesia
Teddy Mantoro
Kyungpook National University, Daegu, Korea (Republic of)
Minho Lee
Sampoerna University, Jakarta, Indonesia
Media Anugerah Ayu
Murdoch University, Murdoch, WA, Australia
Kok Wai Wong
Universitas Indonesia, Depok, Indonesia
Achmad Nizar Hidayanto

About this paper

Cite this paper

Pondit, A., Rukon, M.E.A., Das, A., Kabir, M.A. (2021). BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol 13109. Springer, Cham. https://doi.org/10.1007/978-3-030-92270-2_45

DOI: https://doi.org/10.1007/978-3-030-92270-2_45
Published: 07 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92269-6
Online ISBN: 978-3-030-92270-2
eBook Packages: Computer Science, Computer Science (R0)
