Abstract
Visual speech recognition (VSR) is a challenging task with many applications, such as supporting speech recognition when the acoustic data is noisy or missing and assisting hearing-impaired people. Modern VSR systems require large amounts of data to perform well. Popular VSR datasets are mostly available for English, and none exist for Bengali. In this paper, we introduce a large-scale Bengali audio-visual dataset named “BenAV”. To the best of our knowledge, BenAV is the first publicly available large-scale audio-visual dataset in the Bengali language. BenAV has a lexicon of 50 words spoken by 128 speakers, for a total of 26,300 utterances. We also apply three existing deep-learning-based VSR models to establish baseline performance on BenAV. We run extensive experiments on two different configurations of the dataset to study the robustness of these models, achieving 98.70% and 82.5% accuracy, respectively. We believe this research provides a basis for developing Bengali lip-reading systems and opens the door to further research on this topic.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Pondit, A., Rukon, M.E.A., Das, A., Kabir, M.A. (2021). BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol 13109. Springer, Cham. https://doi.org/10.1007/978-3-030-92270-2_45
DOI: https://doi.org/10.1007/978-3-030-92270-2_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92269-6
Online ISBN: 978-3-030-92270-2
eBook Packages: Computer Science, Computer Science (R0)