Abstract
The purpose of this study is to develop a robust audio-visual speech recognition system and to investigate the influence of high-speed video data on the recognition accuracy of continuous Russian speech under different noise conditions. The developed experimental setup and the collected multimodal database allow us to explore the impact of high-speed video recordings at various frame rates, from the standard 25 frames per second (fps) up to a high-speed 200 fps. At present, there is no research that objectively characterizes the dependence of speech recognition accuracy on video frame rate, and no relevant audio-visual databases exist for model training. In this paper, we try to fill this gap for continuous Russian speech. Our evaluation experiments show an increase in absolute recognition accuracy of up to 3% and demonstrate that using the high-speed JAI Pulnix camera at 200 fps achieves better recognition results under different acoustically noisy conditions.
Acknowledgments
This research is partially supported by the Russian Foundation for Basic Research (projects No. 15-07-04415, 15-07-04322 and 16-37-60085), by the Council for Grants of the President of the Russian Federation (projects No. MD-254.2017.8, MK-1000.2017.8 and MK-7925.2016.9), by the Government of Russia (grant No. 074-U01), by a grant of the University of West Bohemia (project No. SGS-2016-039), and by the Ministry of Education, Youth and Sports of the Czech Republic (project No. LO1506).
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ivanko, D. et al. (2017). Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_76
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3