
Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions

  • Conference paper
Speech and Computer (SPECOM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)


Abstract

The purpose of this study is to develop a robust audio-visual speech recognition system and to investigate the influence of high-speed video data on the recognition accuracy of continuous Russian speech under different noisy conditions. The developed experimental setup and the collected multimodal database allow us to explore the impact of high-speed video recordings at various frame rates, from the standard 25 frames per second (fps) up to 200 fps. At the moment, there is no research that objectively characterizes the dependence of speech recognition accuracy on the video frame rate, and there are no relevant audio-visual databases for model training. In this paper, we try to fill this gap for continuous Russian speech. Our evaluation experiments show an increase in absolute recognition accuracy of up to 3% and demonstrate that using the high-speed JAI Pulnix camera at 200 fps allows achieving better recognition results under different acoustically noisy conditions.
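The abstract does not describe how the lower frame rates were obtained. One plausible way to run such a comparison, sketched below in Python purely as an assumption and not taken from the paper, is to decimate the native 200 fps stream so that the same recordings can be evaluated at 25, 50, 100 and 200 fps. All names in the snippet (subsample_frames, SOURCE_FPS) are illustrative.

# Illustrative sketch only (not from the paper): emulating lower frame rates by
# decimating a native 200 fps recording, so the same material can be evaluated
# at 25, 50, 100 and 200 fps.

from typing import List, Sequence

SOURCE_FPS = 200  # assumed native frame rate of the high-speed camera


def subsample_frames(frames: Sequence, target_fps: int,
                     source_fps: int = SOURCE_FPS) -> List:
    """Keep every (source_fps // target_fps)-th frame to emulate a slower camera."""
    if source_fps % target_fps != 0:
        raise ValueError("target_fps must evenly divide source_fps")
    step = source_fps // target_fps
    return list(frames[::step])


# A 2-second clip at 200 fps has 400 frames; emulating 25 fps keeps every
# 8th frame, i.e. 50 frames.
clip = list(range(400))  # placeholder for decoded video frames
for fps in (25, 50, 100, 200):
    print(fps, "fps ->", len(subsample_frames(clip, fps)), "frames")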



Acknowledgments

This research is partially supported by the Russian Foundation for Basic Research (projects No. 15-07-04415, 15-07-04322 and 16-37-60085), by the Council for Grants of the President of the Russian Federation (projects No. MD-254.2017.8, MK-1000.2017.8 and MK-7925.2016.9), by the Government of Russia (grant No. 074-U01), by a grant of the University of West Bohemia (project No. SGS-2016-039), and by the Ministry of Education, Youth and Sports of the Czech Republic (project No. LO1506).

Author information


Corresponding author

Correspondence to Denis Ivanko.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ivanko, D. et al. (2017). Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_76

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_76

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science, Computer Science (R0)
