An automatic multimodal speech recognition system with audio and video information

  • Intellectual Control Systems
Automation and Remote Control

Abstract

This paper presents the mathematical model and software implementation of an automatic Russian speech recognition system that applies digital processing and analysis to audiovisual signals from a microphone and a video camera. It describes the probabilistic modeling of audiovisual speech based on coupled hidden Markov models, the information fusion methods that assign weight coefficients to the audio and video speech modalities, and the parametric representation of the signals. Quantitative results for multimodal recognition of continuous Russian speech indicate high accuracy and reliability of the automatic system.
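
The abstract does not give the fusion formula, so the following is only a minimal sketch of one common way to weight modalities in multi-stream HMM recognizers: a log-linear combination of per-modality log-likelihoods whose weights sum to one. The function names, the `audio_weight` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
import math


def fuse_log_likelihoods(log_p_audio, log_p_video, audio_weight):
    """Weighted log-linear combination of per-modality log-likelihoods.

    Assumption: the video weight is 1 - audio_weight, a common convention
    in multi-stream models, not necessarily the one used in the paper.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    return audio_weight * log_p_audio + video_weight * log_p_video


def pick_hypothesis(scores, audio_weight):
    """Return the hypothesis with the highest fused score.

    scores maps each hypothesis to a pair
    (audio log-likelihood, video log-likelihood).
    """
    return max(
        scores,
        key=lambda h: fuse_log_likelihoods(*scores[h], audio_weight),
    )


# Hypothetical scores for two candidate words: the audio stream favors
# "da", the video stream favors "net".
scores = {
    "da": (math.log(0.6), math.log(0.2)),
    "net": (math.log(0.4), math.log(0.8)),
}

print(pick_hypothesis(scores, audio_weight=1.0))  # audio only
print(pick_hypothesis(scores, audio_weight=0.3))  # video weighted more heavily
```

In a sketch like this, lowering `audio_weight` shifts the decision toward the video stream, which is the usual motivation for weighting: the audio stream can be trusted less when the acoustic signal is noisy.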

Author information

Corresponding author

Correspondence to A. A. Karpov.

Additional information

Original Russian Text © A.A. Karpov, 2014, published in Avtomatika i Telemekhanika, 2014, No. 12, pp. 125–138.

About this article

Cite this article

Karpov, A.A. An automatic multimodal speech recognition system with audio and video information. Autom Remote Control 75, 2190–2200 (2014). https://doi.org/10.1134/S000511791412008X
