Abstract
This paper presents the mathematical model and software implementation of an automatic Russian speech recognition system that applies digital processing and analysis to audiovisual signals from a microphone and a video camera. It describes probabilistic modeling of audiovisual speech with coupled hidden Markov models, information fusion methods that assign weight coefficients to the audio and video speech modalities, and the parametric representation of the signals. Quantitative results for multimodal recognition of continuous Russian speech indicate high accuracy and reliability of the automatic system.
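The idea of fusing modalities with weight coefficients can be illustrated with a small sketch. It is not the system described in the paper: it assumes a common stream-weighting scheme in which each hypothesis receives a weighted sum of its audio and video log-likelihoods, with the two weights summing to one. The function names, the example words, and the scores are hypothetical.

```python
def fuse_log_likelihoods(log_p_audio, log_p_video, audio_weight):
    """Combine per-modality log-likelihoods with stream weights.

    Assumption (not taken from the paper): the video weight is
    1 - audio_weight, so the two weights sum to one.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    return audio_weight * log_p_audio + video_weight * log_p_video


def pick_hypothesis(scores, audio_weight):
    """Return the hypothesis with the highest fused score.

    scores maps a hypothesis label to a pair
    (audio log-likelihood, video log-likelihood).
    """
    return max(
        scores,
        key=lambda label: fuse_log_likelihoods(*scores[label], audio_weight),
    )


# Hypothetical scores: audio favors "da", video favors "net".
example_scores = {"da": (-2.0, -6.0), "net": (-5.0, -1.0)}
clean_audio_choice = pick_hypothesis(example_scores, audio_weight=0.9)
noisy_audio_choice = pick_hypothesis(example_scores, audio_weight=0.1)
```

With a high audio weight the audio stream dominates the decision; lowering it, as one might in acoustic noise, lets the video stream decide instead.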
Additional information
Original Russian Text © A.A. Karpov, 2014, published in Avtomatika i Telemekhanika, 2014, No. 12, pp. 125–138.
About this article
Cite this article
Karpov, A.A. An automatic multimodal speech recognition system with audio and video information. Autom Remote Control 75, 2190–2200 (2014). https://doi.org/10.1134/S000511791412008X