Using the Tandem Approach for AF Classification in an AVSR System

Gan, Tian; Menzel, Wolfgang; Zhang, Jianwei

doi:10.1007/978-3-540-87734-9_94

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5264)

Included in the following conference series:

International Symposium on Neural Networks

Abstract

This paper describes an audio-visual speech recognition (AVSR) system based on articulatory features (AF). It implements a tandem approach in which artificial neural networks (ANN), specifically multi-layer perceptrons (MLP), serve as posterior probability estimators that transform raw input data into more abstract articulatory features. Such an approach is particularly well suited to cases where relatively little training data is available, a situation that is typical for AVSR. In addition, the MLP feature extraction results are presented, together with an analysis of recognition accuracy and confusions. Our AF-based AVSR system has been trained on the audio-visual speech corpus VIDTIMIT, which contains conversational speech based on a medium-size vocabulary of more than 1200 words.
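The tandem idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature groups ("voicing", "manner"), the layer sizes, the 13-dimensional input frame, and the random untrained weights are all assumptions chosen for illustration. Each MLP estimates posterior probabilities over the values of one articulatory feature, and the log posteriors of all MLPs are concatenated into the vector that a conventional recognizer would consume. Tandem systems often decorrelate this vector as well, a step omitted here.

```python
import math
import random

def softmax(scores):
    """Turn raw output scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine(weights, biases, inputs):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_mlp(n_in, n_hidden, n_out, rng):
    """Random, untrained weights (assumption); a real system trains them on labelled frames."""
    def layer(rows, cols):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        return weights, [0.0] * rows
    return {"hidden": layer(n_hidden, n_in), "out": layer(n_out, n_hidden)}

def mlp_posteriors(mlp, frame):
    """One tanh hidden layer followed by a softmax over the values of one feature."""
    hidden = [math.tanh(h) for h in affine(*mlp["hidden"], frame)]
    return softmax(affine(*mlp["out"], hidden))

def tandem_features(mlps, frame):
    """Concatenate the log posteriors of every feature classifier into one vector."""
    features = []
    for name in sorted(mlps):
        features.extend(math.log(p) for p in mlp_posteriors(mlps[name], frame))
    return features

rng = random.Random(0)
# Hypothetical feature groups: 2 voicing values and 5 manner values.
mlps = {"voicing": init_mlp(13, 8, 2, rng), "manner": init_mlp(13, 8, 5, rng)}
frame = [rng.gauss(0.0, 1.0) for _ in range(13)]  # stand-in for one acoustic or visual frame
features = tandem_features(mlps, frame)
```

Here `features` holds one log posterior per articulatory feature value (seven in this toy setup). Using one small classifier per feature group, rather than a single network over all phone classes, keeps the number of output classes per network low, which is one plausible reason such a setup copes with limited training data.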

Author information

Authors and Affiliations

CINACS, Department of Informatics, University of Hamburg, Vogt-Koelln-Str. 30, 22527, Hamburg, Germany
Tian Gan, Wolfgang Menzel & Jianwei Zhang

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Fuchun Sun
Institute TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Vogt-Koelln-Straße 30, 22527, Hamburg, Germany
Jianwei Zhang
Intel China Research Center, 8/F, Peking University, Department of Machine Intelligence, 100871, Beijing, China
Ying Tan
Department of Mathematics, Southeast University, 210096, Nanjing, China
Jinde Cao
Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av.IPN 2508, 07360, México D.F., México
Wen Yu

About this paper

Cite this paper

Gan, T., Menzel, W., Zhang, J. (2008). Using the Tandem Approach for AF Classification in an AVSR System. In: Sun, F., Zhang, J., Tan, Y., Cao, J., Yu, W. (eds) Advances in Neural Networks - ISNN 2008. ISNN 2008. Lecture Notes in Computer Science, vol 5264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87734-9_94

DOI: https://doi.org/10.1007/978-3-540-87734-9_94
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87733-2
Online ISBN: 978-3-540-87734-9
eBook Packages: Computer Science, Computer Science (R0)
