Abstract
Speech is the most common form of communication between humans and is considered a multisensory process. Although speech is popularly regarded as something we hear, there is overwhelming evidence that the brain treats it as something we both hear and see. Most research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years there has been growing interest in Automatic Lip-Reading (ALR) systems, although exploiting the visual information has proved challenging. One of the main problems in ALR is making the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit in the video domain confusing and imprecise: in contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes, with the goal of maximizing word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region, and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish continuous-speech corpora (AV@CAR and VLRF) containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words for AV@CAR (resp. VLRF). We also present additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively on phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, together with the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the use of visemes rather than raw phonemes for ALR.
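As a concrete illustration of the mapping construction described above, one standard way to derive visemes automatically is to cluster a phoneme confusion matrix produced by a visual-only classifier, grouping phonemes that are frequently confused with one another. The sketch below is ours, not the paper's code: the toy phoneme subset, the random confusion counts, and the choice of four clusters are illustrative assumptions.

```python
# Minimal sketch: derive a phoneme-to-viseme mapping by hierarchically
# clustering a phoneme confusion matrix, where entry (i, j) counts how often
# phoneme i was recognized as phoneme j by a visual-only classifier.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phonemes = ["p", "b", "m", "f", "v", "t", "d", "s"]   # toy subset (assumption)
confusion = np.random.default_rng(0).random((8, 8))   # placeholder counts

# Symmetrize and normalize: frequently confused pairs -> high visual similarity.
sim = (confusion + confusion.T) / 2.0
sim /= sim.max()
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# Average-linkage agglomerative clustering over the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")       # e.g. 4 visemes (assumption)
viseme_map = {ph: int(lb) for ph, lb in zip(phonemes, labels)}
print(viseme_map)  # e.g. {'p': 1, 'b': 1, 'm': 1, 'f': 2, ...}
```

The number of clusters plays the role of the vocabulary length discussed in the abstract; in practice it would be selected by maximizing word recognition on held-out data.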
Acknowledgements
This work is partly supported by the Spanish Ministry of Economy and Competitiveness under project grant TIN2017-90124-P, the Ramon y Cajal programme, the Maria de Maeztu Units of Excellence Programme, and the Kristina project funded by the European Union Horizon 2020 research and innovation programme under grant agreement No. 645012.