Abstract
Speech is the most common form of communication between humans and is considered a multisensory process. Although speech is popularly regarded as something we hear, there is overwhelming evidence that the brain treats it as something we both hear and see. Most research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years there has been growing interest in Automatic Lip-Reading (ALR) systems, although exploiting the visual information has proved challenging. One of the main problems in ALR is making the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit in the video domain confusing and imprecise: in contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes, with the goal of maximizing word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region, and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish continuous-speech corpora (AV@CAR and VLRF) containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words for AV@CAR (resp. VLRF). We also present additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively on phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, together with the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the use of visemes rather than raw phonemes for ALR.
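As a concrete illustration of the mapping construction described above, one standard way to derive visemes automatically is to cluster a phoneme confusion matrix produced by a visual-only classifier, grouping phonemes that are frequently confused with one another. The sketch below is ours, not the paper's code: the toy phoneme subset, the random confusion counts, and the choice of four clusters are illustrative assumptions.

```python
# Minimal sketch: derive a phoneme-to-viseme mapping by hierarchically
# clustering a phoneme confusion matrix, where entry (i, j) counts how often
# phoneme i was recognized as phoneme j by a visual-only classifier.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phonemes = ["p", "b", "m", "f", "v", "t", "d", "s"]   # toy subset (assumption)
confusion = np.random.default_rng(0).random((8, 8))   # placeholder counts

# Symmetrize and normalize: frequently confused pairs -> high visual similarity.
sim = (confusion + confusion.T) / 2.0
sim /= sim.max()
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# Average-linkage agglomerative clustering over the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")       # e.g. 4 visemes (assumption)
viseme_map = {ph: int(lb) for ph, lb in zip(phonemes, labels)}
print(viseme_map)  # e.g. {'p': 1, 'b': 1, 'm': 1, 'f': 2, ...}
```

The number of clusters plays the role of the vocabulary length discussed in the abstract; in practice it would be selected by maximizing word recognition on held-out data.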
Acknowledgements
This work is partly supported by the Spanish Ministry of Economy and Competitiveness under project grant TIN2017-90124-P, the Ramon y Cajal programme, the Maria de Maeztu Units of Excellence Programme, and the Kristina project funded by the European Union Horizon 2020 research and innovation programme under grant agreement No. 645012.