Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish

  • Conference paper
Computer Vision, Imaging and Computer Graphics – Theory and Applications (VISIGRAPP 2017)

Abstract

Speech is the most widely used method of communication between humans and is considered a multisensory process. Although there is a popular belief that speech is something we hear, there is overwhelming evidence that the brain treats speech as something we both hear and see. Much of the research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years there has been increasing interest in Automatic Lip-Reading (ALR) systems, although exploiting the visual information has proved challenging. One of the main problems in ALR is how to make the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit in the video domain confusing and imprecise. In contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work, we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes, with the aim of maximizing word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish corpora with continuous speech (AV@CAR and VLRF), containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words for AV@CAR and VLRF. We also present additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively on phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, together with the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the use of visemes instead of phonemes directly for ALR.
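
As a rough illustration of the mapping idea described in the abstract (not the authors' exact procedure), the following Python sketch groups phonemes into visemes by hierarchically clustering a visual confusion matrix, so that phonemes that a visual-only classifier frequently confuses end up sharing one viseme class. The phoneme list, the confusion counts, and the target number of visemes are all hypothetical placeholders.

    # Illustrative sketch: build a phoneme-to-viseme mapping by clustering
    # visually confusable phonemes. Confusion counts here are synthetic.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    phonemes = ["p", "b", "m", "f", "t", "d", "s", "k", "g", "a", "e", "o"]

    # Placeholder confusion matrix C[i, j]: how often phoneme i is recognised
    # as phoneme j by a visual-only classifier.
    rng = np.random.default_rng(0)
    C = rng.random((len(phonemes), len(phonemes)))
    np.fill_diagonal(C, C.max() + 1.0)        # a phoneme is most similar to itself
    C = C / C.sum(axis=1, keepdims=True)      # rows become confusion probabilities

    # Symmetrise into a similarity, convert to a distance, and cluster.
    S = 0.5 * (C + C.T)
    D = 1.0 - S / S.max()
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")

    n_visemes = 6                             # hypothetical vocabulary length
    labels = fcluster(Z, t=n_visemes, criterion="maxclust")
    phoneme_to_viseme = {p: int(v) for p, v in zip(phonemes, labels)}
    print(phoneme_to_viseme)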

Acknowledgements

This work is partly supported by the Spanish Ministry of Economy and Competitiveness under project grant TIN2017-90124-P, the Ramon y Cajal programme, the Maria de Maeztu Units of Excellence Programme, and the Kristina project funded by the European Union Horizon 2020 research and innovation programme under grant agreement No. 645012.

Author information


Corresponding author

Correspondence to Federico M. Sukno.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Fernandez-Lopez, A., Sukno, F.M. (2019). Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish. In: Cláudio, A., et al. Computer Vision, Imaging and Computer Graphics – Theory and Applications. VISIGRAPP 2017. Communications in Computer and Information Science, vol 983. Springer, Cham. https://doi.org/10.1007/978-3-030-12209-6_15

  • DOI: https://doi.org/10.1007/978-3-030-12209-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12208-9

  • Online ISBN: 978-3-030-12209-6

  • eBook Packages: Computer Science, Computer Science (R0)
