Abstract
Integrating virtual agents into real-time interactive virtual applications raises several challenges. The rendering of the character's movements in the virtual scene (locomotion or head rotation) and the 3D binaural rendering of its synthetic speech during these movements must be spatially coordinated. Furthermore, the system must adapt the agent's expressive audiovisual signals to the user's ongoing actions in real time. In this paper, we describe a platform designed to address these challenges: (1) modules for real-time synthesis and spatial rendering of synthetic speech, (2) modules for real-time 3D rendering of facial expressions using a GPU-based 3D graphics engine, and (3) the integration of these modules within an experimental platform that uses gesture as an input modality. A new model of phoneme-dependent human speech directivity patterns is included in the speech synthesis system, so that the agent can move through the virtual scene with realistic 3D visual and audio rendering. Future applications of this platform include perceptual studies of multimodal perception and interaction, expressive real-time question answering systems, and interactive arts.
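To picture the directivity model, consider the following minimal sketch in Python. It is not the paper's implementation; the gain tables, function names, and sampling resolution are all hypothetical assumptions for illustration. It shows how per-phoneme directivity gains, indexed by the angle between the talking head's facing direction and the listener, could scale each synthesis frame before it is handed to a separate binaural renderer.

import math

# Hypothetical per-phoneme directivity tables: linear gain sampled every
# 30 degrees of azimuth (0..180) relative to the talker's facing direction.
# Real measurements are far denser; these values are illustrative only.
DIRECTIVITY = {
    "a": [1.00, 0.95, 0.85, 0.70, 0.55, 0.45, 0.40],  # open vowel, broad pattern
    "s": [1.00, 0.90, 0.70, 0.45, 0.30, 0.20, 0.15],  # fricative, more directive
}

def directivity_gain(phoneme, azimuth_deg):
    """Linearly interpolate the directivity gain for a phoneme at a given
    azimuth (0 = listener directly in front of the talking head)."""
    table = DIRECTIVITY.get(phoneme, DIRECTIVITY["a"])  # default pattern
    az = abs(azimuth_deg) % 360.0
    if az > 180.0:
        az = 360.0 - az                 # patterns assumed left/right symmetric
    pos = az / 30.0                     # position in the 30-degree-step table
    i = min(int(pos), len(table) - 2)
    frac = pos - i
    return table[i] * (1.0 - frac) + table[i + 1] * frac

def render_frame(samples, phoneme, agent_yaw_deg, listener_azimuth_deg):
    """Scale one synthesis frame by the phoneme-dependent directivity
    before passing it on to the binaural renderer."""
    relative = listener_azimuth_deg - agent_yaw_deg
    gain = directivity_gain(phoneme, relative)
    return [s * gain for s in samples]

For example, render_frame(frame, "s", agent_yaw_deg=90.0, listener_azimuth_deg=0.0) attenuates a sibilant to under half its frontal level once the agent turns away, illustrating the kind of coupling between scene motion and audio rendering that the platform coordinates.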
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Martin, J.-C., et al. (2007). 3D Audiovisual Rendering and Real-Time Interactive Control of Expressivity in a Talking Head. In: Pelachaud, C., Martin, J.-C., André, E., Chollet, G., Karpouzis, K., Pelé, D. (eds.) Intelligent Virtual Agents. IVA 2007. Lecture Notes in Computer Science, vol. 4722. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74997-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74996-7
Online ISBN: 978-3-540-74997-4