Abstract
The emotional state of a speaker is accompanied by physiological changes affecting respiration, phonation, and articulation. These changes manifest mainly in prosodic patterns of F0, energy, and duration, but also in segmental parameters of the speech spectrum. Therefore, our new emotional speech synthesis method is supplemented with spectrum modification. It comprises a non-linear frequency scale transformation of the speech spectral envelope, filtering to emphasize the low or high frequency range, and control of spectral noise by a spectral flatness measure, informed by findings from psychological and phonetic research. The proposed spectral modification is combined with linear modification of F0 mean, F0 range, energy, and duration. Speech resynthesis with modifications intended to represent joy, anger, and sadness is evaluated in a listening test.
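The building blocks named in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' exact formulation: the power-function warp exponent `alpha`, the function names, and the scale parameters are illustrative assumptions; only the spectral flatness measure (geometric over arithmetic mean of the power spectrum) and the linear F0 mean/range modification follow their standard textbook definitions.

```python
import numpy as np

def spectral_flatness(power_spectrum):
    """Spectral flatness measure (SFM): ratio of the geometric to the
    arithmetic mean of the power spectrum. Near 1 for noise-like
    (flat) spectra, near 0 for strongly tonal ones."""
    ps = np.asarray(power_spectrum, dtype=float)
    geometric = np.exp(np.mean(np.log(ps + 1e-12)))  # epsilon guards log(0)
    arithmetic = np.mean(ps)
    return geometric / arithmetic

def warp_envelope(envelope, alpha):
    """Non-linear frequency warping of a spectral envelope sampled at
    uniformly spaced bins, here using a hypothetical power-function
    mapping of the normalized frequency axis (alpha = 1 is identity)."""
    env = np.asarray(envelope, dtype=float)
    norm = np.linspace(0.0, 1.0, len(env))
    warped_pos = norm ** alpha               # non-linear frequency mapping
    return np.interp(warped_pos, norm, env)  # resample envelope at warped bins

def modify_f0(f0, mean_scale, range_scale):
    """Linear modification of the F0 contour: scale the contour mean by
    mean_scale and the deviations around it (the F0 range) by range_scale."""
    f0 = np.asarray(f0, dtype=float)
    mean = np.mean(f0)
    return mean_scale * mean + range_scale * (f0 - mean)
```

For example, `modify_f0(f0, 1.2, 1.5)` raises the mean F0 by 20% and widens the range by 50%, a direction of change often associated with joyful speech, while values below 1 narrow and lower the contour as in sadness.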
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Přibilová, A., Přibil, J. (2009). Spectrum Modification for Emotional Speech Synthesis. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds) Multimodal Signals: Cognitive and Algorithmic Issues. Lecture Notes in Computer Science, vol 5398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00525-1_23
Print ISBN: 978-3-642-00524-4
Online ISBN: 978-3-642-00525-1