Abstract
The primary objective of this work is to compare patterns of vocal emotional expression across distinct linguistic contexts. Five language datasets are used for experimentation, viz. German (EmoDB), English (SAVEE), and three Indian languages: Telugu (IITKGP), Malayalam and Tamil, which vary systematically in typology and linguistic proximity. The hypothesis put forth is that although the selected languages exploit prosodic parameters to differing degrees when expressing a set of basic emotions, viz. anger, fear and happiness, certain underlying similarities exist in terms of prosodic perception. A methodology for estimating and incorporating the supra-segmental parameters contributing to emotional expression, viz. pitch, duration and intensity, is developed and tested against all five datasets. The main contribution of this work is the use of the same prosodic transformation scales for emotion conversion across multilingual test cases, enabling the generation of vocal affect in multiple languages. Objective evaluation revealed maximum correlation for anger expression synthesised by adapting transformation scales from Tamil (0.95) and for fear from Telugu (0.89), while for happiness, scales from the English dataset yielded the best conversion results (0.94). These results are corroborated by perception tests using comparative mean opinion scores (CMOS): a maximum CMOS of 3.8 is obtained for anger and fear, while conversion to happiness yielded a score of 3.3. The experimental findings indicate that although much of the information embedded in prosodic parameters depends on language structure, common trends in emotion perception can be observed across certain languages, providing insights for the development of emotion conversion systems in a multilingual context.
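To make the idea of cross-lingual prosodic transformation scales concrete, the following is a minimal, hypothetical sketch (not the authors' implementation, which relies on dedicated pitch, duration and intensity modification techniques). It assumes frame-level F0 contours, RMS intensity contours and utterance durations have already been extracted; scale factors are derived from a neutral/emotional pair in a source language, applied to a neutral utterance in a target language, and compared to a reference via Pearson correlation. All function and variable names are illustrative.

```python
# Hypothetical sketch of global prosodic transformation scales for emotion
# conversion; contours are assumed to be pre-extracted elsewhere.
import numpy as np


def transformation_scales(neutral, emotional):
    """Pitch, intensity and duration scale factors (emotional / neutral)."""
    f0_n, rms_n, dur_n = neutral
    f0_e, rms_e, dur_e = emotional
    pitch_scale = np.nanmean(f0_e) / np.nanmean(f0_n)    # mean-F0 ratio
    intensity_scale = np.mean(rms_e) / np.mean(rms_n)    # mean-energy ratio
    duration_scale = dur_e / dur_n                        # time-scale factor
    return pitch_scale, intensity_scale, duration_scale


def apply_scales(f0, rms, dur, scales):
    """Apply source-language scales to a neutral target-language utterance."""
    p, i, d = scales
    return f0 * p, rms * i, dur * d


def prosody_correlation(a, b):
    """Pearson correlation between two contours resampled to a common length."""
    n = min(len(a), len(b))
    xa = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(a)), a)
    xb = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(b)), b)
    return np.corrcoef(xa, xb)[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for extracted contours: (F0 contour, RMS contour, duration in s)
    src_neutral = (200 + 10 * rng.standard_normal(120),
                   0.10 + 0.01 * rng.standard_normal(120), 2.0)
    src_anger = (260 + 15 * rng.standard_normal(100),
                 0.16 + 0.02 * rng.standard_normal(100), 1.7)
    scales = transformation_scales(src_neutral, src_anger)

    tgt_f0 = 180 + 8 * rng.standard_normal(140)
    tgt_rms = 0.09 + 0.01 * rng.standard_normal(140)
    conv_f0, conv_rms, conv_dur = apply_scales(tgt_f0, tgt_rms, 2.2, scales)
    print("scales (pitch, intensity, duration):", np.round(scales, 2))
    print("F0 correlation with source anger contour:",
          round(prosody_correlation(conv_f0, src_anger[0]), 2))
```

In an actual conversion system the scaled contours would drive signal-level modification (e.g., time-scale modification for duration and epoch-based or WSOLA-style processing for pitch), rather than being reported directly.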
Notes
The figure and equation are reproduced after obtaining written permission from Verhelst and Roelands (1993).
References
Aihara, R., Takashima, R., Takiguchi, T., & Ariki, Y. (2012). GMM-based emotional voice conversion using spectrum and prosody features. American Journal of Signal Processing, 2(5), 134–138.
Aihara, R., Ueda, R., Takiguchi, T., & Ariki, Y. (2014). Exemplar-based emotional voice conversion using non-negative matrix factorization. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1–7).
Akagi, M., Han, X., Elbarougy, R., Hamada, Y., & Li, J. (2014). Toward affective speech-to-speech translation: Strategy for emotional speech recognition and synthesis in multiple languages. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1–10).
Bakshi, P. M., & Kashyap, S. C. (1982). The constitution of India. Prayagraj: Universal Law Publishing.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In Ninth European Conference on Speech Communication and Technology (pp. 1517–1520).
Burkhardt, F., & Sendlmeier, W. F. (2000). Verification of acoustical correlates of emotional speech using formant-synthesis. In ISCA Tutorial and Research Workshop (ITRW) on speech and emotion (pp. 151–156).
Cabral, J. P., & Oliveira, L. C. (2006). Emovoice: a system to generate emotions in speech. In Ninth International Conference on Spoken Language Processing (pp. 1798–1801).
Cahn, J. E. (1990). The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8(1), 1–19.
Cen, L., Chan, P., Dong, M., & Li, H. (2010). Generating emotional speech from neutral speech. In 7th International Symposium on Chinese Spoken Language Processing (pp. 383–386).
Desai, S., Black, A. W., Yegnanarayana, B., & Prahallad, K. (2010). Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5), 954–964.
Govind, D., & Joy, T. T. (2016). Improving the flexibility of dynamic prosody modification using instants of significant excitation. Circuits, Systems, and Signal Processing, 35(7), 2518–2543.
Govind, D., & Prasanna, S. R. M. (2012). Epoch extraction from emotional speech. In International Conference on Signal Processing and Communications (SPCOM) (pp. 1–5).
Govind, D., & Prasanna, S. M. (2013). Dynamic prosody modification using zero frequency filtered signal. International Journal of Speech Technology, 16(1), 41–54.
Govind, D., Prasanna, S. M., & Yegnanarayana, B. (2011). Neutral to target emotion conversion using source and suprasegmental information. In Twelfth Annual Conference of the International Speech Communication Association (pp. 2969–2972).
Haq, S., Jackson, P. J., & Edge, J. (2009). Speaker-dependent audio-visual emotion recognition. In AVSP (pp. 53–58).
Helander, E., Silén, H., Virtanen, T., & Gabbouj, M. (2011). Voice conversion using dynamic kernel partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 806–817.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (Vol. 1, pp. 373–376).
Kadiri, S. R., & Yegnanarayana, B. (2015). Analysis of singing voice for epoch extraction using zero frequency filtering method. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4260–4264).
Kadiri, S. R., & Yegnanarayana, B. (2017). Epoch extraction from emotional speech using single frequency filtering approach. Speech Communication, 86, 52–63.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International Conference on Contemporary Computing (pp. 485–492). Springer, Berlin.
Luo, Z., Takiguchi, T., & Ariki, Y. (2016). Emotional voice conversion using deep neural networks with MCC and F0 features. In IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS) (pp. 1–5).
Ming, H., Huang, D., Xie, L., Wu, J., Dong, M., & Li, H. (2016a). Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion. In Proceedings of INTERSPEECH.
Ming, H., Huang, D., Xie, L., Zhang, S., Dong, M., & Li, H. (2016b). Exemplar-based sparse representation of timbre and prosody for voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5175–5179).
Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16(4), 369–390.
Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1602–1613.
Nguyen, H. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. S. (2016). High quality voice conversion using prosodic and high-resolution spectral features. Multimedia Tools and Applications, 75(9), 5265–5285.
Pravena, D., & Govind, D. (2016). Expressive speech analysis for epoch extraction using zero frequency filtering approach. In IEEE Students’ Technology Symposium (TechSym) (pp. 240–244).
Pravena, D., & Govind, D. (2017). Development of simulated emotion speech database for excitation source analysis. International Journal of Speech Technology, 20(2), 327–338.
Rachman, L., Liuni, M., Arias, P., Lind, A., Johansson, P., Hall, L., et al. (2018). DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in running speech. Behavior Research Methods, 50(1), 323–343.
Rao, K. S., & Vuppala, A. K. (2013). Non-uniform time scale modification using instants of significant excitation and vowel onset points. Speech Communication, 55(6), 745–756.
Sarkar, P., Haque, A., Dutta, A. K., Reddy, G., Harikrishna, D. M., Dhara, P., & Rao, K. S. (2014). Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu. In Seventh International Conference on Contemporary Computing (IC3) (pp. 473–477).
Schröder, M. (2009). Expressive speech synthesis: Past, present, and possible futures. In Affective information processing (pp. 111–126). London: Springer.
Tao, J., Kang, Y., & Li, A. (2006). Prosody conversion from neutral speech to emotional speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1145–1154.
Theune, M., Meijs, K., Heylen, D., & Ordelman, R. (2006). Generating expressive speech for storytelling applications. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1137–1144.
Toda, T., Black, A. W., & Tokuda, K. (2007). Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2222–2235.
Vekkot, S., Gupta, D., Zakariah, M., & Alotaibi, Y. A. (2019). Hybrid framework for speaker-independent emotion conversion using i-vector PLDA and neural network. IEEE Access, 7, 81883–81902.
Vekkot, S., & Tripathi, S. (2016a). Significance of glottal closure instants detection algorithms in vocal emotion conversion. In International Workshop Soft Computing Applications (pp. 462–473). Springer, Cham.
Vekkot, S., & Tripathi, S. (2016b). Inter-emotion conversion using dynamic time warping and prosody imposition. In International Symposium on Intelligent Systems Technologies and Applications (pp. 913–924). Springer, Cham.
Vekkot, S., & Tripathi, S. (2017). Vocal emotion conversion using WSOLA and linear prediction. In International Conference on Speech and Computer (pp. 777–787). Springer, Cham.
Verhelst, W., & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 2, pp. 554–557).
Verma, R., Sarkar, P., & Rao, K. S. (2015). Conversion of neutral speech to storytelling style speech. In Eighth International Conference on Advances in Pattern Recognition (ICAPR) (pp. 1–6).
Vuppala, A. K., & Kadiri, S. R. (2014). Neutral to anger speech conversion using non-uniform duration modification. In 9th International Conference on Industrial and Information Systems (ICIIS) (pp. 1–4).
Vydana, H. K., Kadiri, S. R., & Vuppala, A. K. (2016). Vowel-based non-uniform prosody modification for emotion conversion. Circuits, Systems, and Signal Processing, 35(5), 1643–1663.
Vydana, H. K., Raju, V. V., Gangashetty, S. V., & Vuppala, A. K. (2015). Significance of emotionally significant areas of speech for emotive to neutral conversion. In International Conference on Mining Intelligence and Knowledge Exploration (pp. 287–296). Springer, Cham.
Wu, C. H., Hsia, C. C., Lee, C. H., & Lin, M. C. (2009). Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1394–1405.
Wu, Z., Virtanen, T., Chng, E. S., & Li, H. (2014). Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1506–1521.
Yadav, J., & Rao, K. S. (2016). Prosodic mapping using neural networks for emotion conversion in Hindi language. Circuits, Systems, and Signal Processing, 35(1), 139–162.
Acknowledgements
This research is supported by the Government of India's Visveswaraya Ph.D. scheme through a scholarship awarded to the first author towards the completion of her Ph.D.
Cite this article
Vekkot, S., Gupta, D. Prosodic transformation in vocal emotion conversion for multi-lingual scenarios: a pilot study. Int J Speech Technol 22, 533–549 (2019). https://doi.org/10.1007/s10772-019-09626-5