Abstract
The primary objective of this work is to compare patterns of vocal emotional expression across distinct linguistic contexts. Five language datasets are used for experimentation, viz. German (EmoDB), English (SAVEE), and three Indian languages: Telugu (IITKGP), Malayalam and Tamil, which vary systematically in typology and linguistic proximity. The hypothesis put forth is that although the selected languages exploit prosodic parameters to differing degrees when expressing a set of basic emotions, viz. anger, fear and happiness, certain underlying similarities exist in terms of prosodic perception. A methodology for estimating and incorporating the supra-segmental parameters contributing to emotional expression, viz. pitch, duration and intensity, is developed and tested against all five datasets. The main contribution of this work is the use of the same prosodic transformation scales for emotion conversion across multilingual test cases, enabling the generation of vocal affect in multiple languages. Objective evaluation revealed maximum correlation for anger expression synthesised by adapting transformation scales from Tamil (0.95) and for fear from Telugu (0.89), while for happiness, scales from the English dataset yielded the best conversion results (0.94). These results are corroborated by perception tests using comparative mean opinion scores (CMOS): a maximum CMOS of 3.8 is obtained for anger and fear, while conversion to happiness yielded a score of 3.3. The experimental findings indicate that although much of the information embedded in prosodic parameters depends on language structure, common trends in emotion perception can be observed across certain languages, providing insights for the development of emotion conversion systems in a multilingual context.
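To make the idea of cross-lingual prosodic transformation scales concrete, the following is a minimal, hypothetical sketch (not the authors' implementation, which relies on dedicated pitch, duration and intensity modification techniques). It assumes frame-level F0 contours, RMS intensity contours and utterance durations have already been extracted; scale factors are derived from a neutral/emotional pair in a source language, applied to a neutral utterance in a target language, and compared to a reference via Pearson correlation. All function and variable names are illustrative.

```python
# Hypothetical sketch of global prosodic transformation scales for emotion
# conversion; contours are assumed to be pre-extracted elsewhere.
import numpy as np


def transformation_scales(neutral, emotional):
    """Pitch, intensity and duration scale factors (emotional / neutral)."""
    f0_n, rms_n, dur_n = neutral
    f0_e, rms_e, dur_e = emotional
    pitch_scale = np.nanmean(f0_e) / np.nanmean(f0_n)    # mean-F0 ratio
    intensity_scale = np.mean(rms_e) / np.mean(rms_n)    # mean-energy ratio
    duration_scale = dur_e / dur_n                        # time-scale factor
    return pitch_scale, intensity_scale, duration_scale


def apply_scales(f0, rms, dur, scales):
    """Apply source-language scales to a neutral target-language utterance."""
    p, i, d = scales
    return f0 * p, rms * i, dur * d


def prosody_correlation(a, b):
    """Pearson correlation between two contours resampled to a common length."""
    n = min(len(a), len(b))
    xa = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(a)), a)
    xb = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(b)), b)
    return np.corrcoef(xa, xb)[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for extracted contours: (F0 contour, RMS contour, duration in s)
    src_neutral = (200 + 10 * rng.standard_normal(120),
                   0.10 + 0.01 * rng.standard_normal(120), 2.0)
    src_anger = (260 + 15 * rng.standard_normal(100),
                 0.16 + 0.02 * rng.standard_normal(100), 1.7)
    scales = transformation_scales(src_neutral, src_anger)

    tgt_f0 = 180 + 8 * rng.standard_normal(140)
    tgt_rms = 0.09 + 0.01 * rng.standard_normal(140)
    conv_f0, conv_rms, conv_dur = apply_scales(tgt_f0, tgt_rms, 2.2, scales)
    print("scales (pitch, intensity, duration):", np.round(scales, 2))
    print("F0 correlation with source anger contour:",
          round(prosody_correlation(conv_f0, src_anger[0]), 2))
```

In an actual conversion system the scaled contours would drive signal-level modification (e.g., time-scale modification for duration and epoch-based or WSOLA-style processing for pitch), rather than being reported directly.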
Notes
The figure and equation are reproduced after obtaining written permission from Verhelst and Roelands (1993).
References
Aihara, R., Takashima, R., Takiguchi, T., & Ariki, Y. (2012). GMM-based emotional voice conversion using spectrum and prosody features. American Journal of Signal Processing, 2(5), 134–138.
Aihara, R., Ueda, R., Takiguchi, T., & Ariki, Y. (2014). Exemplar-based emotional voice conversion using non-negative matrix factorization. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1–7).
Akagi, M., Han, X., Elbarougy, R., Hamada, Y., & Li, J. (2014). Toward affective speech-to-speech translation: Strategy for emotional speech recognition and synthesis in multiple languages. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1–10).
Bakshi, P. M., & Kashyap, S. C. (1982). The constitution of India. Prayagraj: Universal Law Publishing.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In Ninth European Conference on Speech Communication and Technology (pp. 1517–1520).
Burkhardt, F., & Sendlmeier, W. F. (2000). Verification of acoustical correlates of emotional speech using formant-synthesis. In ISCA Tutorial and Research Workshop (ITRW) on speech and emotion (pp. 151–156).
Cabral, J. P., & Oliveira, L. C. (2006). Emovoice: a system to generate emotions in speech. In Ninth International Conference on Spoken Language Processing (pp. 1798–1801).
Cahn, J. E. (1990). The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8(1), 1–19.
Cen, L., Chan, P., Dong, M., & Li, H. (2010). Generating emotional speech from neutral speech. In 7th International Symposium on Chinese Spoken Language Processing (pp. 383–386).
Desai, S., Black, A. W., Yegnanarayana, B., & Prahallad, K. (2010). Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5), 954–964.
Govind, D., & Joy, T. T. (2016). Improving the flexibility of dynamic prosody modification using instants of significant excitation. Circuits, Systems, and Signal Processing, 35(7), 2518–2543.
Govind, D., & Prasanna, S. R. M. (2012). Epoch extraction from emotional speech. In International Conference on Signal Processing and Communications (SPCOM) (pp. 1–5).
Govind, D., & Prasanna, S. M. (2013). Dynamic prosody modification using zero frequency filtered signal. International Journal of Speech Technology, 16(1), 41–54.
Govind, D., Prasanna, S. M., & Yegnanarayana, B. (2011). Neutral to target emotion conversion using source and suprasegmental information. In Twelfth Annual Conference of the International Speech Communication Association (pp. 2969–2972).
Haq, S., Jackson, P. J., & Edge, J. (2009). Speaker-dependent audio-visual emotion recognition. In AVSP (pp. 53–58).
Helander, E., Silén, H., Virtanen, T., & Gabbouj, M. (2011). Voice conversion using dynamic kernel partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 806–817.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (Vol. 1, pp. 373–376).
Kadiri, S. R., & Yegnanarayana, B. (2015). Analysis of singing voice for epoch extraction using zero frequency filtering method. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4260–4264).
Kadiri, S. R., & Yegnanarayana, B. (2017). Epoch extraction from emotional speech using single frequency filtering approach. Speech Communication, 86, 52–63.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International Conference on Contemporary Computing (pp. 485–492). Springer, Berlin.
Luo, Z., Takiguchi, T., & Ariki, Y. (2016). Emotional voice conversion using deep neural networks with MCC and F0 features. In IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS) (pp. 1–5).
Ming, H., Huang, D., Xie, L., Wu, J., Dong, M., & Li, H. (2016a). Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion. In Proceedings of INTERSPEECH.
Ming, H., Huang, D., Xie, L., Zhang, S., Dong, M., & Li, H. (2016b). Exemplar-based sparse representation of timbre and prosody for voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5175–5179).
Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16(4), 369–390.
Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1602–1613.
Nguyen, H. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. S. (2016). High quality voice conversion using prosodic and high-resolution spectral features. Multimedia Tools and Applications, 75(9), 5265–5285.
Pravena, D., & Govind, D. (2016). Expressive speech analysis for epoch extraction using zero frequency filtering approach. In IEEE Students’ Technology Symposium (TechSym) (pp. 240–244).
Pravena, D., & Govind, D. (2017). Development of simulated emotion speech database for excitation source analysis. International Journal of Speech Technology, 20(2), 327–338.
Rachman, L., Liuni, M., Arias, P., Lind, A., Johansson, P., Hall, L., et al. (2018). DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in running speech. Behavior Research Methods, 50(1), 323–343.
Rao, K. S., & Vuppala, A. K. (2013). Non-uniform time scale modification using instants of significant excitation and vowel onset points. Speech Communication, 55(6), 745–756.
Sarkar, P., Haque, A., Dutta, A. K., Reddy, G., Harikrishna, D. M., Dhara, P., & Rao, K. S. (2014). Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu. In Seventh International Conference on Contemporary Computing (IC3) (pp. 473–477).
Schröder, M. (2009). Expressive speech synthesis: Past, present, and possible futures. In Affective information processing (pp. 111–126). London: Springer.
Tao, J., Kang, Y., & Li, A. (2006). Prosody conversion from neutral speech to emotional speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1145–1154.
Theune, M., Meijs, K., Heylen, D., & Ordelman, R. (2006). Generating expressive speech for storytelling applications. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1137–1144.
Toda, T., Black, A. W., & Tokuda, K. (2007). Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2222–2235.
Vekkot, S., Gupta, D., Zakariah, M., & Alotaibi, Y. A. (2019). Hybrid framework for speaker-independent emotion conversion using i-vector PLDA and neural network. IEEE Access, 7, 81883–81902.
Vekkot, S., & Tripathi, S. (2016a). Significance of glottal closure instants detection algorithms in vocal emotion conversion. In International Workshop Soft Computing Applications (pp. 462–473). Springer, Cham.
Vekkot, S., & Tripathi, S. (2016b). Inter-emotion conversion using dynamic time warping and prosody imposition. In International Symposium on Intelligent Systems Technologies and Applications (pp. 913–924). Springer, Cham.
Vekkot, S., & Tripathi, S. (2017). Vocal emotion conversion using WSOLA and linear prediction. In International Conference on Speech and Computer (pp. 777–787). Springer, Cham.
Verhelst, W., & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 2, pp. 554–557).
Verma, R., Sarkar, P., & Rao, K. S. (2015). Conversion of neutral speech to storytelling style speech. In Eighth International Conference on Advances in Pattern Recognition (ICAPR) (pp. 1–6).
Vuppala, A. K., & Kadiri, S. R. (2014). Neutral to anger speech conversion using non-uniform duration modification. In 9th International Conference on Industrial and Information Systems (ICIIS) (pp. 1–4).
Vydana, H. K., Kadiri, S. R., & Vuppala, A. K. (2016). Vowel-based non-uniform prosody modification for emotion conversion. Circuits, Systems, and Signal Processing, 35(5), 1643–1663.
Vydana, H. K., Raju, V. V., Gangashetty, S. V., & Vuppala, A. K. (2015). Significance of emotionally significant areas of speech for emotive to neutral conversion. In International Conference on Mining Intelligence and Knowledge Exploration (pp. 287–296). Springer, Cham.
Wu, C. H., Hsia, C. C., Lee, C. H., & Lin, M. C. (2009). Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1394–1405.
Wu, Z., Virtanen, T., Chng, E. S., & Li, H. (2014). Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1506–1521.
Yadav, J., & Rao, K. S. (2016). Prosodic mapping using neural networks for emotion conversion in Hindi language. Circuits, Systems, and Signal Processing, 35(1), 139–162.
Acknowledgements
This research is supported by the Government of India's Visveswaraya Ph.D. scheme through a scholarship awarded to the first author towards the completion of her Ph.D.
Cite this article
Vekkot, S., Gupta, D. Prosodic transformation in vocal emotion conversion for multi-lingual scenarios: a pilot study. Int J Speech Technol 22, 533–549 (2019). https://doi.org/10.1007/s10772-019-09626-5