Abstract
Advances in emotional speech recognition and synthesis depend heavily on the availability of annotated emotional speech corpora. As a low-resource language, Thai critically lacks emotional speech corpora, although a few corpora have been constructed for speech recognition and synthesis. This paper presents the design of a Thai emotional speech corpus, named EMOLA, its construction and annotation process, and an analysis of the resulting data. In the corpus design, four basic emotion types, with twelve subtypes, are defined with reference to the Pleasure-Arousal-Dominance (PAD) emotional state model. To construct the corpus, a series of Thai television dramas (1397 min in total) was selected, and approximately 868 min of its video clips were annotated. As a result, 8987 transcriptions of conversation turns were derived, each tagged with one basic type and a few subtypes. Finally, an analysis was conducted to characterize the corpus through three sets of statistics: collection-level, annotator-oriented, and actor-oriented statistics.
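The design summarized above amounts to a simple annotation schema: each conversation turn carries exactly one of the four basic emotion types plus any applicable subtypes, with the type inventory grounded in the PAD model. The following is a minimal, hypothetical sketch of such a record in Python; the class, field names, validation helper, and optional PAD triple are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

@dataclass
class EmolaTurn:
    """One annotated conversation turn (hypothetical layout, for illustration)."""
    clip_id: str                                       # source video clip identifier
    transcription: str                                 # Thai text of the turn
    basic_type: str                                    # exactly one of the four basic types
    subtypes: List[str] = field(default_factory=list)  # tagged subtypes (of the twelve)
    pad: Optional[Tuple[float, float, float]] = None   # (pleasure, arousal, dominance), if rated

def is_valid(turn: EmolaTurn,
             basic_types: Set[str],
             subtype_map: Dict[str, Set[str]]) -> bool:
    """Check a turn against the corpus design: one known basic type, and every
    tagged subtype drawn from that basic type's subtype inventory."""
    return (turn.basic_type in basic_types and
            all(s in subtype_map.get(turn.basic_type, set()) for s in turn.subtypes))
```

Under a layout like this, the corpus-level, annotator-oriented, and actor-oriented statistics reported in the analysis reduce to straightforward aggregations over the turn records.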
Acknowledgements
This work was partially supported by a SIIT graduate student scholarship; the Center of Excellence in Intelligent Informatics, Speech and Language Technology and Service Innovation (CILS), Thammasat University; the Center of Excellence in Intelligent Informatics and Service Innovation (IISI), SIIT, Thammasat University; and the Thailand Research Fund under Grant Number RTA6080013.
Cite this article
Kasuriya, S., Theeramunkong, T., Wutiwiwatchai, C. et al. Developing a Thai emotional speech corpus from Lakorn (EMOLA). Lang Resources & Evaluation 53, 17–55 (2019). https://doi.org/10.1007/s10579-018-9428-9