Abstract
Automatic emotion recognition in human speech has recently attracted increasing attention. This paper presents an approach for recognizing negative emotions in spoken dialogs at the utterance level. Our approach comprises two main parts. First, in addition to traditional acoustic features, linguistic features based on distributed representations are extracted from text transcribed by an automatic speech recognition (ASR) system. Second, we propose a novel deep learning model, multi-feature stacked denoising autoencoders (MSDA), which fuses high-level representations of the acoustic and linguistic features along with context to classify emotions. Experimental results demonstrate that our proposed method yields an absolute improvement of 5.2% over the traditional method.
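The building block of the MSDA model is the denoising autoencoder, which learns a high-level representation by reconstructing a clean input from a corrupted copy. The following is a minimal illustrative sketch of a single masking-noise denoising-autoencoder layer with tied weights (in the style of Vincent et al.), not the paper's actual MSDA implementation; all class and parameter names here are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One denoising-autoencoder layer: corrupt input, encode, reconstruct."""

    def __init__(self, n_in, n_hidden, lr=0.1, corruption=0.3):
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))  # tied weights
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias
        self.lr = lr
        self.corruption = corruption  # fraction of inputs zeroed out

    def encode(self, x):
        """High-level representation used by the layer above the stack."""
        return sigmoid(x @ self.W + self.b)

    def step(self, x):
        """One gradient step on the squared reconstruction error."""
        # Masking noise: randomly zero a fraction of the input dimensions.
        mask = rng.random(x.shape) > self.corruption
        x_tilde = x * mask
        h = self.encode(x_tilde)
        x_hat = sigmoid(h @ self.W.T + self.c)  # decode with tied weights
        # Reconstruction error is measured against the *clean* input.
        err = x_hat - x
        loss = float(np.mean(err ** 2))
        # Backpropagate through decoder and encoder (sigmoid derivatives).
        d_xhat = 2.0 * err * x_hat * (1.0 - x_hat) / x.size
        d_h = (d_xhat @ self.W) * h * (1.0 - h)
        grad_W = x_tilde.T @ d_h + d_xhat.T @ h  # both uses of the tied W
        self.W -= self.lr * grad_W
        self.b -= self.lr * d_h.sum(axis=0)
        self.c -= self.lr * d_xhat.sum(axis=0)
        return loss
```

In a stacked setup such as the one the abstract describes, each layer would be pretrained this way and its `encode` output fed to the next layer; fusing modalities then amounts to concatenating the acoustic and linguistic representations before the top classifier.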
Acknowledgments
Our work is supported by the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015402), the National Natural Science Foundation of China (Nos. 61370117 and 61433015), and the Major National Social Science Fund of China (No. 12&ZD227).
© 2015 Springer International Publishing Switzerland
Zhang, X., Wang, H., Li, L., Zhao, M., Li, Q. (2015). Negative Emotion Recognition in Spoken Dialogs. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds.) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL/NLP-NABD 2015. Lecture Notes in Computer Science, vol. 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_9
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4