Abstract
Multimodal emotion analysis has attracted wide attention because of its many applications, such as question-answering systems. In real-world scenarios, however, people often express mixed or partial emotions toward the objects they evaluate. In this paper, we introduce a fuzzy temporal convolutional network based on contextual self-attention (CSAT-FTCN) to address these challenges. The model uses membership functions to represent such fuzzy emotions, and it captures the dependencies of a target utterance both on its own key internal information and on external contextual information, enabling a deeper understanding of emotions. For multimodal data, we further introduce an attention fusion (ATF) mechanism to capture the dependencies among different modalities. Experimental results show that CSAT-FTCN outperforms state-of-the-art models on the tested datasets, providing a novel method for multimodal emotion analysis.
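To make the two mechanisms named in the abstract concrete, the following minimal sketch illustrates (a) membership functions that map a feature vector to soft, fuzzy memberships over emotion terms and (b) an attention-based fusion of modality features. This is not the authors' implementation: the Gaussian membership form, the class and parameter names (GaussianMembership, AttentionFusion, feature size 128, six emotion terms), and the toy inputs are all assumptions made for illustration only.

```python
# Illustrative sketch only; architecture details are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianMembership(nn.Module):
    """Maps each feature vector to soft (fuzzy) memberships over K emotion terms."""
    def __init__(self, dim, num_terms):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_terms, dim))  # one prototype per term
        self.log_sigma = nn.Parameter(torch.zeros(num_terms))     # per-term spread

    def forward(self, x):                        # x: (batch, dim)
        dist = torch.cdist(x, self.centers)      # (batch, num_terms) Euclidean distances
        sigma = self.log_sigma.exp()
        # Gaussian membership in (0, 1]; closer to a prototype -> higher membership
        return torch.exp(-(dist ** 2) / (2 * sigma ** 2 + 1e-8))

class AttentionFusion(nn.Module):
    """Weights each modality's feature by a learned attention score before fusing."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats):           # list of (batch, dim) tensors
        stacked = torch.stack(modality_feats, dim=1)      # (batch, M, dim)
        weights = F.softmax(self.score(stacked), dim=1)   # attention over modalities
        return (weights * stacked).sum(dim=1)             # (batch, dim) fused feature

# Toy usage: fuse text/audio/video features, then read off fuzzy emotion memberships.
text, audio, video = (torch.randn(4, 128) for _ in range(3))
fused = AttentionFusion(128)([text, audio, video])
memberships = GaussianMembership(128, num_terms=6)(fused)
print(memberships.shape)  # torch.Size([4, 6])
```

In this reading, the membership vector plays the role of a fuzzy emotion representation (an utterance can belong partly to several emotion terms at once), while the attention weights let the fused representation depend more on the most informative modality for each sample.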






Data Availability
Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study. The datasets used in this paper are public and available.
Acknowledgements
The authors would like to thank all the reviewers for their constructive and helpful reviews.
Funding
This research is funded by the National Natural Science Foundation of China (62106136, 6190223), Natural Science Foundation of Guangdong Province (2019A1515010943), The Basic and Applied Basic Research of Colleges and Universities in Guangdong Province (Special Projects in Artificial Intelligence) (2019KZDZX1030), 2020 Li Ka Shing Foundation Cross-Disciplinary Research Grant (2020LKSFG04D), Science and Technology Major Project of Guangdong Province (STKJ2021005, STKJ202209002), and the Opening Project of Guangdong Province Key Laboratory of Information Security Technology (2020B1212060078).
Ethics declarations
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Informed consent was obtained from all individual participants included in the study.
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the Topical Collection on A Decade of Sentic Computing
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, D., Liu, H., Wei, R. et al. CSAT-FTCN: A Fuzzy-Oriented Model with Contextual Self-attention Network for Multimodal Emotion Recognition. Cogn Comput 15, 1082–1091 (2023). https://doi.org/10.1007/s12559-023-10119-6