Abstract
Social network sentiment analysis has attracted increasing attention in recent years. Twitter, one of the most popular social networking platforms, has become a major research topic in this domain. Sentiment analysis is usually treated as a classification problem. When a classifier is trained on tweet data, a large amount of noise is introduced by the brevity of tweets, marks, irregular words, and so on. In this work we explore the impact of pre-processing methods on Twitter sentiment classification. We evaluate the effects of URLs, negation, repeated letters, stemming, and lemmatization. Experimental results on the Stanford Twitter Sentiment Dataset show that classification accuracy rises when URL features are retained, negations are transformed, and repeated letters are normalized, and falls when stemming and lemmatization are applied. Moreover, we obtain a better result by augmenting the original feature space with bigram and emotion features. Applying these measures together yields a classification accuracy of 85.5%.
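The pre-processing steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact rules, so everything here is an assumption made for illustration: the `URL` placeholder token, collapsing runs of three or more identical letters to two, the small contraction rules used for negation transformation, and the function names `normalize` and `features`.

```python
import re

# Assumed patterns; the paper's own rules are not stated in the abstract.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
REPEAT_RE = re.compile(r"([a-z])\1{2,}")  # three or more of the same letter


def normalize(tweet, keep_urls=True):
    """Lowercase a tweet and apply URL, repeated-letter and negation handling."""
    text = tweet.lower().replace("\u2019", "'")  # unify curly apostrophes
    # Keep each URL as a single placeholder feature, or drop it entirely.
    text = URL_RE.sub("URL" if keep_urls else "", text)
    # Repeated-letter normalization: "coooool" becomes "cool".
    text = REPEAT_RE.sub(r"\1\1", text)
    # Negation transformation: expand contractions into an explicit "not".
    text = re.sub(r"\bwon't\b", "will not", text)
    text = re.sub(r"\bcan't\b", "can not", text)
    text = re.sub(r"n't\b", " not", text)
    return text


def features(tweet, keep_urls=True):
    """Return unigram features followed by bigram features (joined with '_')."""
    tokens = re.findall(r"[A-Za-z']+", normalize(tweet, keep_urls))
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(normalize("I can't waaaait, sooo coooool http://t.co/abc"))
# i can not waait, soo cool URL
print(features("Don't goooo"))
# ['do', 'not', 'goo', 'do_not', 'not_goo']
```

Stemming and lemmatization are left out of the sketch on purpose, since the abstract reports that both lower accuracy on this dataset.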
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Bao, Y., Quan, C., Wang, L., Ren, F. (2014). The Role of Pre-processing in Twitter Sentiment Analysis. In: Huang, DS., Jo, KH., Wang, L. (eds) Intelligent Computing Methodologies. ICIC 2014. Lecture Notes in Computer Science, vol 8589. Springer, Cham. https://doi.org/10.1007/978-3-319-09339-0_62
DOI: https://doi.org/10.1007/978-3-319-09339-0_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09338-3
Online ISBN: 978-3-319-09339-0
eBook Packages: Computer Science (R0)