Abstract
The bag-of-words representation of documents is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms.
In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. We construct several methods for including the co-occurrence information and test them on the classification of real Twitter data. Our results show that using co-occurrence information reduces the number of erroneous classifications by 14%.
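The idea of expanding a tweet's vocabulary with co-occurrence information can be illustrated with a minimal sketch. This is not the authors' method (the abstract does not specify the variants tested); it only shows one plausible form of expansion. The function names, the toy corpus, and the parameters `window`, `top_k`, and `weight` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(corpus, window=1):
    """Count how often each word appears within `window` tokens of another.

    `corpus` is a list of plain-text sentences (hypothetical external corpus).
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def expand_bag_of_words(tweet, counts, top_k=1, weight=0.5):
    """Return the tweet's bag of words plus down-weighted co-occurring words.

    For each word in the tweet, the `top_k` words that co-occur with it most
    often in the external corpus are added with a reduced weight (assumed
    scheme, chosen only to keep the example small).
    """
    bag = Counter(tweet.lower().split())
    expanded = {word: float(freq) for word, freq in bag.items()}
    for word, freq in bag.items():
        neighbours = counts.get(word, Counter()).most_common(top_k)
        for neighbour, _ in neighbours:
            if neighbour not in bag:
                expanded[neighbour] = expanded.get(neighbour, 0.0) + weight * freq
    return expanded


# Toy external corpus: "fever" co-occurs with "flu" more often than "cough".
external_corpus = ["flu fever", "flu fever", "flu cough"]
counts = cooccurrence_counts(external_corpus)
features = expand_bag_of_words("flu season", counts)
print(features)
```

In this sketch the tweet never mentions "fever", yet the expanded representation gives it a non-zero weight, so a classifier could relate the tweet to other documents that use that word.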
Copyright information
© 2017 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Hammer, H.L., Yazidi, A., Bai, A., Engelstad, P. (2017). Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. In: Maglaras, L., Janicke, H., Jones, K. (eds) Industrial Networks and Intelligent Systems. INISCOM 2016. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 188. Springer, Cham. https://doi.org/10.1007/978-3-319-52569-3_11
DOI: https://doi.org/10.1007/978-3-319-52569-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52568-6
Online ISBN: 978-3-319-52569-3
eBook Packages: Computer Science, Computer Science (R0)