Abstract
Computational analysis of linguistic data requires that texts be transformed into numeric representations. The aim of this research is to evaluate different methods for building vector representations of text documents from social media. The methods are compared with respect to their performance in a classification task. Specifically, the traditional count-based term frequency-inverse document frequency (TFIDF) representation is compared to semantic distributed word embedding representations. Unlike previous research, we investigate document representations in the context of morphologically rich Finnish. Based on the results, we suggest a framework for building vector space representations of social media texts, applicable to language technologies for morphologically rich languages. In the current study, lemmatization of tokens increased classification accuracy, while lexical filtering generally hindered performance. Finally, we report that distributed embeddings and TFIDF perform at comparable levels with our data.
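The two families of document representation compared in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The lemma table, the toy documents, and the two-dimensional embeddings are hypothetical. Averaging word vectors is assumed here as one common way to obtain a document vector from word embeddings, and the smoothed IDF formula is one common variant.

```python
import math
from collections import Counter

# Hypothetical lemma lookup standing in for a real Finnish lemmatizer.
LEMMAS = {"koirat": "koira", "kissalle": "kissa"}


def lemmatize(tokens):
    """Map each token to its lemma when the toy table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def tfidf_vectors(docs):
    """Build dense TF-IDF vectors from a list of token lists.

    Assumes raw term counts and a smoothed IDF:
    idf(t) = log((1 + N) / (1 + df(t))) + 1.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


def mean_embedding(doc, embeddings, dim):
    """Average the word vectors of the tokens that have an embedding."""
    known = [embeddings[tok] for tok in doc if tok in embeddings]
    if not known:
        return [0.0] * dim  # no known tokens: fall back to the zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy corpus; lemmatization merges "koirat" with "koira" and "kissalle" with "kissa".
raw_docs = [["koirat", "haukkuvat"], ["kissalle", "ruokaa"], ["koira", "kissa"]]
docs = [lemmatize(d) for d in raw_docs]

vocab, tfidf = tfidf_vectors(docs)

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"koira": [1.0, 0.0], "kissa": [0.0, 1.0]}
doc_embedding = mean_embedding(docs[2], toy_embeddings, dim=2)
```

Either kind of vector can then be passed to a classifier. In the TFIDF case the dimensionality equals the vocabulary size, which is why lemmatization matters for a morphologically rich language: inflected forms of the same word collapse into a single dimension.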
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Venekoski, V., Puuska, S., Vankka, J. (2016). Vector Space Representations of Documents in Classifying Finnish Social Media Texts. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2016. Communications in Computer and Information Science, vol 639. Springer, Cham. https://doi.org/10.1007/978-3-319-46254-7_42
DOI: https://doi.org/10.1007/978-3-319-46254-7_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46253-0
Online ISBN: 978-3-319-46254-7
eBook Packages: Computer Science (R0)