Abstract
In this paper we challenge the common belief that accurate subject classification of text documents requires high-dimensional feature vectors. We study the fastText algorithm in terms of its ability to find and extract clearly distinguishable characteristics of a text corpus. We compare the accuracy achieved in subject classification across feature spaces of various sizes. Finally, we attempt to explain why fastText performs so well.
Acknowledgments
This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Walkowiak, T., Datko, S., Maciejewski, H. (2020). Low-Dimensional Classification of Text Documents. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Engineering in Dependability of Computer Systems and Networks. DepCoS-RELCOMEX 2019. Advances in Intelligent Systems and Computing, vol 987. Springer, Cham. https://doi.org/10.1007/978-3-030-19501-4_53
DOI: https://doi.org/10.1007/978-3-030-19501-4_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19500-7
Online ISBN: 978-3-030-19501-4
eBook Packages: Intelligent Technologies and Robotics (R0)