Abstract
We explore the abilities of a character recurrent neural network (char-RNN) for hashtag segmentation. Our approach is as follows: to avoid any manual annotation, we generate a synthetic training dataset from frequent n-grams that satisfy predefined morpho-syntactic patterns. An active learning strategy limits the size of the training dataset by selecting an informative training subset. The approach does not require any language-specific settings and is compared across two languages that differ in their degree of inflection.
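To make the synthetic-data idea concrete, the following is a minimal sketch of how an n-gram could be turned into a character-level training example. It is not the authors' implementation: the function names `make_example` and `segment`, the lowercasing step, and the binary labeling scheme (1 for a character followed by a word boundary, 0 otherwise) are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def make_example(ngram: List[str]) -> Tuple[str, List[int]]:
    """Glue the words of an n-gram into a pseudo-hashtag with per-character labels.

    Assumed labeling scheme: 1 marks a character that is followed by a word
    boundary; every other character, including the last one, gets 0.
    """
    chars: List[str] = []
    labels: List[int] = []
    for word in ngram:
        word = word.lower()  # assumption: hashtags are treated as lowercase
        if not word:
            continue
        if labels:
            # The previous word has ended, so mark its last character.
            labels[-1] = 1
        chars.extend(word)
        labels.extend([0] * len(word))
    return "".join(chars), labels


def segment(hashtag: str, labels: List[int]) -> List[str]:
    """Split a hashtag back into words using (gold or predicted) boundary labels."""
    words: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            words.append(hashtag[start:i + 1])
            start = i + 1
    if start < len(hashtag):
        words.append(hashtag[start:])
    return words


text, labels = make_example(["machine", "learning"])
print(text, labels, segment(text, labels))
```

Under this scheme a character-level sequence tagger would be trained to predict `labels` from `text`, and `segment` would be applied to its predictions at inference time.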
Notes
- 3. The test data is available at: https://github.com/glushkovato/hashtag_segmentation.
Acknowledgements
The paper was prepared within the framework of the HSE University Basic Research Program and funded by the Russian Academic Excellence Project '5-100'.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Glushkova, T., Artemova, E. (2023). Char-RNN and Active Learning for Hashtag Segmentation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_12
DOI: https://doi.org/10.1007/978-3-031-24337-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science, Computer Science (R0)