Abstract
Twitter has become one of the critical channels for disseminating up-to-date information. Because the volume of tweets can be huge, an automatic system for analyzing them is desirable. The obstacle is that Twitter users often invent new words using non-standard rules, and these words appear in bursts within a short period of time. Existing new word detection methods cannot identify them effectively, and even when the new words can be identified, their meanings are difficult to understand. In this paper, we focus on Chinese Twitter, where the absence of natural word delimiters in a sentence makes the problem more difficult. To solve the problem, we first introduce a method for detecting new words in Chinese Twitter using a statistical approach that does not rely on training data, the availability of which is limited. We then derive two tagging algorithms, based on word distance and word vector angle respectively, that tag these new words using known words, which provides a basis for subsequent automatic interpretation. We show the effectiveness of our algorithms using real Twitter data. Although we focus on Chinese, the approach could be applied to other Kanji-based languages.
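The idea of tagging a new word by the angle between word vectors can be illustrated with a minimal sketch. The abstract does not say how the vectors are built, so the sketch assumes plain co-occurrence counts over tweets that are already segmented into tokens. The function names (`context_vector`, `cosine`, `tag_new_word`) and the toy data are hypothetical and are not taken from the chapter.

```python
import math
from collections import Counter


def context_vector(word, docs, known_words):
    """Count how often each known word shares a tweet with `word`.

    Assumption: co-occurrence counts stand in for the word vectors;
    the chapter's actual construction may differ.
    """
    counts = Counter()
    for tokens in docs:
        if word in tokens:
            for token in tokens:
                if token != word and token in known_words:
                    counts[token] += 1
    return counts


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tag_new_word(new_word, docs, known_words, top_k=2):
    """Return up to `top_k` known words whose vectors have the smallest angle to the new word's."""
    target = context_vector(new_word, docs, known_words)
    scores = []
    for word in known_words:
        vec = context_vector(word, docs, known_words)
        scores.append((cosine(target, vec), word))
    # Highest cosine first; ties are broken alphabetically for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [word for score, word in scores[:top_k] if score > 0]


# Toy, pre-segmented "tweets" (hypothetical data for illustration only).
docs = [
    ["newword", "phone", "screen"],
    ["newword", "phone", "battery"],
    ["tablet", "phone", "screen"],
    ["tablet", "battery"],
    ["soup", "noodle"],
]
known_words = {"phone", "screen", "battery", "tablet", "soup", "noodle"}

print(tag_new_word("newword", docs, known_words))
```

A smaller angle (larger cosine) means the new word appears in contexts similar to those of the known word, which is why known words from unrelated contexts receive a score of zero and are never returned as tags.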
Notes
1. Also known as Out-of-Vocabulary (OOV) detection.
Copyright information
© 2017 Springer-Verlag GmbH Germany
About this chapter
Cite this chapter
Liang, Y., Yin, P., Yiu, S.M. (2017). New Word Detection and Tagging on Chinese Twitter Stream. In: Hameurlain, A., Küng, J., Wagner, R., Madria, S., Hara, T. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII. Lecture Notes in Computer Science, vol 10420. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55608-5_4
DOI: https://doi.org/10.1007/978-3-662-55608-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-55607-8
Online ISBN: 978-3-662-55608-5
eBook Packages: Computer Science, Computer Science (R0)