Abstract
We address the problem of segmenting Chinese text into words and propose a segmentation algorithm based on a trigram statistical language model. We discuss why statistical language models are well suited to Chinese word segmentation. In particular, we address the search problem that often leads to low performance when a trigram model is used. Finally, we discuss out-of-vocabulary (OOV) word identification and integrate it into the trigram-based method to improve segmentation accuracy.
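To make the idea concrete, here is a minimal sketch of trigram-based segmentation. It is not the algorithm from the paper, whose details are not given in the abstract. It assumes a toy add-k smoothed trigram model trained on a few hand-segmented sentences, a Viterbi-style dynamic program over (previous word, last word) states as the search procedure, and a crude OOV fallback that always allows single characters as words. The names `train_trigram`, `segment`, `max_word_len`, and `k` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary symbols (assumed convention)


def train_trigram(sentences, k=0.1):
    """Build an add-k smoothed trigram scorer from pre-segmented sentences."""
    tri, ctx, vocab = Counter(), Counter(), {EOS}
    for words in sentences:
        padded = [BOS, BOS, *words, EOS]
        vocab.update(words)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[a, b, c] += 1
            ctx[a, b] += 1
    size = len(vocab)

    def logprob(a, b, c):
        # log P(c | a, b) with add-k smoothing; unseen events get a small mass.
        return math.log((tri[a, b, c] + k) / (ctx[a, b] + k * size))

    return logprob, vocab


def segment(text, logprob, vocab, max_word_len=4):
    """Return the word sequence with the highest trigram log-probability."""
    n = len(text)
    # best[pos] maps (previous word, last word) -> (score, back-pointer)
    best = [dict() for _ in range(n + 1)]
    best[0][(BOS, BOS)] = (0.0, None)
    for start in range(n):
        for (u, v), (score, _) in list(best[start].items()):
            for end in range(start + 1, min(n, start + max_word_len) + 1):
                w = text[start:end]
                # Assumption: single characters are always allowed (OOV fallback).
                if w not in vocab and end - start > 1:
                    continue
                cand = score + logprob(u, v, w)
                key = (v, w)
                if key not in best[end] or cand > best[end][key][0]:
                    best[end][key] = (cand, (start, (u, v)))
    # Pick the final state that scores best once the sentence is closed.
    state = max(best[n], key=lambda s: best[n][s][0] + logprob(s[0], s[1], EOS))
    words, pos = [], n
    while best[pos][state][1] is not None:
        words.append(state[1])
        pos, state = best[pos][state][1]
    return words[::-1]


# Toy corpus: the string below can start with either "研究" or "研究生".
corpus = [["研究", "生命", "起源"], ["研究生", "学习"], ["生命", "科学"]]
logprob, vocab = train_trigram(corpus)
print(segment("研究生命起源", logprob, vocab))
```

Because each hypothesis is summarized by its last two words, the number of states per position stays bounded by the number of candidate word pairs ending there, which is one common way to keep trigram search tractable. A real system would replace the toy model with one estimated from a large corpus and use a proper OOV component.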
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mao, J., Cheng, G., He, Y., Xing, Z. (2007). A Trigram Statistical Language Model Algorithm for Chinese Word Segmentation. In: Preparata, F.P., Fang, Q. (eds) Frontiers in Algorithmics. FAW 2007. Lecture Notes in Computer Science, vol 4613. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73814-5_26
DOI: https://doi.org/10.1007/978-3-540-73814-5_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73813-8
Online ISBN: 978-3-540-73814-5
eBook Packages: Computer Science, Computer Science (R0)