Abstract
While distributed representations have proven very successful in a variety of NLP tasks, learning distributed representations for agglutinative languages such as Uyghur still faces a major challenge: most words are composed of many morphemes and occur only once in the training data. To address this data sparsity problem, we propose an approach to learn distributed representations of Uyghur words and morphemes from unlabeled data. The central idea is to treat morphemes rather than words as the basic unit of representation learning. We annotate a Uyghur word similarity dataset and show that our approach achieves significant improvements over CBOW, a state-of-the-art model for computing vector representations of words.
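The core idea of the abstract can be sketched in a few lines: train CBOW-style embeddings over a stream of morphemes instead of words, then compose a vector for any (possibly unseen or rare) word from its morpheme vectors. The sketch below is illustrative only, not the paper's implementation: the toy corpus, its segmentations, the softmax-based CBOW update, and the mean-composition function are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical morpheme-segmented corpus: each word is a tuple of morphemes.
# Segmentations here are illustrative, not real Uyghur morphological analyses.
corpus = [
    ("kitab",), ("kitab", "lar"), ("kitab", "im"),
    ("oqu",), ("oqu", "ghuchi"), ("oqu", "ghuchi", "lar"),
]

# Flatten into a morpheme stream and build the morpheme vocabulary.
stream = [m for word in corpus for m in word]
vocab = sorted(set(stream))
idx = {m: i for i, m in enumerate(vocab)}

rng = np.random.default_rng(0)
dim, window, lr = 16, 2, 0.05
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input (morpheme) vectors
W_out = np.zeros((len(vocab), dim))                    # output vectors

def train_step(center, context):
    """One CBOW update: average the context vectors, softmax over the vocab."""
    h = W_in[context].mean(axis=0)
    scores = W_out @ h
    p = np.exp(scores - scores.max())
    p /= p.sum()
    p[center] -= 1.0                        # gradient of cross-entropy loss
    W_out[:] -= lr * np.outer(p, h)
    W_in[context] -= lr * (W_out.T @ p) / len(context)

for _ in range(200):
    for t, _m in enumerate(stream):
        ctx = [idx[stream[j]]
               for j in range(max(0, t - window), min(len(stream), t + window + 1))
               if j != t]
        if ctx:
            train_step(idx[stream[t]], ctx)

def word_vector(morphemes):
    """Compose a word vector as the mean of its morpheme vectors."""
    return W_in[[idx[m] for m in morphemes]].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rare inflected forms still receive vectors via their shared morphemes.
sim = cosine(word_vector(("kitab", "lar")), word_vector(("kitab", "im")))
```

Because every inflected form shares morphemes with other forms of the same stem, a word seen only once (or never) can still be assigned a meaningful vector, which is the point of shifting the basic unit from words to morphemes.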
References
Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
Botha, J.A., Blunsom, P.: Compositional morphology for word representations and language modelling. In: Proceedings of ICML (2014)
Chen, X., Xu, L., Liu, Z., Sun, M., Luan, H.: Joint learning of character and word embeddings. In: Proceedings of IJCAI (2015)
Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4(1), article 3 (2007)
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Sloan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20(1), 116–131 (2002)
Huang, E., Socher, R., Manning, C.D., Ng, A.Y.: Improving word representations via global context and multiple word prototypes. In: Proceedings of ACL (2012)
Lazaridou, A., Marelli, M., Zamparelli, R., Baroni, M.: Compositionally derived representations of morphologically complex words in distributional semantics. In: Proceedings of ACL (2013)
Luong, M.T., Socher, R., Manning, C.D.: Better word representations with recursive neural networks for morphology. In: Proceedings of CoNLL (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (2013)
Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. In: Proceedings of ICML (2007)
Mnih, A., Hinton, G.: A scalable hierarchical distributed language model. In: Proceedings of NIPS (2008)
Qiu, S., Cui, Q., Bian, J., Gao, B., Liu, T.Y.: Co-learning of word representations and morpheme representations. In: Proceedings of COLING (2014)
Acknowledgments
This research is supported by the National Key Basic Research Program of China (973 Program 2014CB340500), the National Natural Science Foundation of China (No. 61331013), the National Key Technology R&D Program (No. 2014BAK10B03), and the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative, administered by the IDM Programme. We are grateful to Meiping Dong, Lei Xu, Liner Yang, Yu Zhao, Yankai Lin, Chunyang Liu, Shiqi Shen, and Meng Zhang for their constructive feedback on an early draft of this paper.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Abudukelimu, H., Liu, Y., Chen, X., Sun, M., Abulizi, A. (2015). Learning Distributed Representations of Uyghur Words and Morphemes. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL/NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4
eBook Packages: Computer Science (R0)