Abstract
We perform an empirical evaluation of several methods of low-rank approximation in the problem of obtaining PMI-based word embeddings. All word vectors were trained on parts of a large corpus extracted from English Wikipedia (enwik9), which was divided into two equal-sized datasets from which PMI matrices were obtained. A repeated-measures design was used to assign a method of low-rank approximation (SVD, NMF, QR) and a dimensionality of the vectors (250, 500) to each of the PMI matrix replicates. Our experiments show that word vectors obtained from the truncated SVD achieve the best performance on two downstream tasks, similarity and analogy, compared to the other two low-rank approximation methods.
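The pipeline described in the abstract can be illustrated with a short sketch. The code below is a minimal illustration, not the authors' implementation: it assumes a precomputed nonnegative PPMI matrix (NMF cannot factor matrices with negative entries; the paper may use a different PMI variant), relies on SciPy's svds and qr and scikit-learn's NMF (both libraries appear in the references), and adopts the common \(W = U_k \sqrt{\Sigma_k}\) convention for SVD-based vectors; taking the first \(k\) columns of the pivoted-QR factor as word vectors is likewise just one possible convention.

```python
import numpy as np
from scipy.linalg import qr
from scipy.sparse.linalg import svds
from sklearn.decomposition import NMF


def embeddings(ppmi, k=250):
    """Return k-dimensional word vectors from a nonnegative PPMI matrix
    using the three low-rank approximation methods compared in the paper."""
    # Truncated SVD: by Eckart-Young (1936), the best rank-k approximation
    # in Frobenius norm; word vectors taken as U_k * sqrt(singular values).
    u, s, _ = svds(ppmi, k=k)
    w_svd = u * np.sqrt(s)

    # NMF: requires nonnegative input, hence the PPMI matrix; the factor W
    # returned by fit_transform serves as the word-vector matrix.
    w_nmf = NMF(n_components=k, init="nndsvd", max_iter=200).fit_transform(ppmi)

    # Column-pivoted QR: the first k columns of Q form an orthonormal basis
    # for a rank-k approximation of the (column-permuted) PPMI matrix; here
    # they are used directly as word vectors.
    q, _, _ = qr(ppmi, mode="economic", pivoting=True)
    w_qr = q[:, :k]

    return w_svd, w_nmf, w_qr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for a PPMI matrix: small and nonnegative.
    toy_ppmi = np.maximum(rng.standard_normal((300, 300)), 0.0)
    w_svd, w_nmf, w_qr = embeddings(toy_ppmi, k=50)
    print(w_svd.shape, w_nmf.shape, w_qr.shape)  # (300, 50) each
```

For a vocabulary of size \(|V|\), each returned matrix has shape \(|V| \times k\), one row per word, and can be evaluated directly on the similarity and analogy tasks.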
Notes
- 1. Assume that words have already been converted into integer indices.
- 2. \(\textbf{A}_{a:b,c:d}\) is the submatrix located at the intersection of rows \(a, a+1, \ldots, b\) and columns \(c, c+1, \ldots, d\) of a matrix \(\textbf{A}\) (see the worked example after these notes).
- 3.
- 4. The isotropy is motivated by the work of Arora et al. (2016); \(\mathbf{4.5}\) is a vector with all elements equal to 4.5.
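As a concrete illustration of the submatrix notation in note 2 (the matrix \(\textbf{A}\) below is an arbitrary example, not taken from the paper):

\[
\textbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\quad\Longrightarrow\quad
\textbf{A}_{1:2,\,2:3} = \begin{pmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{pmatrix}.
\]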
References
Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A.: A latent variable model approach to PMI-based word embeddings. Trans. Assoc. Comput. Linguist. 4, 385–399 (2016)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (TOMS) 38(1), 1 (2011)
Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1(3), 211–218 (1936)
Finkelstein, L., et al.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20(1), 116–131 (2002)
Jones, E., Oliphant, T., Peterson, P.: SciPy: open source scientific tools for Python (2014)
Kishore Kumar, N., Schneider, J.: Literature survey on low rank approximation of matrices. Linear Multilinear Algebra 65(11), 2212–2244 (2017)
Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems, pp. 2177–2185 (2014)
Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
Acknowledgement
The work of Zhenisbek Assylbekov has been funded by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan, contract # 346/018-2018/33-28, IRN AP05133700.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Sorokina, A., Karipbayeva, A., Assylbekov, Z. (2023). Low-Rank Approximation of Matrices for PMI-Based Word Embeddings. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_7
DOI: https://doi.org/10.1007/978-3-031-24337-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0