Abstract
Many information retrieval algorithms rely on a good distance measure that makes it possible to compare objects of different kinds efficiently. Recently, a promising new metric called Word Mover’s Distance was proposed to measure the divergence between text passages. In this paper, we demonstrate that this metric can be extended in two ways: by incorporating term-weighting schemes, and by using entropic regularization to provide more accurate and computationally efficient matching between documents. We evaluate the benefits of both extensions on the task of cross-lingual document retrieval (CLDR). Our experimental results on eight CLDR problems suggest that the proposed methods achieve remarkable improvements in terms of Mean Reciprocal Rank compared to several baselines.
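As a reminder of how the evaluation measure named above is computed, the following is a minimal sketch of Mean Reciprocal Rank. It assumes that each query has exactly one correct document, which is not stated in the abstract; the function name `mean_reciprocal_rank` and the toy document identifiers are hypothetical and are not taken from the released code.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average over queries of 1 / (rank of the correct document).

    ranked_lists: dict mapping a query id to its ranked list of document ids.
    relevant: dict mapping a query id to its single correct document id
              (one relevant document per query is an assumption of this sketch).
    """
    total = 0.0
    for query_id, ranking in ranked_lists.items():
        target = relevant[query_id]
        # Ranks are 1-based; a query whose target is never retrieved adds 0.
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(ranked_lists)


# Toy data, invented for illustration only.
rankings = {
    "q1": ["d3", "d1", "d7"],  # correct document found at rank 2
    "q2": ["d5", "d2", "d9"],  # correct document found at rank 1
    "q3": ["d4", "d8", "d6"],  # correct document found at rank 3
}
gold = {"q1": "d1", "q2": "d5", "q3": "d6"}
mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```

On this toy input the score is (1/2 + 1 + 1/3) / 3 = 11/18.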
Notes
- 1.
We use the Numberbatch embeddings presented in our experiments (Sect. 4).
- 2.
- 3.
We release the code at: https://github.com/balikasg/WassersteinRetrieval.
- 4.
For Entro_Wass we used the sinkhorn2 function with the arguments reg=0.1, numItermax=50 and method='sinkhorn_stabilized' to prevent numerical errors.
- 5.
The v17.06 vectors: https://github.com/commonsense/conceptnet-numberbatch.
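To illustrate the entropic regularization mentioned in note 4, here is a dependency-free sketch of the basic Sinkhorn scaling iterations. It is not the authors' implementation and it is not the stabilized variant that note 4 refers to; it is the plain algorithm, with `reg=0.1` and 50 iterations chosen to mirror the values in the note. The function name `sinkhorn_cost`, the toy histograms, and the cost matrix are assumptions made for illustration.

```python
import math


def sinkhorn_cost(a, b, cost, reg=0.1, num_iter=50):
    """Entropy-regularized transport between histograms a and b.

    Plain Sinkhorn scaling (hypothetical helper, not the stabilized variant).
    Returns the transport cost <P, C> and the transport plan P.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / reg); a small reg makes these entries tiny,
    # which is the source of the numerical issues stabilization addresses.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iter):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


# Toy example: term weights of two short documents (each sums to 1) and a
# made-up matrix of distances between their word embeddings.
doc_a = [0.7, 0.3]
doc_b = [0.4, 0.6]
C = [[0.1, 0.9],
     [0.8, 0.2]]
total, plan = sinkhorn_cost(doc_a, doc_b, C)
print(total)
```

The histograms `doc_a` and `doc_b` play the role of the term weights: replacing uniform or raw-frequency weights with another weighting scheme only changes these two vectors, while the iterations stay the same.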
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Balikas, G., Laclau, C., Redko, I., Amini, MR. (2018). Cross-Lingual Document Retrieval Using Regularized Wasserstein Distance. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_30
DOI: https://doi.org/10.1007/978-3-319-76941-7_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76940-0
Online ISBN: 978-3-319-76941-7
eBook Packages: Computer Science, Computer Science (R0)