Cross-Lingual Document Retrieval Using Regularized Wasserstein Distance

  • Conference paper
Advances in Information Retrieval (ECIR 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10772)

Abstract

Many information retrieval algorithms rely on a good distance measure that allows objects of different kinds to be compared efficiently. Recently, a promising new metric called Word Mover’s Distance was proposed to measure the divergence between text passages. In this paper, we demonstrate that this metric can be extended to incorporate term-weighting schemes and that entropic regularization makes the matching between documents both more accurate and more computationally efficient. We evaluate the benefits of both extensions in the task of cross-lingual document retrieval (CLDR). Our experimental results on eight CLDR problems suggest that the proposed methods achieve substantial improvements in terms of Mean Reciprocal Rank compared to several baselines.
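
The evaluation measure named in the abstract, Mean Reciprocal Rank, can be illustrated with a minimal Python sketch. It assumes that each query document has exactly one correct counterpart in the other language, which is an assumption made here for illustration. The function name and the toy document ids are hypothetical and are not taken from the paper's code.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct document over all queries.

    ranked_lists[q] holds candidate ids for query q, best match first.
    targets[q] is the id of the single correct document (assumption).
    """
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            # list.index is 0-based, ranks are 1-based
            total += 1.0 / (ranking.index(target) + 1)
        # a query whose target was not retrieved contributes 0
    return total / len(ranked_lists)


# Toy example with three queries (hypothetical ids).
rankings = [["d2", "d1", "d3"], ["d1", "d2"], ["d3", "d1"]]
targets = ["d1", "d1", "d2"]
mrr = mean_reciprocal_rank(rankings, targets)
```

In this toy run the correct documents sit at rank 2, at rank 1, and outside the retrieved list, so the three reciprocal ranks are 1/2, 1, and 0.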

Notes

  1. We use the Numberbatch embeddings presented in our experiments (Sect. 4).

  2. http://linguatools.org/tools/corpora/wikipedia-comparable-corpora/

  3. We release the code at: https://github.com/balikasg/WassersteinRetrieval

  4. For Entro_Wass we used the sinkhorn2 function with the arguments reg=0.1, numItermax=50, method='sinkhorn_stabilized' to prevent numerical errors.

  5. The v17.06 vectors: https://github.com/commonsense/conceptnet-numberbatch
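
Note 4 refers to an entropy-regularized transport solver. As a rough illustration of the idea, the following is a minimal pure-Python sketch of the plain Sinkhorn scaling iteration. It is not the POT library implementation and not the 'sinkhorn_stabilized' variant named in the note. The function name sinkhorn_cost, the toy histograms, and the cost matrix are assumptions made for this example; the defaults reg=0.1 and num_iter=50 simply mirror the values quoted in note 4.

```python
import math


def sinkhorn_cost(a, b, cost, reg=0.1, num_iter=50):
    """Entropy-regularized transport cost between histograms a and b.

    a, b: non-negative weights that each sum to 1 (e.g. term weights).
    cost: cost[i][j] is the distance between word i of a and word j of b.
    Plain Sinkhorn scaling, without log-domain stabilization.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iter):
        # rescale rows so that the plan's row sums match a
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # rescale columns so that the plan's column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # transport plan diag(u) K diag(v) and its total cost
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


# Toy example: two documents of two words each, uniform weights,
# where each word has an exact counterpart at distance 0.
a = [0.5, 0.5]
b = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]
distance, plan = sinkhorn_cost(a, b, cost)
```

With small values of reg the kernel entries can underflow in this plain form for larger costs, which is consistent with the note's remark that a stabilized method was chosen to prevent numerical errors.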

Author information

Corresponding author

Correspondence to Georgios Balikas.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Balikas, G., Laclau, C., Redko, I., Amini, MR. (2018). Cross-Lingual Document Retrieval Using Regularized Wasserstein Distance. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_30

  • DOI: https://doi.org/10.1007/978-3-319-76941-7_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76940-0

  • Online ISBN: 978-3-319-76941-7

  • eBook Packages: Computer Science, Computer Science (R0)
