Wikipedia Text Reuse: Within and Without

  • Conference paper
Advances in Information Retrieval (ECIR 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11437)

Abstract

We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl). To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline. We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million reuse cases outside Wikipedia that we discovered. Text reuse inside Wikipedia gives rise to new tasks such as article template induction, fixing quality flaws, or complementing Wikipedia’s ontology. Text reuse outside Wikipedia yields a tangible metric for the emerging field of quantifying Wikipedia’s influence on the web. To foster future research into these tasks, and for reproducibility’s sake, the Wikipedia text reuse corpus and the retrieval pipeline are made freely available.

Notes

  1. en.wikipedia.org/wiki/Wikipedia:Copyrights.

  2. en.wikipedia.org/wiki/Wikipedia:Copyright_violations.

  3. en.wikipedia.org/wiki/Academic_studies_about_Wikipedia.

  4. We used word 3-gram seeds, extended via DBSCAN clustering (\(\varepsilon = 150\), \(\mathrm{minPoints} = 5\)), and filtered cases shorter than 200 words or with cosine similarity \(< 0.5\).

  5. The top three being wikia.com (563), rediff.com (55), and un.org (28 reusing pages).

  6. en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content.

  7. monetizepros.com/cpm-rate-guide/display/.

  8. foundation.wikimedia.org/wiki/2016-2017_Fundraising_Report.

  9. github.com/webis-de/ECIR-19, webis.de/data/webis-wikipedia-text-reuse-18.html.
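
The alignment settings in note 4 can be illustrated with a minimal sketch. It is not the authors' pipeline (that is linked in note 9), and it rests on several assumptions the note does not state: seeds are pairs of word offsets (i, j) where the same word 3-gram starts in both texts, DBSCAN clusters them by Euclidean distance, the 200-word minimum applies to both passages, and cosine similarity is computed over raw term counts. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n=3):
    """Word n-grams as tuples, in order of their start offset."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def seeds(a, b, n=3):
    """All (i, j) where the same word n-gram starts at a[i] and b[j]."""
    index = defaultdict(list)
    for j, gram in enumerate(ngrams(b, n)):
        index[gram].append(j)
    return [(i, j) for i, gram in enumerate(ngrams(a, n)) for j in index.get(gram, [])]


def dbscan(points, eps, min_points):
    """Plain O(n^2) DBSCAN; returns clusters as lists of points, noise dropped."""
    def neighbors(p):
        return [q for q in range(len(points))
                if math.dist(points[p], points[q]) <= eps]

    labels, clusters = {}, []
    for p in range(len(points)):
        if p in labels:
            continue
        near = neighbors(p)
        if len(near) < min_points:
            labels[p] = -1  # provisional noise; may become a border point later
            continue
        cid = len(clusters)
        clusters.append([points[p]])
        labels[p] = cid
        queue = list(near)
        while queue:
            q = queue.pop()
            if labels.get(q, -1) != -1:
                continue  # already assigned to a cluster
            labels[q] = cid
            clusters[cid].append(points[q])
            near_q = neighbors(q)
            if len(near_q) >= min_points:  # q is a core point: keep expanding
                queue.extend(near_q)
    return clusters


def cosine(a, b):
    """Cosine similarity of two token lists over raw term counts (assumption)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def detect(a, b, n=3, eps=150, min_points=5, min_words=200, min_cos=0.5):
    """Seed, cluster, and filter; returns ((a_start, a_end), (b_start, b_end)) spans."""
    cases = []
    for cluster in dbscan(seeds(a, b, n), eps, min_points):
        a_lo, a_hi = min(i for i, _ in cluster), max(i for i, _ in cluster) + n
        b_lo, b_hi = min(j for _, j in cluster), max(j for _, j in cluster) + n
        pa, pb = a[a_lo:a_hi], b[b_lo:b_hi]
        # Assumption: the length threshold must hold for both passages.
        if min(len(pa), len(pb)) >= min_words and cosine(pa, pb) >= min_cos:
            cases.append(((a_lo, a_hi), (b_lo, b_hi)))
    return cases
```

Calling `detect(tokens_a, tokens_b)` on two whitespace-tokenized texts returns the aligned word spans that survive both filters. The quadratic neighbor search keeps the sketch short; anything at Wikipedia scale would need an indexed neighbor lookup and a distributed candidate retrieval step in front of it, as the abstract describes.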

Author information

Corresponding authors

Correspondence to Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen or Martin Potthast.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Alshomary, M. et al. (2019). Wikipedia Text Reuse: Within and Without. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11437. Springer, Cham. https://doi.org/10.1007/978-3-030-15712-8_49

  • DOI: https://doi.org/10.1007/978-3-030-15712-8_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15711-1

  • Online ISBN: 978-3-030-15712-8

  • eBook Packages: Computer Science, Computer Science (R0)
