Webis Wikipedia-IPC
Published February 8, 2023 | Version v1
Dataset Open

  • 1. Bauhaus-Universität Weimar
  • 2. Friedrich-Schiller-Universität Jena
  • 3. Leipzig University and ScaDS.AI

Description

When an image is reused on the Web, it is often given a new, original caption. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To test this idea, we analyzed captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The resulting dataset, Wikipedia-IPC (Image caption Paraphrase Corpus), consists of pairs of captions for the same image that paraphrase each other. It contains 30,237 gold-quality, 229,877 silver-quality, and 656,560 bronze-quality paraphrase pairs.
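
The pairing idea can be illustrated with a minimal sketch: group captions by image and enumerate every unordered pair of distinct captions as a paraphrase candidate. This is not the authors' extraction pipeline. The function name `caption_pairs`, the `(image_id, caption)` input format, and the toy captions are assumptions made for illustration, and no gold/silver/bronze quality grading is attempted.

```python
from collections import defaultdict
from itertools import combinations


def caption_pairs(records):
    """Return (image_id, caption_a, caption_b) for every unordered pair of
    distinct captions attached to the same image.

    `records` is an iterable of (image_id, caption) tuples; this input
    format is a hypothetical one chosen for the sketch.
    """
    by_image = defaultdict(set)
    for image_id, caption in records:
        # A set drops exact duplicates of the same caption for one image.
        by_image[image_id].add(caption.strip())

    pairs = []
    for image_id in sorted(by_image):
        # Sorting makes the output order deterministic.
        for a, b in combinations(sorted(by_image[image_id]), 2):
            pairs.append((image_id, a, b))
    return pairs


# Toy input (invented captions, not taken from the dataset).
toy_records = [
    ("tower.jpg", "The tower at night"),
    ("tower.jpg", "Night view of the tower"),
    ("tower.jpg", "The tower illuminated after dark"),
    ("tower.jpg", "The tower at night"),  # exact duplicate, ignored
    ("bridge.jpg", "The bridge in 2010"),  # single caption, no pair
]
toy_pairs = caption_pairs(toy_records)
```

An image with n distinct captions yields n(n-1)/2 candidate pairs, so the three distinct captions of `tower.jpg` give three pairs and `bridge.jpg` gives none.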

Notes

The bronze-quality pairs will be released soon.

Files

Name: wikipedia-ipc.zip
Size: 50.2 MB
MD5: 7a2b7ecb9bd13546eb9be67787e8939c
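
After downloading, the archive can be checked against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are chosen for illustration, and the local path passed to them is assumed to be wherever the archive was saved.

```python
import hashlib

# Checksum published on this page for wikipedia-ipc.zip.
EXPECTED_MD5 = "7a2b7ecb9bd13546eb9be67787e8939c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so the whole archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `verify("wikipedia-ipc.zip")` returns `True` only if the local file matches the published checksum. A mismatch usually indicates an incomplete or corrupted download.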