Abstract
We present a dataset for learning to rank in the medical domain, consisting of thousands of full-text queries that are linked to thousands of research articles. The queries are taken from health topics described in layman’s English on the non-commercial www.NutritionFacts.org website; relevance links are extracted at three levels from direct and indirect links of queries to research articles on PubMed. We demonstrate that ranking models trained on this dataset outperform standard bag-of-words retrieval models by a wide margin. The dataset can be downloaded from: www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/.
Notes
- 1.
- 2. For example, the USPTO and EPO provide specialized patent search facilities at www.uspto.gov/patents/process/search and www.epo.org/searching.html.
- 3.
- 4.
- 5.
- 6. BM25 parameters were set to \(k_1 = 1.2\), \(b = 0.75\).
- 7. Preprocessing included lowercasing, tokenizing, filtering punctuation and stop-words, and replacing numbers with a special token.
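The preprocessing and BM25 settings mentioned in the notes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the `<num>` placeholder, the regular-expression tokenizer, the function names, and the non-negative idf variant are all assumptions made for illustration. Only the steps themselves (lowercasing, tokenizing, filtering punctuation and stop-words, replacing numbers) and the values k_1 = 1.2, b = 0.75 come from the text.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source does not say which list was used.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on", "with"}
# Hypothetical placeholder; the source only says "a special token".
NUM_TOKEN = "<num>"


def preprocess(text):
    """Lowercase, tokenize, drop punctuation and stop-words, map numbers to NUM_TOKEN."""
    # Keep runs of ASCII letters and (possibly decimal) numbers; everything else,
    # including punctuation, acts as a separator and is discarded.
    tokens = re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())
    out = []
    for tok in tokens:
        if tok[0].isdigit():
            out.append(NUM_TOKEN)
        elif tok not in STOP_WORDS:
            out.append(tok)
    return out


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Assumes a non-empty collection. The idf uses the common
    log(1 + (N - df + 0.5) / (df + 0.5)) form, which is an assumption here.
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) with document-length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents and query, invented for this example.
docs = [
    "Vitamin D and bone health",
    "Broccoli reduces cancer risk in 3 trials",
    "Bone density and calcium",
]
docs_tokens = [preprocess(d) for d in docs]
query_tokens = preprocess("bone health")
scores = bm25_scores(query_tokens, docs_tokens)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy collection the first document matches both query terms and therefore ranks above the third, which matches only one; the second document shares no term with the query and receives a score of zero.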
Acknowledgments
We are grateful to Dr. Michael Greger for permission to crawl www.NutritionFacts.org. This research was supported in part by DFG grant RI-2221/1-2 “Weakly Supervised Learning of Cross-Lingual Systems”.
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Boteva, V., Gholipour, D., Sokolov, A., Riezler, S. (2016). A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_58
DOI: https://doi.org/10.1007/978-3-319-30671-1_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30670-4
Online ISBN: 978-3-319-30671-1
eBook Packages: Computer Science, Computer Science (R0)