Abstract
Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For such a user, querying a historic document collection may be a disappointing experience: queries in modern language may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th-century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows. First, we are able to automatically construct rules for modernizing historic language by comparing historic and modern corpora on (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. These improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
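To make finding (c) more concrete, the following Python sketch compares the relative frequency of character n-grams in a historic and a modern word list, and reports the n-grams that are much more frequent on the historic side. It is an illustration only, not the authors' rule-construction procedure: the function names, the `#` boundary marker, the `min_ratio` threshold, the `smoothing` constant, and the toy word lists are all assumptions introduced here.

```python
from collections import Counter


def ngram_rel_freq(words, n):
    """Return the relative frequency of each character n-gram in a word list."""
    counts = Counter()
    for word in words:
        padded = f"#{word.lower()}#"  # '#' marks word boundaries (assumed convention)
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def typical_historic_ngrams(historic_words, modern_words, n=2,
                            min_ratio=2.0, smoothing=1e-6):
    """List n-grams that are relatively more frequent in the historic words.

    The score is historic frequency divided by (modern frequency + smoothing);
    both min_ratio and smoothing are arbitrary, hypothetical settings.
    """
    historic = ngram_rel_freq(historic_words, n)
    modern = ngram_rel_freq(modern_words, n)
    scored = [
        (gram, freq / (modern.get(gram, 0.0) + smoothing))
        for gram, freq in historic.items()
    ]
    kept = [(gram, ratio) for gram, ratio in scored if ratio >= min_ratio]
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy word lists made up for illustration; they are not taken from the paper's corpus.
historic_sample = ["volck", "ick", "sulcks", "veel"]
modern_sample = ["volk", "ik", "zulks", "veel"]

candidates = typical_historic_ngrams(historic_sample, modern_sample)
for gram, ratio in candidates[:5]:
    print(gram, round(ratio, 1))
```

On these toy lists the bigram `ck`, which never occurs in the modern sample, receives the highest score. In a fuller pipeline such a sequence would be a candidate left-hand side for a rewrite rule (for example `ck` to `k`), whose right-hand side would still have to be selected and validated by other means.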
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Koolen, M., Adriaans, F., Kamps, J., de Rijke, M. (2006). A Cross-Language Approach to Historic Document Retrieval. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_36
DOI: https://doi.org/10.1007/11735106_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)