Abstract
String distance metrics have been widely used in various applications concerning processing of textual data. This paper reports on the exploration of their usability for tackling the reference matching task and for the automatic correction of misspelled search engine queries, in the context of highly inflective languages, in particular focusing on Polish. The results of numerous experiments in different scenarios are presented and they revealed some preferred metrics. Surprisingly good results were observed for correcting misspelled search engine queries. Nevertheless, a more in-depth analysis is necessary to achieve improvements. The work reported here constitutes a good point of departure for further research on this topic.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Morton, T.: Coreference for NLP Applications. In: Proceedings of ACL 1997 (1997)
Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: Proceedings of the KDD2003 (2003)
Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IJCAI-2003 (2003)
Piskorski, J.: Named Entity Recognition for Polish with SProUT. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds.) IMTCI 2004. LNCS (LNAI), vol. 3490, pp. 122–132. Springer, Heidelberg (2005)
Linden, K.: Multilingual modeling of cross-lingual spelling variants. Information Retrieval 9(3), 295–310 (2006)
Kukich, K.: Techniques for automatically correcting words in text. In: CSC ’93: Proceedings of the 1993 ACM conference on Computer science, Indianapolis, Indiana, United States, ACM Press, New York (1993)
Levenshtein, V.: Binary Codes for Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965)
Landau, G., Vishkin, U.: Fast Parallel and Serial Approximate String Matching. Journal of Algorithms 10, 157–169 (1997)
Needleman, S., Wunsch, C.: A General Method Applicable to Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48(3), 443–453 (1970)
Smith, T., Waterman, M.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 195–197 (1981)
Winkle, W.: The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Wachington, DC (1999)
Ukkonen, E.: Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science 92(1), 191–211 (1992)
Russell, R.: U.S. Patent 1,261,167 (1918), http://pattft.uspto.gov/netahtml/srchnum.htm
Monge, A., Elkan, C.: The Field Matching Problem: Algorithms and Applications. In: Proceedings of Knowledge Discovery and Data Mining 1996, pp. 267–270 (1996)
NetSprint, http://www.netsprint.pl
Cucerzan, S., Brill, E.: Spelling correction as an iterative process that exploits the collective knowledge of web users. In: Proceedings of EMNLP 2004, pp. 293–300 (2004)
Gravano, L., et al.: Using q-grams in a DBMS for Approximate String Processing. IEEE Data Engineering Bulletin 24(4), 28–34 (2001)
Bilenko, M., Mooney, R.: Employing trainable string similarity metrics for information integration. In: Proceedings of the IJCAI Workshop on Information Integration on the Web, Acapulco, Mexico (August 2003)
SimMetric, http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Piskorski, J., Sydow, M. (2007). String Distance Metrics for Reference Matching and Search Query Correction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2007. Lecture Notes in Computer Science, vol 4439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72035-5_27
Download citation
DOI: https://doi.org/10.1007/978-3-540-72035-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72034-8
Online ISBN: 978-3-540-72035-5
eBook Packages: Computer ScienceComputer Science (R0)