Abstract
We study three problems of computing generic or discriminating words for a given collection of documents. Given a pattern P and a threshold d, we want to report (i) all longest extensions of P that occur in at least d documents, (ii) all shortest extensions of P that occur in fewer than d documents, and (iii) all shortest extensions of P that occur only in d selected documents. For these problems, we propose efficient algorithms based on suffix trees and advanced data structure techniques. For problem (i), we propose an optimal solution with constant running time per output word.
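To make the problem statements concrete, here is a minimal brute-force sketch of problems (i) and (ii) in Python. It is not the suffix-tree method of the paper and makes no attempt at the stated running time. It rests on assumptions the abstract does not spell out: an "extension" of P is taken to be a word of the form PQ (extension to the right only), "longest" is read as maximal (no one-character extension still occurs in at least d documents), and a discriminating word is required to occur in at least one document. The function names `doc_count`, `maximal_generic`, and `minimal_discriminating` are hypothetical.

```python
from typing import List


def doc_count(word: str, docs: List[str]) -> int:
    """Number of documents that contain `word` as a substring."""
    return sum(1 for doc in docs if word in doc)


def maximal_generic(pattern: str, docs: List[str], d: int) -> List[str]:
    """Problem (i), brute force: right-extensions of `pattern` that occur in
    at least d documents and cannot be extended by one more character
    without dropping below d documents (assumed reading of 'longest')."""
    if d < 1:
        raise ValueError("d must be at least 1")
    alphabet = sorted(set("".join(docs)))
    results = []
    # Start from the pattern itself only if it is already generic.
    frontier = [pattern] if doc_count(pattern, docs) >= d else []
    while frontier:
        word = frontier.pop()
        longer = [word + c for c in alphabet if doc_count(word + c, docs) >= d]
        if longer:
            frontier.extend(longer)   # still generic, keep extending
        else:
            results.append(word)      # no generic one-character extension
    return sorted(results)


def minimal_discriminating(pattern: str, docs: List[str], d: int) -> List[str]:
    """Problem (ii), brute force: shortest right-extensions of `pattern` that
    occur in at least one but fewer than d documents (the 'at least one'
    condition is an assumption made here to exclude absent words)."""
    if d < 1:
        raise ValueError("d must be at least 1")
    alphabet = sorted(set("".join(docs)))
    results = []
    frontier = [pattern]
    while frontier:
        word = frontier.pop()
        count = doc_count(word, docs)
        if count == 0:
            continue                  # word does not occur anywhere
        if count < d:
            results.append(word)      # first prefix to drop below d
        else:
            frontier.extend(word + c for c in alphabet)
    return sorted(results)


if __name__ == "__main__":
    docs = ["abcab", "abd", "xabc"]
    print(maximal_generic("ab", docs, 2))
    print(minimal_discriminating("ab", docs, 2))
```

Each call to `doc_count` rescans every document, so this sketch serves only as a reference for what the outputs of (i) and (ii) look like on small inputs. Problem (iii) is not covered.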
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kucherov, G., Nekrich, Y., Starikovskaya, T. (2012). Computing Discriminating and Generic Words. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds) String Processing and Information Retrieval. SPIRE 2012. Lecture Notes in Computer Science, vol 7608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_32
DOI: https://doi.org/10.1007/978-3-642-34109-0_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34108-3
Online ISBN: 978-3-642-34109-0
eBook Packages: Computer Science, Computer Science (R0)