Abstract
An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| − 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2 ||U||− 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge, UK (1997)
Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific, Singapore (2002)
Allauzen, C., Mohri, M., Saraclar, M.: General Indexation of Weighted Automata – Application to Spoken Utterance Retrieval. In: Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval (HLT/NAACL 2004), Boston, Massachusetts, pp. 33–40 (2004)
Weinstein, E., Moreno, P.: Music Identification with Weighted Finite-State Transducers. In: Proceedings of ICASSP 2007, Honolulu, Hawaii (2007)
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 31–55 (1985)
Crochemore, M.: Transducers and repetitions. Theoretical Computer Science 45, 63–86 (1986)
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., McConnell, R.: Complete inverted files for efficient text retrieval and analysis. Journal of the ACM 34, 578–589 (1987)
Mohri, M.: Finite-state transducers in language and speech processing. Computational Linguistics 23, 269–311 (1997)
Mohri, M.: Statistical Natural Language Processing. In: Lothaire, M. (ed.) Applied Combinatorics on Words, Cambridge University Press, Cambridge (2005)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mohri, M., Moreno, P., Weinstein, E. (2007). Factor Automata of Automata and Applications. In: Holub, J., Žďárek, J. (eds) Implementation and Application of Automata. CIAA 2007. Lecture Notes in Computer Science, vol 4783. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76336-9_17
Download citation
DOI: https://doi.org/10.1007/978-3-540-76336-9_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76335-2
Online ISBN: 978-3-540-76336-9
eBook Packages: Computer ScienceComputer Science (R0)