Abstract
In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size n, they use only n bits of disk working space while all previous approaches use Θ(n logn) bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses.
We also present a scan-based algorithm for inverting the BWT that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the BWT which is the fastest in the literature when the available working space is o(n) bits.
Finally, we prove lower bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space × number of passes over the disk data, showing that our algorithms are within an O(logn) factor of the optimal.
The first author has been partially supported by Yahoo! Research and FIRB Linguistica 2006. The second author has been partially funded by Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
We’re sorry, something doesn't seem to be working properly.
Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.
References
Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience 34(8), 711–726 (2004)
Chiang, Y., Goodrich, M., Grove, E., Tamassia, R., Vengroff, D., Vitter, J.: External-memory graph algorithms. In: ACM-SIAM SODA, pp. 139–149 (1995)
Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In: IEEE DCC (2008)
Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32(1), 1–35 (2002)
Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. ACM Journal of Experimental Algorithmics 12 (2008)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. Journal of the ACM 47(6), 987–1011 (2000)
Ferragina, P.: String search in external memory: Data structures and algorithms. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology (2005)
Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. CoRR, abs/0909.4341 (2009)
Ferragina, P., Giancarlo, R., Manzini, G.: The engineering of a compression boosting library. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 756–767. Springer, Heidelberg (2006)
Ferragina, P., Navarro, G.: The Pizza & Chili corpus home page (2007), http://pizzachili.dcc.uchile.cl/ , pizzachili.di.unipi.it
Franceschini, G., Muthukrishnan, S.: In-place suffix sorting. In: Arge, L., Cachin, C., Jurdziński, T., Tarlecki, A. (eds.) ICALP 2007. LNCS, vol. 4596, pp. 533–545. Springer, Heidelberg (2007)
Gagie, T.: On the value of multiple read/write streams for data compression. In: Kucherov, G., Ukkonen, E. (eds.) CPM 2009. LNCS, vol. 5577, pp. 68–77. Springer, Heidelberg (2009)
Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT trees and PAT arrays. In: Frakes, B., Baeza-Yates, R.A. (eds.) Information Retrieval: Data Structures and Algorithms, ch. 5, pp. 66–82 (1992)
González, R., Navarro, G.: A compressed text index on secondary memory. In: Proceedings IWOCA 2007, pp. 80–91. College Publications, UK (2007)
Hon, W., Sadakane, K., Sung, W.: Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput. 38, 2162–2178 (2009)
Hon, W., Lam, T., Sadakane, K., Sung, W., Yiu, S.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48, 23–36 (2007)
Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Theoretical Computer Science 387, 249–257 (2007)
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. Journal of the ACM 53(6), 918–936 (2006)
Mantaci, S., Restivo, A., Sciortino, M.: Burrows-Wheeler transform and Sturmian words. Information Processing Letters 86(5), 241–246 (2003)
Munro, J., Paterson, M.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)
Na, J., Park, K.: Alphabet-independent linear-time construction of compressed suffix arrays using o(n logn)-bit working space. TCS 386, 127–136 (2007)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
Sirén, J.: Compressed suffix arrays for massive data. In: Hyyro, H. (ed.) SPIRE 2009. LNCS, vol. 5721, pp. 63–74. Springer, Heidelberg (2009)
Vitter, J.: Algorithms and Data Structures for External Memory. Foundations and Trends in Theoretical Computer Science, vol. 2, p. 4. NOW (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferragina, P., Gagie, T., Manzini, G. (2010). Lightweight Data Indexing and Compression in External Memory. In: López-Ortiz, A. (eds) LATIN 2010: Theoretical Informatics. LATIN 2010. Lecture Notes in Computer Science, vol 6034. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12200-2_60
Download citation
DOI: https://doi.org/10.1007/978-3-642-12200-2_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12199-9
Online ISBN: 978-3-642-12200-2
eBook Packages: Computer ScienceComputer Science (R0)