Abstract
We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
Our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. In addition, we show that approximating the compressibility with respect to LZ is related to approximating the support size of a distribution.
A full version of this paper is available [17]. These results appeared previously as part of a technical report [16].
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)
Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Sampling algorithms: lower bounds and applications. In: Proceedings of the thirty-third annual ACM symposium on Theory of computing, pp. 266–275. ACM Press, New York (2001)
Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. SIAM Journal on Computing 35(1), 132–150 (2005)
Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4) (2002). (See comment by Khmelev DV, Teahan WJ, Phys Rev Lett. 90(8):089803, (2003) and the reply Phys Rev Lett. 90(8):089804, 2003)
Brautbar, M., Samorodnitsky, A.: Approximating the entropy of large alphabets. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007)
Bunge, J.: Bibligraphy on estimating the number of classes in a population, http://www.stat.cornell.edu/~bunge/bibliography.htm
Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268–279. ACM, New York (2000)
Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Transactions on Information Theory 51(4), 1523–1545 (2005)
Cilibrasi, R., Vitányi, P.M.B.: Similarity of objects and the meaning of words. In: Cai, J.-Y., Cooper, S.B., Li, A. (eds.) TAMC 2006. LNCS, vol. 3959, pp. 21–45. Springer, Heidelberg (2006)
Cover, T., Thomas, J.: Elements of Information Theory. Wiley & Sons, Chichester (1991)
Kukushkina, O.V., Polikarpov, A.A., Khmelev, D.V.: Using literal and grammatical statistics for authorship attribution. Prob. Peredachi Inf. 37(2), 96–98 (2000) [Probl. Inf. Transm. ( Engl. Transl.) 37, 172–184 (2001)]
Lehman, E., Shelat, A.: Approximation algorithms for grammer-based compression. In: Proc. 18th Annual Symp. on Discrete Algorithms, pp. 205–212 (2002)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004) (Prelim. version in SODA 2003)
Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Heidelberg (1997)
Loewenstern, D., Hirsh, H., Noordewier, M., Yianilos, P.: DNA sequence classification using compression-based induction. Technical Report 95-04, Rutgers University, DIMACS (1995)
Raskhodnikova, S., Ron, D., Rubinfeld, R., Shpilka, A., Smith, A.: Sublinear algorithms for approximating string compressibility and the distribution support size. Electronic Colloquium on Computational Complexity, TR05-125 (2005)
Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.: Sublinear algorithms for approximating string compressibility. Full version of this paper, in preparation, Arxiv Report 0706.1084 [cs.DS] (June 2007)
Raskhodnikova, S., Ron, D., Shpilka, A., Smith, A.: On the difficulty of approximating the support size of a distribution (Manuscript 2007)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 337–343 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 530–536 (1978)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A. (2007). Sublinear Algorithms for Approximating String Compressibility. In: Charikar, M., Jansen, K., Reingold, O., Rolim, J.D.P. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2007 2007. Lecture Notes in Computer Science, vol 4627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74208-1_44
Download citation
DOI: https://doi.org/10.1007/978-3-540-74208-1_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74207-4
Online ISBN: 978-3-540-74208-1
eBook Packages: Computer ScienceComputer Science (R0)