Abstract
With increasing number of DNA sequences being discovered the problem of storing and using genomic databases has become vital. Since DNA sequences consist of only four letters, two bits are sufficient to store each base. Many algorithms have been proposed in the recent past that push the bits/base limit further. The subtle patterns in DNA along with statistical inferences have been exploited to increase the compression ratio. From the compression perspective, the entire DNA sequences can be considered to be made of two types of sequences: repetitive and non-repetitive. The repetitive parts are compressed used dictionary-based schemes and non-repetitive sequences of DNA are usually compressed using general text compression schemes. In this paper, we present a memoization based encoding scheme for non-repeat DNA sequences. This scheme is incorporated with a DNA-specific compression algorithm, DNAPack, which is used for compression of DNA sequences. The results show that our method noticeably performs better than other techniques of its kind.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences and its application in genome comparison. genomic 12, 512–514 (2001)
Grumbach, S., Tahi, F.: Compression of dna sequences. In: Data compression conference, pp. 340–350 (1993)
Grumbach, S., Tahi, F.: A new challenge for compression algorithms genetic sequences. Journal of Information processing and Management 30, 866–875 (1994)
Matsumuto, T., Sadakane, K., Imai, H.: Biological sequences compression algorithms. In: Genome Information Ser. Workshop Genome Inform., vol. 11, pp. 43–52 (2000)
Rivals, E., Delahaye, J.-P., Dauchet, M., Delgrange, O.: A guaranteed compression scheme for repetitive dna sequences. LIFL Lille I Univerisity technical report, 285 (1995)
Willems, F.M.J., Shtralov, Y.M., Tjalkens, T.J.: The context tree weighting method:basic properties. IEE trans Inform Theory 41(3), 653–664 (1995)
Sadakane, K., Okazaki, T., Imai, H.: Implementing the context tree weighting method for text compression. In: DCC 2000: Proceedings of the Conference on Data Compression, Washington, DC, USA, p. 123. IEEE Computer Society, Los Alamitos (2000)
Rivals, E., Dauchet, M.: Fast discerning repeats in DNA sequences with a compression algorithm. In: Proc. Genome Informatics Workshop, pp. 215–226. Universal Academy Press, Tokyo (1997)
Sata, H., Yoshioka, T., Konagaya, A., Toyoda, T.: Dna compression in the post genomic era. Genome Informatics 12, 512–514 (2001)
Ziv, J., Limpel, A.: Compression of individual sequences using variable-rate encoding. IEE trans. Inform Theory 24, 530–536 (1978)
Ziv, J., Limpel, A.: A universal algorithm for sequential data compression. IEE trans. Inform. Theory 23(3), 337–343 (1977)
Sadel, I.: Universal data compression algorithm based on approximate string matching. In: Probability in the Engineering and Informational Sciences, pp. 465–486 (1996)
Chen, X., Kwong, S., Li, M.: A compression algorithm for dna sequences. IEEE Engineering in Medicine and biology Magazine 20(4), 61–66 (2001)
Li, M., Badger, J.H., Chen, J.H., Kwong, S., Kerney, P., Zhang, H.: An information based sequences distance and its application to whole mitochondrial genome. Bioinformatics 17(2), 149–154 (2001)
Chen, X., La, M., Ma, B., Tromp, J.: Dnacompress: fast and effective dna sequence compression. Bioinformatics 18, 1696–1698 (2002)
Ma, B., Tromp, J., Li, M.: Patternhunter-faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
Chang, C.: Dnac: A compression algorithm of dna sequences by non-overlapping approximate repeats. Master Thesis (2004)
Modegi, T.: Development of lossless compression techniques for biology information and its application for bioinformatics database retrieval. Genome Informatics (14), 695–696 (2003)
Zhang, Y., Parthe, R., Adjeroh, D.: Lossless compression of dna microarray images. csbw 0, 128–132 (2005)
Tan, Z., Cao, X., Ooi, B.C., Tung, A.K.H.: The ed-tree: An index for large dna sequence databases. ssdbm, 151 (2003)
Behzadi, B., Le Fessant, F.: Dna compression challenge revisited:a dynamic programming approach. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 190–200. Springer, Heidelberg (2005)
Apostolico, A., Lonardi, S.: Compression of biological sequences by greedy off-line textual substitution. dcc, 143 (2000)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Srinivasa, K.G., Jagadish, M., Venugopal, K.R., Patnaik, L.M. (2006). Non-repetitive DNA Sequence Compression Using Memoization. In: Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds) Biological and Medical Data Analysis. ISBMDA 2006. Lecture Notes in Computer Science(), vol 4345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946465_36
Download citation
DOI: https://doi.org/10.1007/11946465_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68063-5
Online ISBN: 978-3-540-68065-9
eBook Packages: Computer ScienceComputer Science (R0)