Abstract
An XML document D often has a regular structure, i.e., it is composed of many similarly named and structured subtrees. Therefore, the entropy of a trees structuredness should be relatively low and thus the trees should be highly compressible by transforming them to an intermediate form. In general, this idea is used in permutation based XML-conscious compressors. An example of such a compressor is called XSAQCT, where the compressible form is called an annotated tree. While XSAQCT proved to be useful for various applications, it was never shown that it is a lossless compressor. This paper provides the formal background for the definition of an annotated tree, and a formal proof that the compression is lossless. It also shows properties of annotated trees that are useful for various applications, and discusses a measure of compressibility using this approach, followed by the experimental results showing compressibility of annotated trees.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
gzip -9 FILE.
- 2.
xz -9 -e FILE.
- 3.
bzip2 -9 FILE.
- 4.
ppmonstr -m1700 -o64 FILE.
- 5.
zpaq add FILE.zpaq FILE -method 69 -noattributes.
- 6.
paq8pxd_v7 -8 FILE.
References
XML: Extensible markup language (XML) 1.0 (Fifth edition) (2013). http://www.w3.org/tr/rec-xml/. Assessed October 2013
Busatto, G., Lohrey, M., Maneth, S.: Efficient memory representation of XML documents. In: Bierman, G., Koch, C. (eds.) DBPL 2005. LNCS, vol. 3774, pp. 199–216. Springer, Heidelberg (2005)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1), 4:1–4:33 (2009)
Busatto, G., Lohrey, M., Maneth, S.: Efficient memory representation of XML document trees. Inf. Syst. 33(4–5), 456–474 (2008)
Arion, A., Bonifati, A., Manolescu, I., Pugliese, A.: XQueC: a query-conscious compressed XML database. ACM Trans. Internet Technol. 7(2), 1–32 (2007)
GZIP: The gzip home page (2013). http://www.gzip.org. Assessed October 2013
bzip2: bzip2 compression (2013). http://www.bzip.org/. Assessed October 2013
Müldner, T., Fry, C., Miziołek, J., Durno, S.: XSAQCT: XML queryable compressor. In: Balisage: The Markup Conference 2009, Montreal, Canada, August 2009
Müldner, T., Miziołek, J., Corbin, T.: Annotated trees and their applications to XML compression. In: The Tenth International Conference on Web Information Systems and Technologies, WEBIST, Barcelona, Spain, pp. 27–39 (2014)
Müldner, T., Corbin, T., Miziołek, J., Fry, C.: Design and implementation of an online XML compressor for large XML files. Int. J. Adv. Internet Technol. 5(3), 115–118 (2012)
xmlgen: The benchmark data generator (2013). http://www.xml-benchmark.org/generator.html. Assessed October 2013
Baseball.xml: baseball.xml (2013). http://rassyndrome.webs.com/cc/baseball.xml. Assessed October 2013
Corpus, W.: Wratislavia XML corpus (2013). http://www.ii.uni.wroc.pl/~inikep/research/wratislavia/. Assessed October 2013
Consortium, T.U.: Update on activities at the Universal Protein Resource (UniProt) in 2013 (January 2013). http://dx.doi.org/10.1093/nar/gks1068. Assessed on 20 June 2013
enwiki dumps: enwiki-latest.xml (2013). http://dumps.wikimedia.org/enwiki/latest/. Assessed October 2013
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23(3), 337–343 (2006)
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation (1994)
ZPAQ: Zpaq (2013). http://www.w3.org/tr/rec-xml/. Assessed October 2013
Mahoney, M.: Large Text Compression Benchmark (2012). http://mattmahoney.net/dc/zpaq.html. Assessed October 2013
Gottlob, G., Koch, C., Pichler, R.: Efficient algorithms for processing xpath queries. ACM Trans. Database Syst. 30(2), 444–491 (2005)
Acknowledgements
The work of the first and third authors are partially supported by the NSERC RGPIN grant and NSERC CSG-M (Canada Graduate Scholarship-Masters) grant respectively.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Müldner, T., Miziołek, J.K., Corbin, T. (2015). Permutation Based XML Compression. In: Monfort, V., Krempels, KH. (eds) Web Information Systems and Technologies. WEBIST 2014. Lecture Notes in Business Information Processing, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-319-27030-2_5
Download citation
DOI: https://doi.org/10.1007/978-3-319-27030-2_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27029-6
Online ISBN: 978-3-319-27030-2
eBook Packages: Computer ScienceComputer Science (R0)