Abstract
Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal.
Work supported by the National Natural Science Foundation of China under Grant Nos. 61472405, 61070038.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Abiteboul, S., Bourhis, P., Vianu, V.: Highly expressive query languages for unordered data trees. In: Proceedings of the 15th International Conference on Database Theory, pp. 46–60 (2012)
Bex, G.J., Gelade, W., Martens, W., Neven, F.: Simplifying XML schema: effortless handling of nondeterministic regular expressions. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 731–744 (2009)
Boneva, I., Ciucanu, R., Staworko, S.: Simple schemas for unordered XML. arXiv preprint arXiv:1303.4277 (2013)
Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise DTDs from XML data. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 115–126. VLDB Endowment, September 2006
Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: Proceedings of the 33rd international conference on Very large data bases, pp. 998–1009. VLDB Endowment, September 2007
Bex, G.J., Wouter, G., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Transactions on the Web (TWEB) 4(4), 14 (2010)
Ignatiev, A., Morgado, A., Marques-Silva, J.: On reducing maximum independent set to minimum satisfiability. In: Sinz, C., Egly, U. (eds.) SAT 2014. LNCS, vol. 8561, pp. 103–120. Springer, Heidelberg (2014)
Clark, J.: Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html
Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)
Bailey, R.W.: The number of weak orderings of a finite set. Social Choice and Welfare 15(4), 559–562 (1998)
de Koninck, J.M.: Those Fascinating Numbers. American Mathematical Soc. (2009)
Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. In: Proceedings of the 16th International Conference on Database Theory, pp. 45–56 (2013)
Ciucanu, R., Staworko, S.: Learning schemas for unordered xml. arXiv preprint arXiv:1307.6348 (2013)
Gionis, A., Kujala, T., Mannila, H.: Fragments of order. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–136 (2003)
Mannila, H., Meek, C.: Global partial orders from sequential data. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 161–168 (2000)
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, vol. 1215, pp. 487–499. VLDB (1994)
Pei, J., Wang, H., Liu, J., Wang, K., et al.: Discovering frequent closed partial orders from strings. IEEE Transactions on Knowledge and Data Engineering 18(11), 1467–1481 (2006)
Miklau, G.: XMLData Repository, November 2002. http://www.cs.washington.edu/research/xmldatasets/
Boppana, R., Halld\(\acute{o}\)rsson, M.M.: Approximating maximum independent sets by excluding subgraphs. BIT Numerical Mathematics 32(2), 180–196 (1992)
Tsukiyama, S., Ide, M., Ariyoshi, H., Shirakawa, I.: A new algorithm for generating all the maximal independent sets. SIAM Journal on Computing 6(3), 505–517 (1977)
Algorithm to divide a set of symbols with constraints into minimun number of subsets. http://stackoverflow.com/q/29117747/4684328
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Peng, F., Chen, H. (2015). Discovering Restricted Regular Expressions with Interleaving. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science(), vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_9
Download citation
DOI: https://doi.org/10.1007/978-3-319-25255-1_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25254-4
Online ISBN: 978-3-319-25255-1
eBook Packages: Computer ScienceComputer Science (R0)