Abstract
Provenance data are metadata that represent the source information or modification history of data. Because provenance information grows every time the source data are modified, it can become dozens of times larger than the original data. Schemes for efficiently compressing large-capacity provenance data are therefore required. In this paper, we propose a new resource description framework (RDF) provenance compression scheme that considers graph patterns. The proposed scheme reduces the space occupied by string data by converting the provenance data into numeric data through a dictionary encoding process. Unlike existing provenance compression schemes, the proposed scheme has some RDF documents manage the source RDF documents on the semantic web so that changes in the provenance data can be tracked. It reduces storage space by compressing the source RDF documents according to their patterns. It also compresses the provenance data according to the patterns of active nodes in the PROV model, which improves compression performance because the compression follows the provenance flow. A performance evaluation verified the effectiveness of the proposed scheme in terms of compression rate and processing time.
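The dictionary encoding step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `ex:` resource identifiers, and the choice of sequential integer IDs starting at 1 are assumptions made for illustration only. The sketch shows only the general idea that each distinct string term is stored once and triples are stored as integer tuples.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
EncodedTriple = Tuple[int, int, int]


def dictionary_encode(
    triples: List[Triple],
) -> Tuple[Dict[str, int], List[EncodedTriple]]:
    """Map every distinct term to an integer ID and rewrite triples as ID tuples.

    IDs are assigned in order of first appearance, starting at 1
    (an illustrative assumption, not taken from the paper).
    """
    term_to_id: Dict[str, int] = {}
    encoded: List[EncodedTriple] = []
    for s, p, o in triples:
        ids = []
        for term in (s, p, o):
            if term not in term_to_id:
                term_to_id[term] = len(term_to_id) + 1
            ids.append(term_to_id[term])
        encoded.append((ids[0], ids[1], ids[2]))
    return term_to_id, encoded


def dictionary_decode(
    term_to_id: Dict[str, int], encoded: List[EncodedTriple]
) -> List[Triple]:
    """Invert the dictionary to recover the original string triples."""
    id_to_term = {i: t for t, i in term_to_id.items()}
    return [(id_to_term[s], id_to_term[p], id_to_term[o]) for s, p, o in encoded]


# Hypothetical provenance triples; the ex: names are made up for this example.
triples: List[Triple] = [
    ("ex:doc2", "prov:wasDerivedFrom", "ex:doc1"),
    ("ex:doc2", "prov:wasGeneratedBy", "ex:edit1"),
    ("ex:edit1", "prov:used", "ex:doc1"),
]

dictionary, encoded = dictionary_encode(triples)
restored = dictionary_decode(dictionary, encoded)
```

In this sketch, a term that recurs across many triples, such as a document identifier, is stored as a string only once in the dictionary. The saving therefore grows with the amount of repetition in the provenance data.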
Acknowledgements
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis), by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (No. NRF-2017M3C4A7069432), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2019R1I1A1A01062289).
About this article
Cite this article
Bok, K., Han, J., Lim, J. et al. Provenance compression scheme based on graph patterns for large RDF documents. J Supercomput 76, 6376–6398 (2020). https://doi.org/10.1007/s11227-019-02926-2
DOI: https://doi.org/10.1007/s11227-019-02926-2