Abstract
The lack of a descriptive schema for an RDF dataset has motivated several research works addressing the problem of automatic schema discovery. The goal of these approaches is to provide the underlying structural schema of a given RDF dataset, either from the existing instances, or using some schema-related declarations if provided. However, as the instances in the RDF dataset evolve, the generated schema may become inconsistent with the dataset. It is therefore necessary to incrementally update the existing schema according to the changes occurring in the dataset over time.
In this paper, we propose a schema discovery approach for massive RDF datasets which incrementally handles both the insertion and the deletion of entities. It is based on a scalable and incremental density-based clustering algorithm which propagates the changes occurring in the dataset to the clusters corresponding to the classes of the schema. Our approach is implemented using big data technologies to scale up to massive data while providing a high-quality clustering result. We present experiments which demonstrate the efficiency of our proposal on both synthetic and real datasets.
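To illustrate the general idea of propagating an insertion into density-based clusters, the following is a minimal single-machine sketch and not the algorithm of this chapter. It assumes that each entity is described by its set of property names, that similarity is measured with the Jaccard distance, and that the hypothetical parameters `eps` and `min_pts` play the usual DBSCAN roles; the class and method names are likewise illustrative. Only insertion is covered: deletion can split clusters and requires additional work that is not shown here.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


class IncrementalClusters:
    """Insertion-only density-based clustering (illustrative assumption)."""

    def __init__(self, eps: float, min_pts: int) -> None:
        self.eps = eps
        self.min_pts = min_pts  # neighbourhood size (self included) for a core entity
        self.props: Dict[str, FrozenSet[str]] = {}
        self.neighbours: Dict[str, Set[str]] = {}  # eps-neighbourhoods, self included
        self.parent: Dict[str, str] = {}  # union-find forest over core entities only

    def _find(self, x: str) -> str:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, a: str, b: str) -> None:
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra

    def _is_core(self, e: str) -> bool:
        return len(self.neighbours[e]) >= self.min_pts

    def insert(self, entity: str, properties: Iterable[str]) -> None:
        """Add one entity and update only the clusters in its neighbourhood."""
        props = frozenset(properties)
        # Linear scan for brevity; a scalable version would use an index.
        near = {o for o, p in self.props.items()
                if jaccard_distance(props, p) <= self.eps}
        self.props[entity] = props
        self.neighbours[entity] = near | {entity}

        new_cores: List[str] = []
        for o in near:
            was_core = self._is_core(o)
            self.neighbours[o].add(entity)
            if not was_core and self._is_core(o):
                new_cores.append(o)  # o just became dense enough
        if self._is_core(entity):
            new_cores.append(entity)

        for c in new_cores:
            self.parent.setdefault(c, c)
        # A new core entity joins (and possibly merges) the clusters of
        # every core entity in its neighbourhood.
        for c in new_cores:
            for n in self.neighbours[c]:
                if n in self.parent:
                    self._union(c, n)

    def cluster_of(self, entity: str) -> Optional[str]:
        """Return a cluster representative, or None if the entity is noise."""
        if entity in self.parent:
            return self._find(entity)
        core_near = sorted(n for n in self.neighbours[entity] if n in self.parent)
        return self._find(core_near[0]) if core_near else None


clusters = IncrementalClusters(eps=0.5, min_pts=3)
clusters.insert("p1", ["name", "birthDate", "knows"])
clusters.insert("p2", ["name", "birthDate"])
clusters.insert("p3", ["name", "birthDate", "knows", "email"])
print(clusters.cluster_of("p1") == clusters.cluster_of("p3"))
```

In this sketch each cluster stands for one candidate class of the schema, and an insertion touches only the entities within `eps` of the new one, which is what makes the update local rather than a full re-clustering.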
Copyright information
© 2022 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Bouhamoum, R., Kedad, Z., Lopes, S. (2022). Incremental Schema Generation for Large and Evolving RDF Sources. In: Hameurlain, A., Tjoa, A.M., Pacitti, E., Miklos, Z. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LI. Lecture Notes in Computer Science, vol 13410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66111-6_2
DOI: https://doi.org/10.1007/978-3-662-66111-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-66110-9
Online ISBN: 978-3-662-66111-6
eBook Packages: Computer Science (R0)