{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,11,27]],"date-time":"2024-11-27T05:27:37Z","timestamp":1732685257975,"version":"3.28.2"},"reference-count":31,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2024,10,9]],"date-time":"2024-10-09T00:00:00Z","timestamp":1728432000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["SW"],"published-print":{"date-parts":[[2024,10,9]]},"abstract":"The explosion of the web and the abundance of linked data demand effective and efficient methods for storage, management, and querying. Apache Spark is one of the most widely used engines for big data processing, with more and more systems adopting it for efficient query answering. Existing approaches exploiting Spark for querying RDF data, adopt partitioning techniques for reducing the data that need to be accessed in order to improve efficiency. However, simplistic data partitioning fails, on one hand, to minimize data access and on the other hand to group data usually queried together. This is translated into limited improvement in terms of efficiency in query answering. In this paper, we present DIAERESIS, a novel platform that accepts as input an RDF dataset and effectively partitions it, minimizing data access and improving query answering efficiency. To achieve this, DIAERESIS first identifies the top-k most important schema nodes, i.e., the most important classes, as centroids and distributes the other schema nodes to the centroid they mostly depend on. Then, it allocates the corresponding instance nodes to the schema nodes they are instantiated under. Our algorithm enables fine-tuning of data distribution, significantly reducing data access for query answering. 
We experimentally evaluate our approach using both synthetic and real workloads, strictly dominating existing state-of-the-art, showing that we improve query answering in several cases by orders of magnitude.","DOI":"10.3233\/sw-243554","type":"journal-article","created":{"date-parts":[[2024,3,8]],"date-time":"2024-03-08T15:21:03Z","timestamp":1709911263000},"page":"1763-1789","source":"Crossref","is-referenced-by-count":1,"title":["DIAERESIS: RDF data partitioning and query processing on SPARK"],"prefix":"10.1177","volume":"15","author":[{"given":"Georgia","family":"Troullinou","sequence":"first","affiliation":[{"name":"FORTH-ICS, Heraklion, Crete, Greece"}]},{"given":"Giannis","family":"Agathangelos","sequence":"additional","affiliation":[{"name":"FORTH-ICS, Heraklion, Crete, Greece"}]},{"given":"Haridimos","family":"Kondylakis","sequence":"additional","affiliation":[{"name":"FORTH-ICS, Heraklion, Crete, Greece"}]},{"given":"Kostas","family":"Stefanidis","sequence":"additional","affiliation":[{"name":"Tampere University, Finland"}]},{"given":"Dimitris","family":"Plexousakis","sequence":"additional","affiliation":[{"name":"FORTH-ICS, Heraklion, Crete, Greece"}]}],"member":"179","reference":[{"key":"10.3233\/SW-243554_ref1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW.2018.00016"},{"issue":"3","key":"10.3233\/SW-243554_ref2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s00778-021-00711-3","article-title":"A survey of RDF stores & SPARQL engines for querying knowledge graphs","volume":"31","author":"Ali","year":"2022","journal-title":"VLDB J."},{"key":"10.3233\/SW-243554_ref3","doi-asserted-by":"crossref","unstructured":"M.\u00a0Armbrust, R.S.\u00a0Xin, C.\u00a0Lian, Y.\u00a0Huai, D.\u00a0Liu, J.K.\u00a0Bradley, X.\u00a0Meng, T.\u00a0Kaftan, M.J.\u00a0Franklin, A.\u00a0Ghodsi and M.\u00a0Zaharia, Spark SQL: Relational data processing in Spark, in: SIGMOD, 
2015.","DOI":"10.1145\/2723372.2742797"},{"key":"10.3233\/SW-243554_ref4","doi-asserted-by":"publisher","DOI":"10.1145\/3106426.3106534"},{"issue":"2\u20133","key":"10.3233\/SW-243554_ref5","doi-asserted-by":"publisher","first-page":"655","DOI":"10.1007\/s00778-019-00558-9","article-title":"An analytical study of large SPARQL query logs","volume":"29","author":"Bonifati","year":"2020","journal-title":"VLDB J."},{"issue":"2","key":"10.3233\/SW-243554_ref6","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1080\/0022250X.2001.9990249","article-title":"A faster algorithm for betweenness centrality","volume":"25","author":"Brandes","year":"2001","journal-title":"Journal of mathematical sociology"},{"key":"10.3233\/SW-243554_ref7","doi-asserted-by":"publisher","DOI":"10.1016\/j.cosrev.2020.100309"},{"key":"10.3233\/SW-243554_ref8","doi-asserted-by":"crossref","unstructured":"V.\u00a0Christophides, V.\u00a0Efthymiou and K.\u00a0Stefanidis, Entity Resolution in the Web of Data, Morgan & Claypool Publishers, 2015.","DOI":"10.1007\/978-3-031-79468-1"},{"key":"10.3233\/SW-243554_ref10","unstructured":"O.\u00a0Cur\u00e9, H.\u00a0Naacke, M.A.\u00a0Baazizi and B.\u00a0Amann, HAQWA: A hash-based and query workload aware distributed RDF store, in: ISWC P&D, 2015."},{"key":"10.3233\/SW-243554_ref11","doi-asserted-by":"publisher","DOI":"10.1109\/FiCloud.2017.48"},{"key":"10.3233\/SW-243554_ref12","doi-asserted-by":"crossref","unstructured":"D.\u00a0Graux, L.\u00a0Jachiet, P.\u00a0Genev\u00e8s and N.\u00a0Laya\u00efda, SPARQLGX in action: Efficient distributed evaluation of SPARQL with Apache Spark, in: ISWC, 2016.","DOI":"10.1007\/978-3-319-46547-0_9"},{"issue":"2\u20133","key":"10.3233\/SW-243554_ref13","doi-asserted-by":"publisher","first-page":"158","DOI":"10.1016\/j.websem.2005.06.005","article-title":"LUBM: A benchmark for OWL knowledge base systems","volume":"3","author":"Guo","year":"2005","journal-title":"J. 
Web Sem."},{"issue":"3","key":"10.3233\/SW-243554_ref14","doi-asserted-by":"publisher","first-page":"191","DOI":"10.1007\/s10619-023-07422-4","article-title":"S3QLRDF: Distributed SPARQL query processing using Apache Spark\u00a0\u2013 a comparative performance study","volume":"41","author":"Hassan","year":"2023","journal-title":"Distributed Parallel Databases"},{"key":"10.3233\/SW-243554_ref15","doi-asserted-by":"publisher","DOI":"10.1109\/ICOSC.2019.8665614"},{"key":"10.3233\/SW-243554_ref16","doi-asserted-by":"publisher","DOI":"10.1109\/SMDS49396.2020.00023"},{"key":"10.3233\/SW-243554_ref17","doi-asserted-by":"crossref","unstructured":"Q.-S.\u00a0Hua, H.\u00a0Fan, M.\u00a0Ai, L.\u00a0Qian, Y.\u00a0Li, X.\u00a0Shi and H.\u00a0Jin, Nearly optimal distributed algorithm for computing betweenness centrality, in: ICDCS, 2016.","DOI":"10.1109\/ICDCS.2016.89"},{"key":"10.3233\/SW-243554_ref18","doi-asserted-by":"crossref","unstructured":"N.\u00a0Kardoulakis, K.\u00a0Kellou-Menouer, G.\u00a0Troullinou, Z.\u00a0Kedad, D.\u00a0Plexousakis and H.\u00a0Kondylakis, HInT: Hybrid and incremental type discovery for large RDF data sources, in: SSDBM, 2021.","DOI":"10.1145\/3468791.3468808"},{"key":"10.3233\/SW-243554_ref19","unstructured":"L.\u00a0Kaufman and P.\u00a0Rousseeuw, Clustering by Means of Medoids, North-Holland, 1987."},{"issue":"4","key":"10.3233\/SW-243554_ref20","doi-asserted-by":"publisher","first-page":"675","DOI":"10.1007\/s00778-021-00717-x","article-title":"A survey on semantic schema discovery","volume":"31","author":"Kellou-Menouer","year":"2022","journal-title":"VLDB J."},{"key":"10.3233\/SW-243554_ref21","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989477"},{"key":"10.3233\/SW-243554_ref22","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-34002-426"},{"key":"10.3233\/SW-243554_ref23","doi-asserted-by":"crossref","unstructured":"A.\u00a0Madkour, A.M.\u00a0Aly and W.G.\u00a0Aref, WORQ: Workload-driven RDF query processing, in: ISWC, 2018, 
pp.\u00a0583\u2013599.","DOI":"10.1007\/978-3-030-00671-6_34"},{"key":"10.3233\/SW-243554_ref24","doi-asserted-by":"crossref","unstructured":"K.\u00a0M\u00f6ller, T.\u00a0Heath, S.\u00a0Handschuh and J.\u00a0Domingue, Recipes for Semantic Web Dog Food\u00a0\u2013 the ESWC and ISWC metadata projects, in: ISWC, 2007.","DOI":"10.1007\/978-3-540-76298-0_58"},{"key":"10.3233\/SW-243554_ref25","doi-asserted-by":"crossref","unstructured":"H.\u00a0Naacke, B.\u00a0Amann and O.\u00a0Cur\u00e9, SPARQL graph pattern processing with Apache Spark, in: GRADES@SIGMOD\/PODS, ACM, 2017, pp.\u00a01:1\u20131:7.","DOI":"10.1145\/3078447.3078448"},{"key":"10.3233\/SW-243554_ref26","doi-asserted-by":"crossref","unstructured":"A.\u00a0Pappas, G.\u00a0Troullinou, G.\u00a0Roussakis, H.\u00a0Kondylakis and D.\u00a0Plexousakis, Exploring importance measures for summarizing RDF\/S KBs, in: ESWC (1), Vol.\u00a010249, 2017, pp.\u00a0387\u2013403.","DOI":"10.1007\/978-3-319-58068-5_24"},{"key":"10.3233\/SW-243554_ref27","doi-asserted-by":"crossref","unstructured":"M.\u00a0Saleem, Q.\u00a0Mehmood and A.-C.\u00a0Ngonga Ngomo, FEASIBLE: A feature-based SPARQL benchmark generation framework, in: ISWC, 2015, pp.\u00a052\u201369.","DOI":"10.1007\/978-3-319-25007-6_4"},{"key":"10.3233\/SW-243554_ref28","doi-asserted-by":"crossref","unstructured":"A.\u00a0Sch\u00e4tzle, M.\u00a0Przyjaciel-Zablocki, T.\u00a0Berberich and G.\u00a0Lausen, S2X: Graph-parallel querying of RDF with GraphX, in: Big-O(Q)\/DMAH, 2015.","DOI":"10.1007\/978-3-319-41576-5_12"},{"issue":"10","key":"10.3233\/SW-243554_ref29","first-page":"804","article-title":"S2RDF: RDF querying with SPARQL on Spark","volume":"9","author":"Sch\u00e4tzle","year":"2016","journal-title":"PVLDB"},{"key":"10.3233\/SW-243554_ref31","doi-asserted-by":"crossref","unstructured":"G.\u00a0Troullinou, H.\u00a0Kondylakis, K.\u00a0Stefanidis and D.\u00a0Plexousakis, Exploring RDFS kbs using summaries, in: The Semantic Web\u2013ISWC 2018: 17th International 
Semantic Web Conference, Monterey, CA, USA, October 8\u201312, 2018, Proceedings, Part I 17, Springer, 2018, pp.\u00a0268\u2013284.","DOI":"10.1007\/978-3-030-00671-6_16"},{"key":"10.3233\/SW-243554_ref32","unstructured":"G.\u00a0Troullinou, H.\u00a0Kondylakis, K.\u00a0Stefanidis and D.\u00a0Plexousakis, RDFDigest+: A summary-driven system for KBs exploration, in: Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks Co-Located with 17th International Semantic Web Conference (ISWC 2018), Monterey, USA, October 8th-to-12th, 2018, M. van Erp, M. Atre, V. L\u00f3pez, K. Srinivas and C. Fortuna, eds, (CEUR Workshop Proceedings), Vol.\u00a02180 CEUR-WS.org, 2018, https:\/\/ceur-ws.org\/Vol-2180\/paper-73.pdf."},{"key":"10.3233\/SW-243554_ref34","unstructured":"M.\u00a0Zaharia, M.\u00a0Chowdhury, M.J.\u00a0Franklin, S.\u00a0Shenker and I.\u00a0Stoica, Spark: Cluster computing with working sets, in: HotCloud, 2010."}],"container-title":["Semantic Web"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/SW-243554","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,26]],"date-time":"2024-11-26T13:20:05Z","timestamp":1732627205000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.medra.org\/servlet\/aliasResolver?alias=iospress&doi=10.3233\/SW-243554"}},"subtitle":[],"editor":[{"given":"Aidan","family":"Hogan","sequence":"additional","affiliation":[{"name":"Universidad de Chile, Chile"}]}],"short-title":[],"issued":{"date-parts":[[2024,10,9]]},"references-count":31,"journal-issue":{"issue":"5"},"URL":"https:\/\/doi.org\/10.3233\/sw-243554","relation":{},"ISSN":["2210-4968","1570-0844"],"issn-type":[{"type":"electronic","value":"2210-4968"},{"type":"print","value":"1570-0844"}],"subject":[],"published":{"date-parts":[[2024,10,9]]}}}