Abstract
Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency, but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms, where in this paper we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case join algorithm called Leapfrog Triejoin. We propose an adaptation of this algorithm for evaluating SPARQL queries, and implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
We use \(\mathbf I \mathbf B \) as a shortcut for \(\mathbf I \cup \mathbf B \), etc.
- 2.
Other features like BIND, VALUES, SERVICE, etc., that generate or extend mappings can be evaluated in the standard way.
- 3.
Following [18], we use the notation (s|?, p|?, o|?) to denote eight forms of triple patterns where, for example, (?, p, o) refers to the set of triple patterns with variable subject, constant predicate and constant object: \(\mathbf V \times \mathbf I \times \mathbf I \mathbf L \).
- 4.
Such an order would be produced by SPARQL engines in practice if we had a graph with many :father and :mother relations, outnumbering :winner relations.
- 5.
We currently consider querying over a single RDF graph; if we were to consider a complete index on quads in order to support named graphs, the number of required indexes would jump to 24. In such a case, however, practical steps can be taken to reduce the number of indexes where, for example, some such orders will be rarely accessed by real-world queries and can thus be removed.
References
Apache Jena. https://jena.apache.org/. Accessed 30 Dec 2018
Github project. https://gqgh5wfgzt.github.io/benchmark-leapfrog/
SPARQL 1.1 Query Language. https://www.w3.org/TR/sparql11-query/. Accessed 30 Dec 2018
Aberger, C.R., Lamb, A., Tu, S., Nötzli, A., Olukotun, K., Ré, C.: EmptyHeaded: a relational engine for graph processing. ACM Trans. Database Syst. (TODS) 42(4), 20 (2017)
Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In: Principles of Database Systems (PODS), pp. 13–28. ACM (2016)
Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 197–212. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_13
Angles, R., Gutierrez, C.: The multiset semantics of SPARQL patterns. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_2
Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y.: SPARQL web-querying infrastructure: ready for action? ISWC 2013. LNCS, vol. 8219, pp. 277–293. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_18
Aref, M.: Design and implementation of the LogicBlox system. In: SIGMOD International Conference on Management of Data, pp. 1371–1382. ACM (2015)
Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In: World Wide Web (WWW), pp. 41–50 (2010)
Atserias, A., Grohe, M., Marx, D.: Size bounds and query plans for relational joins. In: Foundations of Computer Science (FOCS), pp. 739–748. IEEE (2008)
Barbay, J., Kenyon, C.: Adaptive intersection and t-threshold problems. In: Symposium on Discrete Algorithms (SODA), pp. 390–399. Society for Industrial and Applied Mathematics (2002)
Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. Int. J. Semant. Web Inf. Syst. (IJSWIS) 5(2), 1–24 (2009)
Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. PVLDB 11(2), 149–161 (2017)
Demaine, E.D., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Symposium on Discrete Algorithms (SODA). Citeseer (2000)
Erling, O., Mikhailov, I.: RDF support in the virtuoso DBMS. In: Pellegrini, T., Auer, S., Tochtermann, K., Schaffert, S. (eds.) Networked Knowledge - Networked Media. Studies in Computational Intelligence, vol. 221, pp. 7–24. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02184-8_2
Galkin, M., Endris, K.M., Acosta, M., Collarana, D., Vidal, M., Auer, S.: SMJoin: a multi-way join operator for SPARQL queries. In: International Conference on Semantic Systems (SEMANTICS), pp. 104–111 (2017)
Harth, A., Decker, S.: Optimized index structures for querying RDF from the Web. In: Latin American Web Congress (LA-Web 2005), pp. 71–80 (2005)
Kalinsky, O., Etsion, Y., Kimelfeld B.: Flexible caching in Trie joins. In: International Conference on Extending Database Technology (EDBT), pp. 282–293. Springer (2017). https://doi.org/10.5441/002/edbt.2017.26
Kalinsky, O., Mishali, O., Hogan, A., Etsion, Y., Kimelfeld, B.: Efficiently charting RDF. CoRR, abs/1811.10955 (2018)
Khamis, M.A., Ngo, H.Q., Ré, C., Rudra, A.: Joins via geometric resolutions: worst case and beyond. ACM Trans. Database Syst. (TODS) 41(4), 22 (2016)
Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 376–394. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_23
Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. PVLDB 1(1), 647–659 (2008)
Ngo, H.Q.: Worst-case optimal join algorithms: techniques, results, and open problems. In: Principles of Database Systems (PODS), pp. 111–124. ACM (2018)
Ngo, H.Q., Nguyen, D.T., Re, C., Rudra, A.: Beyond worst-case analysis for joins with minesweeper. In: Principles of Database Systems (PODS), pp. 234–245. ACM (2014)
Ngo, H.Q., Porat, E., Ré, C., Rudra, A.: Worst-case optimal join algorithms. In: Principles of Database Systems (PODS), pp. 37–48. ACM (2012)
Ngo, H.Q., Ré, C., Rudra, A.: Skew strikes back: new developments in the theory of join algorithms. arXiv preprint arXiv:1310.3314 (2013)
Nguyen, D., et al.: Join processing for graph patterns: an old dog with new tricks. In: GRADES, p. 2. ACM (2015)
Ramakrishnan, R., Gehrke, J.: Database Management Systems. McGraw Hill, New York (2000)
Thompson, B.B., Personick, M., Cutcher, M.: The Bigdata®RDF graph database. In: Linked Data Management, pp. 193–237 (2014)
Veldhuizen, T.L.: Leapfrog Triejoin: a simple, worst-case optimal join algorithm. In: ICDT, pp. 96–106 (2014)
Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 78–85 (2014)
Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)
Acknowledgements
This work was supported by the Millennium Institute for Foundational Research on Data (IMFD) and by Fondecyt Grant No. 1181896.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
A Proof of Theorem 1
A Proof of Theorem 1
Fix a basic graph pattern P, an RDF graph G, and \(\mathtt {?x}_1, \ldots , \mathtt {?x}_n\) the chosen variable order. Further assume that the computation of \({\text {LF}}_G(P',\mathtt {?x})\) takes time at most \(\min _{t \in P' : \mathtt {?x}\in {\text {var}}(t)} |\pi _{\mathtt {?x}}(\llbracket t \rrbracket _G)| \, \cdot \, \log (|G|)\) for every basic graph pattern \(P'\) (for simplicity, we will omit the trivial empty case where there exists \(t \in P'\) such that \(\mathtt {?x}\in {\text {var}}(t)\) and \(|\pi _{\mathtt {?x}}(\llbracket t \rrbracket _G)| = 0\) since the time taken will be simply \(\log (|G|)\)). Finally, for every RDF graph \(G'\) we will say that \(G'\) is of size less than G whenever \(|\llbracket t \rrbracket _{G_i}| \le |\llbracket t \rrbracket _{G}|\) for all \(t \in P\) (recall that P is fixed).
The proof of Theorem 1 goes in two steps. First, we will bound the time of Leapfrog Join by bounding the time \(T_i\) of each for-loop \(\mathtt {?x}_i\) of Algorithm 1. Then for each level \(\mathtt {?x}_i\) we define a new RDF graph \(G_{i}\) of size less than G such that \(T_i = |\llbracket P \rrbracket _{G_i}| \le 2^{\text {MIN}(P, G)}\). The proof will follow by taking the sum over all \(T_i\).
Fix a variable \(\mathtt {?x}_i\) and denote by \(\bar{x}_{i-1} = \mathtt {?x}_1, \ldots , \mathtt {?x}_{i-1}\) the order of variables before \(\mathtt {?x}_i\) (for the sake of simplification, in the sequel we consider \(\bar{x}_i\) also as a set). We start by bounding the time of the for-loop in Algorithm 1 corresponding to \(\mathtt {?x}_i\). For this, consider the following extension of \({\text {LF}}_G\) over \(\bar{x}_{i-1}\):
Clearly, the number of times that the for-loop of \(\mathtt {?x}_i\) will be called is given by \(|{\text {LF}}_G(P,\bar{x}_{i-1})|\). Then for each \(\mu \in {\text {LF}}_G(P,\bar{x}_{i-1})\) the Leapfrog procedure \({\text {LF}}_G(\mu (P),\mathtt {?x}_i)\) is called taking time at most \(\min _{t \in \mu (P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\) (omitting the \(\log (|G|)\) factor for the moment). If we call \(T_i\) the number of steps that Algorithm 1 spends in the for-loop of \(\mathtt {?x}_i\), we have that:
One can easily check that the total time of Algorithm 1 is given by \((\sum _{i=1}^n T_i)\cdot \log (G)\). Therefore, if we bound \(T_i\) by the AGM bound of P and G, then the worst case optimality of Leapfrog Join will be proven (recall that our analysis is in data complexity, omitting factors that depend on the size of P).
To bound the size of \(T_i\), we build an RDF graph \(G_i\) such that \(T_i = |\llbracket P \rrbracket _{G_i}|\) and the size of \(G_i\) is less that the size of G. Let \(\bot \) be a dummy value. To build \(G_i\) define the set of mappings \(U_i\) such that \(\mu \in U_i\) if and only if there exists \(\mu ' \in {\text {LF}}_G(P,\bar{x}_{i-1})\) such that:
-
1.
ep2mm
-
1.
\(\mu (\mathtt {?x}) = \mu '(\mathtt {?x})\) for every \(\mathtt {?x}\in \bar{x}_{i-1}\),
-
2.
\(1 \le \mu (\mathtt {?x}_i) \le \min _{t \in \mu '(P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\), and
-
3.
\(\mu (\mathtt {?x}) = \bot \) for every \(\mathtt {?x}\in \{\mathtt {?x}_{i+1}, \ldots , \mathtt {?x}_{n}\}\).
In other words, \(U_i\) contains all mappings built from mappings of \({\text {LF}}_G(P,\bar{x}_{i-1})\) and extended by assigning to \(\mathtt {?x}_i\) any value less than the time for computing \({\text {LF}}_G(\mu '(P),\mathtt {?x}_i)\). From \(U_i\) we can build the RDF graph \(G_i\) as follows:
By construction, note that the size of \(G_i\) is less than the size of G. Furthermore, we have that \(T_i = |\llbracket P \rrbracket _{G_i}|\). Indeed, for each \(\mu ' \in {\text {LF}}_G(P,\bar{x}_{i-1})\) we will have \(\big (\min _{t \in \mu '(P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\big )\) different mappings in \(\llbracket P \rrbracket _{G_i}\) and vice versa.
To finish the proof, recall the linear program associated to P and G, and its minimum value \(\text {MIN}(P,G)\). Consider also the same linear program but now for P and \(G_i\). Given that \(G_i\) is of size less than G, then the minimization function associated to the linear program of P and \(G_i\) always satisfies:
Therefore, we can conclude that \(\text {MIN}(P,G_i) \le \text {MIN}(P,G)\) and thus:
where the second inequality follows by the AGM bound. Given that each \(T_i\) is bounded by \(2^{\text {MIN}(P,G)}\) we conclude that the overall time is bounded by \(n\cdot 2^{\text {MIN}(P,G)} \cdot \log (G)\) and that Leapfrog Join is worst-case optimal.
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hogan, A., Riveros, C., Rojas, C., Soto, A. (2019). A Worst-Case Optimal Join Algorithm for SPARQL. In: Ghidini, C., et al. The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science(), vol 11778. Springer, Cham. https://doi.org/10.1007/978-3-030-30793-6_15
Download citation
DOI: https://doi.org/10.1007/978-3-030-30793-6_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30792-9
Online ISBN: 978-3-030-30793-6
eBook Packages: Computer ScienceComputer Science (R0)