A Worst-Case Optimal Join Algorithm for SPARQL

Hogan, Aidan; Riveros, Cristian; Rojas, Carlos; Soto, Adrián

doi:10.1007/978-3-030-30793-6_15

Aidan Hogan^18,19,
Cristian Riveros^17,19,
Carlos Rojas^17,19 &
…
Adrián Soto^17,19

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11778))

Included in the following conference series:

International Semantic Web Conference

2881 Accesses
27 Citations

Abstract

Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency, but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms, where in this paper we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case join algorithm called Leapfrog Triejoin. We propose an adaptation of this algorithm for evaluating SPARQL queries, and implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: JPY 3498; Price includes VAT (Japan)

eBook: JPY 5719; Price includes VAT (Japan)

Softcover Book: JPY 7149; Price includes VAT (Japan)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Article 23 June 2021

SMat-J: A Sparse Matrix-Based Join for SPARQL Query Processing

Efficient Evaluation of Conjunctive Regular Path Queries Using Multi-way Joins

Notes

1.
We use $\mathbf I \mathbf B $ as a shortcut for $\mathbf I \cup \mathbf B $, etc.
2.
Other features like BIND, VALUES, SERVICE, etc., that generate or extend mappings can be evaluated in the standard way.
3.
Following [18], we use the notation (s|?, p|?, o|?) to denote eight forms of triple patterns where, for example, (?, p, o) refers to the set of triple patterns with variable subject, constant predicate and constant object: $\mathbf V \times \mathbf I \times \mathbf I \mathbf L $.
4.
Such an order would be produced by SPARQL engines in practice if we had a graph with many :father and :mother relations, outnumbering :winner relations.
5.
We currently consider querying over a single RDF graph; if we were to consider a complete index on quads in order to support named graphs, the number of required indexes would jump to 24. In such a case, however, practical steps can be taken to reduce the number of indexes where, for example, some such orders will be rarely accessed by real-world queries and can thus be removed.

References

Apache Jena. https://jena.apache.org/. Accessed 30 Dec 2018
Github project. https://gqgh5wfgzt.github.io/benchmark-leapfrog/
SPARQL 1.1 Query Language. https://www.w3.org/TR/sparql11-query/. Accessed 30 Dec 2018
Aberger, C.R., Lamb, A., Tu, S., Nötzli, A., Olukotun, K., Ré, C.: EmptyHeaded: a relational engine for graph processing. ACM Trans. Database Syst. (TODS) 42(4), 20 (2017)
Article MathSciNet Google Scholar
Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In: Principles of Database Systems (PODS), pp. 13–28. ACM (2016)
Google Scholar
Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 197–212. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_13
Chapter Google Scholar
Angles, R., Gutierrez, C.: The multiset semantics of SPARQL patterns. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_2
Chapter Google Scholar
Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y.: SPARQL web-querying infrastructure: ready for action? ISWC 2013. LNCS, vol. 8219, pp. 277–293. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_18
Chapter Google Scholar
Aref, M.: Design and implementation of the LogicBlox system. In: SIGMOD International Conference on Management of Data, pp. 1371–1382. ACM (2015)
Google Scholar
Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In: World Wide Web (WWW), pp. 41–50 (2010)
Google Scholar
Atserias, A., Grohe, M., Marx, D.: Size bounds and query plans for relational joins. In: Foundations of Computer Science (FOCS), pp. 739–748. IEEE (2008)
Google Scholar
Barbay, J., Kenyon, C.: Adaptive intersection and t-threshold problems. In: Symposium on Discrete Algorithms (SODA), pp. 390–399. Society for Industrial and Applied Mathematics (2002)
Google Scholar
Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. Int. J. Semant. Web Inf. Syst. (IJSWIS) 5(2), 1–24 (2009)
Article Google Scholar
Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. PVLDB 11(2), 149–161 (2017)
Google Scholar
Demaine, E.D., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Symposium on Discrete Algorithms (SODA). Citeseer (2000)
Google Scholar
Erling, O., Mikhailov, I.: RDF support in the virtuoso DBMS. In: Pellegrini, T., Auer, S., Tochtermann, K., Schaffert, S. (eds.) Networked Knowledge - Networked Media. Studies in Computational Intelligence, vol. 221, pp. 7–24. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02184-8_2
Chapter Google Scholar
Galkin, M., Endris, K.M., Acosta, M., Collarana, D., Vidal, M., Auer, S.: SMJoin: a multi-way join operator for SPARQL queries. In: International Conference on Semantic Systems (SEMANTICS), pp. 104–111 (2017)
Google Scholar
Harth, A., Decker, S.: Optimized index structures for querying RDF from the Web. In: Latin American Web Congress (LA-Web 2005), pp. 71–80 (2005)
Google Scholar
Kalinsky, O., Etsion, Y., Kimelfeld B.: Flexible caching in Trie joins. In: International Conference on Extending Database Technology (EDBT), pp. 282–293. Springer (2017). https://doi.org/10.5441/002/edbt.2017.26
Kalinsky, O., Mishali, O., Hogan, A., Etsion, Y., Kimelfeld, B.: Efficiently charting RDF. CoRR, abs/1811.10955 (2018)
Google Scholar
Khamis, M.A., Ngo, H.Q., Ré, C., Rudra, A.: Joins via geometric resolutions: worst case and beyond. ACM Trans. Database Syst. (TODS) 41(4), 22 (2016)
Article MathSciNet Google Scholar
Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 376–394. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_23
Chapter Google Scholar
Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. PVLDB 1(1), 647–659 (2008)
Google Scholar
Ngo, H.Q.: Worst-case optimal join algorithms: techniques, results, and open problems. In: Principles of Database Systems (PODS), pp. 111–124. ACM (2018)
Google Scholar
Ngo, H.Q., Nguyen, D.T., Re, C., Rudra, A.: Beyond worst-case analysis for joins with minesweeper. In: Principles of Database Systems (PODS), pp. 234–245. ACM (2014)
Google Scholar
Ngo, H.Q., Porat, E., Ré, C., Rudra, A.: Worst-case optimal join algorithms. In: Principles of Database Systems (PODS), pp. 37–48. ACM (2012)
Google Scholar
Ngo, H.Q., Ré, C., Rudra, A.: Skew strikes back: new developments in the theory of join algorithms. arXiv preprint arXiv:1310.3314 (2013)
Nguyen, D., et al.: Join processing for graph patterns: an old dog with new tricks. In: GRADES, p. 2. ACM (2015)
Google Scholar
Ramakrishnan, R., Gehrke, J.: Database Management Systems. McGraw Hill, New York (2000)
MATH Google Scholar
Thompson, B.B., Personick, M., Cutcher, M.: The Bigdata®RDF graph database. In: Linked Data Management, pp. 193–237 (2014)
Google Scholar
Veldhuizen, T.L.: Leapfrog Triejoin: a simple, worst-case optimal join algorithm. In: ICDT, pp. 96–106 (2014)
Google Scholar
Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 78–85 (2014)
Article Google Scholar
Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)
Google Scholar

Download references

Acknowledgements

This work was supported by the Millennium Institute for Foundational Research on Data (IMFD) and by Fondecyt Grant No. 1181896.

Author information

Authors and Affiliations

Pontificia Universidad Católica de Chile, Santiago, Chile
Cristian Riveros, Carlos Rojas & Adrián Soto
DCC, Universidad de Chile, Santiago, Chile
Aidan Hogan
Millennium Institute for Foundational Research on Data, Santiago, Chile
Aidan Hogan, Cristian Riveros, Carlos Rojas & Adrián Soto

Authors

Aidan Hogan
View author publications
You can also search for this author in PubMed Google Scholar
Cristian Riveros
View author publications
You can also search for this author in PubMed Google Scholar
Carlos Rojas
View author publications
You can also search for this author in PubMed Google Scholar
Adrián Soto
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Aidan Hogan .

Editor information

Editors and Affiliations

Fondazione Bruno Kessler, Trento, Italy
Chiara Ghidini
Linköping University, Linköping, Sweden
Olaf Hartig
University of Bonn, Bonn, Germany
Maria Maleshkova
University of Economics Prague, Prague, Czech Republic
Vojtěch Svátek
University of Illinois at Chicago, Chicago, IL, USA
Isabel Cruz
University of Chile, Santiago, Chile
Aidan Hogan
Memect Technology, Beijing, China
Jie Song
Mines Saint-Etienne, Saint-Etienne, France
Maxime Lefrançois
Inria Sophia Antipolis - Méditerranée, Sophia Antipolis, France
Fabien Gandon

A Proof of Theorem 1

Fix a basic graph pattern P, an RDF graph G, and $\mathtt {?x}_1, \ldots , \mathtt {?x}_n$ the chosen variable order. Further assume that the computation of ${\text {LF}}_G(P',\mathtt {?x})$ takes time at most $\min _{t \in P' : \mathtt {?x}\in {\text {var}}(t)} |\pi _{\mathtt {?x}}(\llbracket t \rrbracket _G)| \, \cdot \, \log (|G|)$ for every basic graph pattern $P'$ (for simplicity, we will omit the trivial empty case where there exists $t \in P'$ such that $\mathtt {?x}\in {\text {var}}(t)$ and $|\pi _{\mathtt {?x}}(\llbracket t \rrbracket _G)| = 0$ since the time taken will be simply $\log (|G|)$). Finally, for every RDF graph $G'$ we will say that $G'$ is of size less than G whenever $|\llbracket t \rrbracket _{G_i}| \le |\llbracket t \rrbracket _{G}|$ for all $t \in P$ (recall that P is fixed).

The proof of Theorem 1 goes in two steps. First, we will bound the time of Leapfrog Join by bounding the time $T_i$ of each for-loop $\mathtt {?x}_i$ of Algorithm 1. Then for each level $\mathtt {?x}_i$ we define a new RDF graph $G_{i}$ of size less than G such that $T_i = |\llbracket P \rrbracket _{G_i}| \le 2^{\text {MIN}(P, G)}$. The proof will follow by taking the sum over all $T_i$.

Fix a variable $\mathtt {?x}_i$ and denote by $\bar{x}_{i-1} = \mathtt {?x}_1, \ldots , \mathtt {?x}_{i-1}$ the order of variables before $\mathtt {?x}_i$ (for the sake of simplification, in the sequel we consider $\bar{x}_i$ also as a set). We start by bounding the time of the for-loop in Algorithm 1 corresponding to $\mathtt {?x}_i$. For this, consider the following extension of ${\text {LF}}_G$ over $\bar{x}_{i-1}$:

$$ {\text {LF}}_G(P,\bar{x}_{i-1}) \ = \ \{\mu \mid {\text {dom}}(\mu ) = \bar{x}_{i-1} \text { and } \llbracket \mu (t) \rrbracket _G \ne \varnothing \text { for all } t \in P\} $$

Clearly, the number of times that the for-loop of $\mathtt {?x}_i$ will be called is given by $|{\text {LF}}_G(P,\bar{x}_{i-1})|$. Then for each $\mu \in {\text {LF}}_G(P,\bar{x}_{i-1})$ the Leapfrog procedure ${\text {LF}}_G(\mu (P),\mathtt {?x}_i)$ is called taking time at most $\min _{t \in \mu (P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|$ (omitting the $\log (|G|)$ factor for the moment). If we call $T_i$ the number of steps that Algorithm 1 spends in the for-loop of $\mathtt {?x}_i$, we have that:

$$ T_i \ = \ \sum _{\mu \in {\text {LF}}_G(P,\bar{x}_{i-1})} \min _{t \in \mu (P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)| $$

One can easily check that the total time of Algorithm 1 is given by $(\sum _{i=1}^n T_i)\cdot \log (G)$. Therefore, if we bound $T_i$ by the AGM bound of P and G, then the worst case optimality of Leapfrog Join will be proven (recall that our analysis is in data complexity, omitting factors that depend on the size of P).

To bound the size of $T_i$, we build an RDF graph $G_i$ such that $T_i = |\llbracket P \rrbracket _{G_i}|$ and the size of $G_i$ is less that the size of G. Let $\bot $ be a dummy value. To build $G_i$ define the set of mappings $U_i$ such that $\mu \in U_i$ if and only if there exists $\mu ' \in {\text {LF}}_G(P,\bar{x}_{i-1})$ such that:

1.
ep2mm
1.
$\mu (\mathtt {?x}) = \mu '(\mathtt {?x})$ for every $\mathtt {?x}\in \bar{x}_{i-1}$,
2.
$1 \le \mu (\mathtt {?x}_i) \le \min _{t \in \mu '(P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|$, and
3.
$\mu (\mathtt {?x}) = \bot $ for every $\mathtt {?x}\in \{\mathtt {?x}_{i+1}, \ldots , \mathtt {?x}_{n}\}$.

In other words, $U_i$ contains all mappings built from mappings of ${\text {LF}}_G(P,\bar{x}_{i-1})$ and extended by assigning to $\mathtt {?x}_i$ any value less than the time for computing ${\text {LF}}_G(\mu '(P),\mathtt {?x}_i)$. From $U_i$ we can build the RDF graph $G_i$ as follows:

$$ G_i \ = \ \bigcup _{\mu \in U_i} \mu (P). $$

By construction, note that the size of $G_i$ is less than the size of G. Furthermore, we have that $T_i = |\llbracket P \rrbracket _{G_i}|$. Indeed, for each $\mu ' \in {\text {LF}}_G(P,\bar{x}_{i-1})$ we will have $\big (\min _{t \in \mu '(P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\big )$ different mappings in $\llbracket P \rrbracket _{G_i}$ and vice versa.

To finish the proof, recall the linear program associated to P and G, and its minimum value $\text {MIN}(P,G)$. Consider also the same linear program but now for P and $G_i$. Given that $G_i$ is of size less than G, then the minimization function associated to the linear program of P and $G_i$ always satisfies:

$$ \sum _{t\in P} x_t \cdot \log (|\llbracket t \rrbracket _{G_i}|) \le \sum _{t\in P} x_t \cdot \log (|\llbracket t \rrbracket _{G}|). $$

Therefore, we can conclude that $\text {MIN}(P,G_i) \le \text {MIN}(P,G)$ and thus:

$$ T_i \ = \ |\llbracket P \rrbracket _{G_i}| \ \le \ 2^{\text {MIN}(P,G_i)} \ \le \ 2^{\text {MIN}(P,G)} $$

where the second inequality follows by the AGM bound. Given that each $T_i$ is bounded by $2^{\text {MIN}(P,G)}$ we conclude that the overall time is bounded by $n\cdot 2^{\text {MIN}(P,G)} \cdot \log (G)$ and that Leapfrog Join is worst-case optimal.

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Hogan, A., Riveros, C., Rojas, C., Soto, A. (2019). A Worst-Case Optimal Join Algorithm for SPARQL. In: Ghidini, C., et al. The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science(), vol 11778. Springer, Cham. https://doi.org/10.1007/978-3-030-30793-6_15

Download citation

DOI: https://doi.org/10.1007/978-3-030-30793-6_15
Published: 17 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30792-9
Online ISBN: 978-3-030-30793-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

the Semantic Web Science Association (opens in a new tab)

A Worst-Case Optimal Join Algorithm for SPARQL

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

SMat-J: A Sparse Matrix-Based Join for SPARQL Query Processing

Efficient Evaluation of Conjunctive Regular Path Queries Using Multi-way Joins

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

A Proof of Theorem 1

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Societies and partnerships

Subscribe and save

Buy Now

Navigation

A Worst-Case Optimal Join Algorithm for SPARQL

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

SMat-J: A Sparse Matrix-Based Join for SPARQL Query Processing

Efficient Evaluation of Conjunctive Regular Path Queries Using Multi-way Joins

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

A Proof of Theorem 1

A Proof of Theorem 1

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Societies and partnerships

Search

Navigation