In-memory parallelization of join queries over large ontological hierarchies



Dimitris Bilidas¹ · Manolis Koubarakis¹

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

The Resource Description Framework (RDF) data model enables the construction of knowledge graphs over various domains. It uses ontologies to encode information about the domain and simple statements in the form of subject-predicate-object triples to represent data, which facilitates the interlinking and exchange of Web data. However, this simplicity comes at the cost of executing a large number of joins to obtain the desired query results, and large ontological hierarchies complicate query answering even further for systems that provide complete answers with respect to such ontological axioms. In this work we present PARJ, an in-memory RDF store that takes ontological hierarchies into account during join processing with very low performance overhead, avoids expensive preprocessing and materialization of implications, and is amenable to straightforward parallelization. Specifically, we present a join implementation that can achieve any desired degree of parallelism on arbitrary join queries and RDF graphs stored in memory using compact vertical partitioning. We use an adaptive join processing approach that takes advantage of complete or even partial ordering of RDF data, which is stored compactly in order to increase spatial locality and keep memory consumption low, coupled with an ID-to-Position vector index that is used when the ordering does not allow efficient scanning of the input relation. Finally, we experimentally show the efficiency and scalability of our proposal.

Keywords RDF · SPARQL · OWL · Join processing

* Dimitris Bilidas [email protected]
Manolis Koubarakis [email protected]

1 National and Kapodistrian University of Athens, Athens, Greece




Distributed and Parallel Databases

1 Introduction

The Resource Description Framework (RDF)¹ is a data model recommended by the W3C for semantic data integration, sharing and linking across different organizations and applications on the Web. RDF provides flexible modeling of data coming from heterogeneous domains in the form of triples that form subject-predicate-object statements, facilitating the construction of knowledge graphs. Every component of such a triple is either a resource uniquely identified by an IRI or a data value in the form of a literal. Literals can only appear in the object position. A set of such statements can be viewed as an RDF graph in which subjects and objects are nodes and, for each statement, an arc labeled with the property name connects the corresponding subject and object. Several organizations publish data in the RDF model, which leads to the interlinking of information from different sources and enables automatic processing by software agents. As a result, as of 2019 the Linked Open Data (LOD) cloud [49] contains more than 1200 datasets a