Ontology-Based Integration of Cross-Linked Datasets

1 Free University of Bozen-Bolzano, Bolzano, Italy
2 University of Oslo, Oslo, Norway

Abstract. In this paper we tackle the problem of answering SPARQL queries over virtually integrated databases. We assume that the entity resolution problem has already been solved and explicit information is available about which records in the different databases refer to the same real-world entity. Surprisingly, to the best of our knowledge, there has been no attempt to extend the standard Ontology-Based Data Access (OBDA) setting to take into account these DB links for SPARQL query answering and consistency checking. This is partly because the OWL built-in owl:sameAs property, the most natural representation of links between data sets, is not included in OWL 2 QL, the de facto ontology language for OBDA. We formally treat several fundamental questions in this context: how links over database identifiers can be represented in terms of owl:sameAs statements, how to recover rewritability of SPARQL into SQL (lost because of owl:sameAs statements), and how to check consistency. Moreover, we investigate how our solution can be made to scale up to large enterprise datasets. We have implemented the approach, and carried out an extensive set of experiments showing its scalability.

D. Calvanese et al.
© Springer International Publishing Switzerland 2015
M. Arenas et al. (Eds.): ISWC 2015, Part I, LNCS 9366, pp. 199–216, 2015.
DOI: 10.1007/978-3-319-25007-6_12

1 Introduction

Since the mid 2000s, Ontology-Based Data Access (OBDA) [9,14,15] has become a popular approach for virtual data integration [6]. In (virtual) OBDA, a conceptual layer is given in the form of (the intensional part of) an ontology (usually in OWL 2 QL) that defines a shared vocabulary, models the domain, hides the structure of the data sources, and can enrich incomplete data with background knowledge. The ontology is connected to the data sources through a declarative specification given in terms of mappings [4] that relate symbols in the ontology (classes and properties) to (SQL) views over the data. The ontology and mappings together expose a virtual RDF graph, which can be queried using SPARQL queries that are then translated into SQL queries over the data sources. In this setting, users no longer need an understanding of the data sources, the relations between them, or the encoding of the data. One aspect of OBDA for data integration is less well studied, however: in many cases, complementary information about the same entity is distributed over several data sources, and this entity is represented using different identifiers. The first important issue that comes up is that of entity resolution, which requires understanding which records actually represent the same real-world entity. We do not deal with this problem here, and assume that this information is already available. Traditional relational data integration techniques use extract, transform, load (ETL) processes to address this problem [6]. These techniques usually choose
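To make the setting described in this introduction concrete, the following minimal Python sketch shows how mappings can expose relational data as RDF triples, and how a linking table over database identifiers can be exposed as owl:sameAs statements. Everything in it is an illustrative assumption and not taken from the paper: the table names (`src1_well`, `src2_well`, `link`), the `ex:` IRI templates, the `:Well`, `:name` and `:depth` vocabulary, and the helper `virtual_graph`. The sketch also materializes the triples for readability, whereas an OBDA system keeps the graph virtual and translates SPARQL into SQL.

```python
import sqlite3

# Hypothetical sources: two tables describe wells under different identifiers,
# and a linking table (assumed output of entity resolution) relates them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src1_well (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE src2_well (code TEXT PRIMARY KEY, depth_m REAL);
    CREATE TABLE link (id1 INTEGER, code2 TEXT);
    INSERT INTO src1_well VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO src2_well VALUES ('w-9', 2150.0);
    INSERT INTO link VALUES (1, 'w-9');
""")

# Each mapping pairs a triple template (subject, predicate, object) with a
# SQL query; {0}, {1} are filled from the columns of each result row.
MAPPINGS = [
    ("ex:s1/well/{0}", "rdf:type", ":Well", "SELECT id FROM src1_well"),
    ("ex:s1/well/{0}", ":name", '"{1}"', "SELECT id, name FROM src1_well"),
    ("ex:s2/well/{0}", ":depth", '"{1}"', "SELECT code, depth_m FROM src2_well"),
    # Links over database identifiers, exposed as owl:sameAs statements.
    ("ex:s1/well/{0}", "owl:sameAs", "ex:s2/well/{1}", "SELECT id1, code2 FROM link"),
]


def virtual_graph(connection):
    """Instantiate every mapping template against its SQL query result."""
    triples = set()
    for subj, pred, obj, sql in MAPPINGS:
        for row in connection.execute(sql):
            triples.add((subj.format(*row), pred, obj.format(*row)))
    return triples


graph = virtual_graph(conn)
for triple in sorted(graph):
    print(triple)
```

In this toy graph, the name of a well is attached to the identifier from the first source and its depth to the identifier from the second one; only the owl:sameAs triple connects the two, which is why query answering has to take such links into account.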
