Ontology-Based Integration of Cross-Linked Datasets
1 Free University of Bozen-Bolzano, Bolzano, Italy [email protected]
2 University of Oslo, Oslo, Norway
Abstract. In this paper we tackle the problem of answering SPARQL queries over virtually integrated databases. We assume that the entity resolution problem has already been solved and explicit information is available about which records in the different databases refer to the same real-world entity. Surprisingly, to the best of our knowledge, there has been no attempt to extend the standard Ontology-Based Data Access (OBDA) setting to take into account these DB links for SPARQL query answering and consistency checking. This is partly because the OWL built-in owl:sameAs property, the most natural representation of links between data sets, is not included in OWL 2 QL, the de facto ontology language for OBDA. We formally treat several fundamental questions in this context: how links over database identifiers can be represented in terms of owl:sameAs statements, how to recover rewritability of SPARQL into SQL (lost because of owl:sameAs statements), and how to check consistency. Moreover, we investigate how our solution can be made to scale up to large enterprise datasets. We have implemented the approach and carried out an extensive set of experiments showing its scalability.
1 Introduction
Since the mid-2000s, Ontology-Based Data Access (OBDA) [9,14,15] has become a popular approach for virtual data integration [6]. In (virtual) OBDA, a conceptual layer is given in the form of (the intensional part of) an ontology (usually in OWL 2 QL) that defines a shared vocabulary, models the domain, hides the structure of the data sources, and can enrich incomplete data with background knowledge. The ontology is connected to the data sources through a declarative specification given in terms of mappings [4] that relate symbols in the ontology (classes and properties) to (SQL) views over data. The ontology and mappings together expose a virtual RDF graph, which can be queried using SPARQL queries that are then translated into SQL queries over the data sources. In this setting, users no longer need an understanding of the data sources, the relation between them, or the encoding of the data. One aspect of OBDA for data integration is less well studied, however: in many cases, complementary information about the same entity is distributed over several data sources, and this entity is represented using different identifiers. The first important issue that comes up is that of entity resolution, which requires determining which records actually represent the same real-world entity. We do not deal with this problem here, and assume that this information is already available. Traditional relational data integration techniques use extract, transform, load (ETL) processes to address this problem [6]. These techniques usually choose
D. Calvanese et al. © Springer International Publishing Switzerland 2015. M. Arenas et al. (Eds.): ISWC 2015, Part I, LNCS 9366, pp. 199–216, 2015. DOI: 10.1007/978-3-319-25007-6_12
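As a rough illustration of the setting described above, the following Python sketch shows two sources that hold complementary facts about one entity under different identifiers, toy "mappings" that turn rows into RDF-like triples, and owl:sameAs links used to merge the facts. All data, IRIs, property names, and function names are invented for illustration. The sketch also materializes triples in memory, which is a simplification: in the virtual OBDA setting, SPARQL queries are rewritten into SQL over the sources instead.

```python
# Hypothetical rows from two independent sources describing the same entity
# under different identifiers (all names and values are invented).
SOURCE_A = [{"id": "a1", "name": "Well-7"}]
SOURCE_B = [{"wid": "b9", "depth_m": 3120}]

# Links from an entity-resolution step that is assumed to be done already.
LINKS = [("a1", "b9")]


def iri(prefix, key):
    """Build an illustrative IRI for a record of the given source."""
    return f"http://example.org/{prefix}/{key}"


def mapped_triples():
    """Toy 'mappings': each source row yields RDF-like triples."""
    for row in SOURCE_A:
        yield (iri("A", row["id"]), ":name", row["name"])
    for row in SOURCE_B:
        yield (iri("B", row["wid"]), ":depth", row["depth_m"])


def same_as_triples():
    """Express the database links as owl:sameAs statements."""
    for a, b in LINKS:
        yield (iri("A", a), "owl:sameAs", iri("B", b))


def canonical_map(links):
    """Union-find over owl:sameAs pairs; the smallest IRI represents each class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, _, o in links:
        rs, ro = find(s), find(o)
        if rs != ro:
            lo, hi = sorted((rs, ro))
            parent[hi] = lo  # attach the larger root under the smaller one
    return find


def merged_view():
    """Group all mapped facts under one representative per linked entity."""
    find = canonical_map(list(same_as_triples()))
    view = {}
    for s, p, o in mapped_triples():
        view.setdefault(find(s), {})[p] = o
    return view


if __name__ == "__main__":
    print(merged_view())
```

Without the link, the name and the depth would be attached to two unrelated subjects; with it, a query for both properties of one entity can be answered from the merged view.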