Assessing and Refining Mappings to RDF to Improve Dataset Quality
Abstract. rdf dataset quality assessment is currently performed primarily after data is published. However, there is neither a systematic way to incorporate its results into the dataset, nor the assessment into the publishing workflow. Adjustments are manually – but rarely – applied. Nevertheless, the root of the violations, which often derive from the mappings that specify how the rdf dataset will be generated, is not identified. We suggest an incremental, iterative and uniform validation workflow for rdf datasets stemming originally from (semi-)structured data (e.g., csv, xml, json). In this work, we focus on assessing and improving their mappings. We incorporate (i) a test-driven approach for assessing the mappings instead of the rdf dataset itself, as mappings reflect how the dataset will be formed when generated; and (ii) semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large, crowdsourced datasets such as dbpedia, or newly generated ones, such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of an rdf dataset in the observed cases.

Keywords: Linked data · Mapping · Data quality · rdfunit · rml · r2rml

1 Introduction
The Linked Open Data (lod) cloud1 consisted of 12 datasets in 2007, grew to almost 300 in 20112, and, by the end of 2014, counted up to 1,0003. Although more and more data is published as Linked Data (ld), the datasets' quality and consistency varies significantly, ranging from expensively curated to relatively low quality datasets [29]. In previous work [21], we observed that similar violations can occur very frequently, especially when datasets originally stem from semi-structured formats (csv, xml, etc.) and their rdf representation is obtained by repetitively applying certain mappings; in such cases, the violations are often repeated as well.

1 http://lod-cloud.net/
2 http://lod-cloud.net/state
3 http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

© Springer International Publishing Switzerland 2015
M. Arenas et al. (Eds.): ISWC 2015, Part II, LNCS 9367, pp. 133–149, 2015.
DOI: 10.1007/978-3-319-25010-6_8

Mappings semantically annotate data to acquire an enriched representation of them using the rdf data model. A mapping consists of one or more mapping definitions (mds) that state how rdf terms should be generated, taking into account a data fragment from an original data source, and how these terms are associated with each other to form rdf triples. The most frequent violations are related to the dataset's schema, namely the vocabularies or ontologies used to annotate the original data [21]. In the case of (semi-)structured data, the dataset's schema derives from the set of classes and properties specified within the mappings. A mapping might use a single ontology or vocabulary to annotate the data, or a proprietary vocabulary can be generated as the data is annotated. Lately, combinations of different ontologies and vocabularies
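The core observation above – that a dataset's schema is fixed by the classes and properties named in its mapping definitions, so schema violations can be caught by inspecting the mapping before any rdf is generated – can be sketched in a few lines. The snippet below is a deliberately simplified illustration, not the paper's actual rml/rdfunit machinery: the dict-based mapping structure and the hard-coded foaf term list are hypothetical stand-ins for a parsed mapping document and a loaded vocabulary.

```python
# Sketch: assess a mapping's schema terms against a target vocabulary
# *before* generating any rdf, so one fix in the mapping prevents the
# same violation from being repeated across every generated triple.

# Toy "vocabulary": the classes and properties considered valid.
# (A hand-picked foaf subset, hard-coded purely for illustration.)
VOCABULARY = {
    "classes": {"foaf:Person", "foaf:Document"},
    "properties": {"foaf:name", "foaf:mbox", "foaf:knows"},
}

def assess_mapping(mapping):
    """Return schema violations found in the mapping itself.

    `mapping` is a simplified stand-in for a mapping document: each
    mapping definition (md) records the class assigned to generated
    subjects and the predicates attached to them.
    """
    violations = []
    for md in mapping["definitions"]:
        if md["class"] not in VOCABULARY["classes"]:
            violations.append(f"undefined class: {md['class']}")
        for predicate in md["predicates"]:
            if predicate not in VOCABULARY["properties"]:
                violations.append(f"undefined property: {predicate}")
    return violations

# A mapping whose single definition contains a misspelled property:
mapping = {
    "definitions": [
        {"class": "foaf:Person", "predicates": ["foaf:name", "foaf:mbx"]},
    ],
}

print(assess_mapping(mapping))  # → ['undefined property: foaf:mbx']
```

The point of the sketch is the shift of target: the check runs against the (small, static) mapping rather than the (large, repeatedly regenerated) dataset, which is what makes the assessment cheap enough to sit inside the publishing workflow.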