Assessing and Refining Mappings to RDF to Improve Dataset Quality
Abstract. rdf dataset quality assessment is currently performed primarily after data is published. However, there is neither a systematic way to incorporate its results into the dataset, nor the assessment into the publishing workflow. Adjustments are manually – but rarely – applied. Nevertheless, the root of the violations, which often derive from the mappings that specify how the rdf dataset will be generated, is not identified. We suggest an incremental, iterative and uniform validation workflow for rdf datasets stemming originally from (semi-)structured data (e.g., csv, xml, json). In this work, we focus on assessing and improving their mappings. We incorporate (i) a test-driven approach for assessing the mappings instead of the rdf dataset itself, as mappings reflect how the dataset will be formed when generated; and (ii) semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large, crowdsourced datasets such as dbpedia, or newly generated ones, such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of an rdf dataset in the observed cases.

Keywords: Linked data · Mapping · Data quality · rdfunit · rml · r2rml

1 Introduction
The Linked Open Data (lod) cloud1 consisted of 12 datasets in 2007, grew to almost 300 in 20112, and, by the end of 2014, counted up to 1,0003. Although more and more data is published as Linked Data (ld), the datasets' quality and consistency varies significantly, ranging from expensively curated to relatively low quality datasets [29]. In previous work [21], we observed that similar violations can occur very frequently, especially when datasets originally stem from semi-structured formats (csv, xml, etc.) and their rdf representation is obtained by repetitively applying certain mappings; in such cases, the violations are often repeated as well.

1 http://lod-cloud.net/
2 http://lod-cloud.net/state
3 http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

© Springer International Publishing Switzerland 2015
M. Arenas et al. (Eds.): ISWC 2015, Part II, LNCS 9367, pp. 133–149, 2015.
DOI: 10.1007/978-3-319-25010-6_8

Mappings semantically annotate data to acquire an enriched representation of them using the rdf data model. A mapping consists of one or more mapping definitions (mds) that state how rdf terms should be generated, taking into account a data fragment from an original data source, and how these terms are associated with each other to form rdf triples. The most frequent violations are related to the dataset's schema, namely the vocabularies or ontologies used to annotate the original data [21]. In the case of (semi-)structured data, the dataset's schema derives from the set of classes and properties specified within the mappings. A mapping might use a single ontology or vocabulary to annotate the data, or a proprietary vocabulary can be generated as the data is annotated. Lately, combinations of different ontologies and vocabularies
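The core observation above – that a dataset's schema is fixed by the classes and properties named in its mapping definitions, so schema violations can be caught by inspecting the mapping before any rdf is generated – can be sketched in a few lines. The snippet below is a deliberately simplified illustration, not the paper's actual rml/rdfunit machinery: the dict-based mapping structure and the hard-coded foaf term list are hypothetical stand-ins for a parsed mapping document and a loaded vocabulary.

```python
# Sketch: assess a mapping's schema terms against a target vocabulary
# *before* generating any rdf, so one fix in the mapping prevents the
# same violation from being repeated across every generated triple.

# Toy "vocabulary": the classes and properties considered valid.
# (A hand-picked foaf subset, hard-coded purely for illustration.)
VOCABULARY = {
    "classes": {"foaf:Person", "foaf:Document"},
    "properties": {"foaf:name", "foaf:mbox", "foaf:knows"},
}

def assess_mapping(mapping):
    """Return schema violations found in the mapping itself.

    `mapping` is a simplified stand-in for a mapping document: each
    mapping definition (md) records the class assigned to generated
    subjects and the predicates attached to them.
    """
    violations = []
    for md in mapping["definitions"]:
        if md["class"] not in VOCABULARY["classes"]:
            violations.append(f"undefined class: {md['class']}")
        for predicate in md["predicates"]:
            if predicate not in VOCABULARY["properties"]:
                violations.append(f"undefined property: {predicate}")
    return violations

# A mapping whose single definition contains a misspelled property:
mapping = {
    "definitions": [
        {"class": "foaf:Person", "predicates": ["foaf:name", "foaf:mbx"]},
    ],
}

print(assess_mapping(mapping))  # → ['undefined property: foaf:mbx']
```

The point of the sketch is the shift of target: the check runs against the (small, static) mapping rather than the (large, repeatedly regenerated) dataset, which is what makes the assessment cheap enough to sit inside the publishing workflow.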