Provenance-Aware Knowledge Representation: A Survey of Data Models and Contextualized Knowledge Graphs



Leslie F. Sikos¹ · Dean Philp²

Received: 1 April 2019 / Revised: 13 February 2020 / Accepted: 3 March 2020
© The Author(s) 2020

Abstract

Expressing machine-interpretable statements in the form of subject-predicate-object triples is a well-established practice for capturing the semantics of structured data. However, the standard used for representing these triples, RDF, inherently lacks a mechanism for attaching provenance data, which would be crucial to make automatically generated and/or processed data authoritative. This paper is a critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements. The various approaches are assessed in terms of standard compliance, formal semantics, tuple type, vocabulary term usage, blank nodes, provenance granularity, and scalability. This assessment can be used to advance existing solutions and to help implementers select the most suitable approach (or combination of approaches) for their applications. Moreover, the analysis of the mechanisms and their limitations highlighted in this paper can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs.

Keywords RDF provenance · Contextual knowledge graph · RDF reification alternatives · RDF data model

Leslie F. Sikos [email protected]
Dean Philp [email protected]

¹ Edith Cowan University, 270 Joondalup Drive, Joondalup, WA 6027, Australia
² Defence Science and Technology Group, Third Ave, Edinburgh, SA 5111, Australia

1 Introduction to RDF Provenance

The Resource Description Framework (RDF, https://www.w3.org/RDF/) is a Semantic Web standard for formal knowledge representation, which can be used to efficiently manipulate and interchange machine-interpretable, structured data. Its data model is particularly powerful due to its syntax and semantics; RDF allows statements to be made in the form of subject-predicate-object triples, resulting in fixed-length dataset fields that are much easier to process than variable-length fields. Formally speaking, assume pairwise disjoint infinite sets of

1. Internationalized Resource Identifiers (IRIs, I), i.e., sets of strings of Unicode characters of the form scheme:[//[user:password@]host[:port]][/]path[?query][#fragment] used to identify a resource,²
2. RDF literals (L), which can be
   a) self-denoting plain literals L_P in the form "<string>"(@<lang>)?, where <string> is a string and <lang> is an optional language tag, or
   b) typed literals L_T of the form "<string>"^^<datatype>, where <datatype> is an IRI denoting a datatype according to a schema (e.g., XML Schema), and <string> is an element of the lexical space corresponding to the datatype, and
3. blank nodes (B), i.e., unique anonymous resources that do not belong to either of the above sets.

A triple of the form (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B) is called an RDF triple, also known as an RDF statement, where s is the subject, p is the predicat