Data Quality in the Semantic Web
The Semantic Web is an initiative of the World Wide Web Consortium (W3C) with the vision to evolve the traditional Web, which is essentially a graph of interlinked documents, into a “Web of Data” (Berners-Lee et al., 2001; cf. W3C, 2013). One of the major
C. Fürber, Data Quality Management with Semantic Technologies, DOI 10.1007/978-3-658-12225-6_5, © Springer Fachmedien Wiesbaden 2016
1 Data Sources of the Semantic Web
As already explained, data on the Semantic Web is mostly published according to the RDF data model (cf. Heath & Bizer, 2011; Manola & Miller, 2004; see also section 4.2.2), which represents graphs of information in the form of simple statements known as triples with the basic structure of subject, predicate, object (cf. Manola & Miller, 2004). The Semantic Web already provides billions of such triples with data about several different domains such as geography, media, health care, life sciences, linguistics, and e-commerce (cf. Bizer, Heath, et al., 2009, p. 5f.; Heath & Bizer, 2011; Mühleisen & Bizer, 2012). Figure 24 shows the well-known linking open data (LOD) cloud diagram²², which represents a large part of available data on the Semantic Web (Cyganiak & Jentzsch, 2011a).
Figure 24: Linking Open Data (LOD) cloud diagram²² (Cyganiak & Jentzsch, 2011a)
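The triple structure described above (subject, predicate, object) can be illustrated with a minimal sketch that models statements as plain tuples. The names `Triple`, `graph`, and `objects_of`, the `example.org` URIs, and the values are illustrative assumptions and are not taken from any data set in the LOD cloud.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str  # named "obj" to avoid shadowing Python's built-in "object"


# Hypothetical example data; these URIs and values are placeholders only.
EX = "http://example.org/"
graph = {
    Triple(EX + "Berlin", EX + "locatedIn", EX + "Germany"),
    Triple(EX + "Berlin", EX + "label", "Berlin"),
    Triple(EX + "Germany", EX + "capital", EX + "Berlin"),
}


def objects_of(graph, subject, predicate):
    """Return the objects of all statements with the given subject and predicate."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}
```

Because the object of one statement can be the subject of another, a set of such triples forms a graph. A call such as `objects_of(graph, EX + "Berlin", EX + "locatedIn")` follows one edge of that graph.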
The number of triples in the LOD cloud was estimated at around 31 billion in September 2011 (Cyganiak & Jentzsch, 2011b). But the LOD cloud only represents part of the Semantic Web, since the latest available version of the diagram was created on September 19th 2011, and data sources have to meet certain criteria to be included in the diagram. For instance, a data source must contain at least 1000 triples and have at least 50 RDF links to other data sets in the diagram (cf. Cyganiak & Jentzsch, 2011a). Hence, a large amount of data that is not linked to data sets in the LOD cloud is not part of the diagram and its statistics. For example, a lot of product data published via the GoodRelations ontology²³, a popular vocabulary for publishing e-commerce data (Hepp, 2008a), lacks explicit links to the LOD cloud and is, therefore, not visible in the diagram despite its significance for the practical application of the Semantic Web.
In addition to the intended usage of data published in the LOD cloud, like intelligent information processing (cf. Bizer, Lehmann, et al., 2009) or entity recognition in natural language processing (cf. Kobilarov, Scott, et al., 2009, p. 732; Reuters, 2013), the data can also be a relevant source for data quality management. Several data quality management heuristics use reference data sets to identify data quality problems (cf. Apel et al., 2010, p. 74; English, 1999, p. 166; Loshin, 2001, p. 161). In Fürber & Hepp (2010a), we have shown that Semantic Web data can be particularly useful for identifying illegal values or functional dependencies between attribute values in the geographic domain with minimal effort. To prove its practical usefulness for DQM, we performed a data quality analysis of real address data from BestBuy stores, a popular North A
22 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ (Last accessed on April 2nd 2012)
23 http://purl.org/goodrelations (Last accessed on April 12th 2012)
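The general idea of using reference data to find illegal values and invalid value combinations, as described above, can be sketched as follows. This is a simplified illustration and not the method of Fürber & Hepp (2010a). The reference pairs, the record fields `city` and `state`, and the function `check_address` are assumptions made for this example, and the reference set merely stands in for values that could be obtained from a trusted Semantic Web source.

```python
# Hypothetical reference data: valid (city, state) combinations.
# These are placeholder values, not results from a real data source.
REFERENCE_CITY_STATE = {
    ("Springfield", "IL"),
    ("Springfield", "MA"),
    ("Austin", "TX"),
}


def check_address(record, reference=REFERENCE_CITY_STATE):
    """Return a list of data quality problems found in one address record."""
    known_cities = {city for city, _ in reference}
    problems = []
    if record["city"] not in known_cities:
        # The city does not occur in the reference data at all.
        problems.append("illegal city value")
    elif (record["city"], record["state"]) not in reference:
        # The city is known, but not in combination with this state.
        problems.append("city/state combination not in reference data")
    return problems
```

For instance, `check_address({"city": "Austin", "state": "MA"})` reports an invalid combination, while a misspelled city such as `"Austn"` is reported as an illegal value. The quality of such checks depends on the completeness and correctness of the reference data itself.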