Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce



Abstract In the Semantic Web community, many approaches have evolved for generating RDF (Resource Description Framework) resources. However, they often capture duplicate resources, which are stored without elimination. As a consequence, duplicate resources reduce data quality and unnecessarily increase the size of the dataset. We propose an approach for detecting duplicate resources in RDF datasets using the Hadoop and MapReduce framework. RDF resources are compared using similarity metrics defined at the resource level, the RDF statement level, and the object level. The performance is assessed with standard evaluation metrics, and the experimental evaluation shows the accuracy, effectiveness, and efficiency of the proposed approach.

Keywords Duplicate data · Semantic Web · RDF · Hadoop · MapReduce
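As a rough illustration of the statement-level comparison, the sketch below computes a Jaccard-style overlap between the sets of (predicate, object) pairs describing two resources. This is only an assumed shape for such a metric, not the paper's exact definition; the class and method names and the FOAF example values are hypothetical.

import java.util.HashSet;
import java.util.Set;

public class StatementSimilarity {

    // Jaccard-style similarity between the statement sets of two resources:
    // |A ∩ B| / |A ∪ B| over serialized (predicate, object) pairs.
    public static double jaccard(Set<String> statementsA, Set<String> statementsB) {
        if (statementsA.isEmpty() && statementsB.isEmpty()) {
            return 1.0; // two empty descriptions are trivially identical
        }
        Set<String> intersection = new HashSet<>(statementsA);
        intersection.retainAll(statementsB);
        Set<String> union = new HashSet<>(statementsA);
        union.addAll(statementsB);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("foaf:name \"Alice\"", "foaf:age \"30\"");
        Set<String> b = Set.of("foaf:name \"Alice\"", "foaf:mbox <mailto:[email protected]>");
        System.out.printf("similarity = %.2f%n", jaccard(a, b)); // prints 0.33
    }
}

A resource-level score would typically aggregate such statement-level (and object-level) scores and flag a pair of resources as duplicates when the aggregate exceeds a chosen threshold.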

1 Introduction

Duplication is one of the most common factors affecting data quality. Duplicate data occupies more space than needed and consumes more time during access. It also leads to irrelevant observations, since users often depend on the stored information for drawing conclusions and discovering new insights. In short, duplicate data reduces data quality and raises interoperability-related issues.


In order to preserve data quality and reduce storage requirements, it is necessary to detect duplicate information and to decide whether a given dataset maintains its data quality. Duplicate data detection [1] in large datasets is a difficult task, as it requires a distinct pairwise comparison of the resources in order to determine the degree of duplication. For large datasets the number of comparisons, and hence the running time, grows rapidly with the data size. To detect duplicate information in large datasets within a short period of time, we need a system that can perform multiple jobs in parallel. Distributed systems provide this facility, and one such system is the Hadoop and MapReduce framework [2–4]. Given an RDF dataset, the proposed approach splits the dataset into multiple files based on resource type; each split file is then supplied to the resource comparison job, which compares the individual resources in parallel.

This paper is organized as follows: Sect. 2 presents the related work and Sect. 3 discusses the background information. Section 4 presents the proposed approach, Sect. 5 shows the experimental results, and finally Sect. 6 concludes the paper.
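The type-based splitting step described above can be pictured as a simple Hadoop map task that keys every rdf:type statement by its type value, so that a subsequent reduce (or MultipleOutputs) step can write one split per resource type. This is a minimal sketch under the assumption of N-Triples input with one statement per line; it is not the authors' implementation, and the class name TypeSplitMapper is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TypeSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String RDF_TYPE =
            "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Naive N-Triples split: subject, predicate, remainder ("object .").
        String[] parts = line.toString().trim().split("\\s+", 3);
        if (parts.length < 3) {
            return; // skip blank or malformed lines
        }
        if (RDF_TYPE.equals(parts[1])) {
            // Emit (type, subject) so all resources of one type land in one group.
            String type = parts[2].replaceAll("\\s*\\.\\s*$", "");
            outKey.set(type);
            outValue.set(parts[0]);
            context.write(outKey, outValue);
        }
    }
}

Grouping resources by type in this way restricts the pairwise comparisons to resources that share a type, and the resulting groups can then be processed by the comparison job in parallel.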

2 Related Work

Duplicate detection is the process of determining resources having similar identities. This problem is a