Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce



Abstract In the Semantic Web community, many approaches have evolved for generating RDF (Resource Description Framework) resources. However, they often capture duplicate resources, which are stored without elimination. As a consequence, duplicate resources reduce data quality and unnecessarily increase the size of the dataset. We propose an approach for detecting duplicate resources in RDF datasets using the Hadoop and MapReduce framework. RDF resources are compared using similarity metrics defined at the resource level, the RDF statement level, and the object level. The performance is assessed with standard evaluation metrics, and the experimental evaluation shows the accuracy, effectiveness, and efficiency of the proposed approach.

Keywords Duplicate data · Semantic Web · RDF · Hadoop · MapReduce
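As a rough illustration of the statement-level comparison, the sketch below computes a Jaccard-style overlap between the sets of (predicate, object) pairs describing two resources. This is only an assumed shape for such a metric, not the paper's exact definition; the class and method names and the FOAF example values are hypothetical.

import java.util.HashSet;
import java.util.Set;

public class StatementSimilarity {

    // Jaccard-style similarity between the statement sets of two resources:
    // |A ∩ B| / |A ∪ B| over serialized (predicate, object) pairs.
    public static double jaccard(Set<String> statementsA, Set<String> statementsB) {
        if (statementsA.isEmpty() && statementsB.isEmpty()) {
            return 1.0; // two empty descriptions are trivially identical
        }
        Set<String> intersection = new HashSet<>(statementsA);
        intersection.retainAll(statementsB);
        Set<String> union = new HashSet<>(statementsA);
        union.addAll(statementsB);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("foaf:name \"Alice\"", "foaf:age \"30\"");
        Set<String> b = Set.of("foaf:name \"Alice\"", "foaf:mbox <mailto:[email protected]>");
        System.out.printf("similarity = %.2f%n", jaccard(a, b)); // prints 0.33
    }
}

A resource-level score would typically aggregate such statement-level (and object-level) scores and flag a pair of resources as duplicates when the aggregate exceeds a chosen threshold.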

1 Introduction

Duplication is one of the most common factors affecting data quality. Duplicate data occupies more space than needed and consumes more time during access. It also leads to irrelevant observations, since users often depend on the stored information for drawing conclusions and discovering new insights. In short, duplicate data reduces data quality and raises interoperability-related issues.


In order to preserve data quality and reduce storage requirements, it is necessary to detect duplicate information and to decide whether a given dataset maintains its data quality. Duplicate data detection [1] in large datasets is a difficult task, as it requires a distinct pairwise comparison of the resources in order to determine the degree of duplication. For large datasets the number of comparisons, and hence the running time, grows rapidly with the data size. To detect duplicate information in large datasets within a short period of time, we need a system that can perform multiple jobs in parallel. Distributed systems provide this facility, and one such system is the Hadoop and MapReduce framework [2–4]. Given an RDF dataset, the proposed approach splits the dataset into multiple files based on resource type; each split file is then supplied to the resource comparison job, which compares the individual resources in parallel.

This paper is organized as follows: Sect. 2 presents the related work and Sect. 3 discusses the background information. Section 4 presents the proposed approach, Sect. 5 shows the experimental results, and finally Sect. 6 concludes the paper.
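The type-based splitting step described above can be pictured as a simple Hadoop map task that keys every rdf:type statement by its type value, so that a subsequent reduce (or MultipleOutputs) step can write one split per resource type. This is a minimal sketch under the assumption of N-Triples input with one statement per line; it is not the authors' implementation, and the class name TypeSplitMapper is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TypeSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String RDF_TYPE =
            "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Naive N-Triples split: subject, predicate, remainder ("object .").
        String[] parts = line.toString().trim().split("\\s+", 3);
        if (parts.length < 3) {
            return; // skip blank or malformed lines
        }
        if (RDF_TYPE.equals(parts[1])) {
            // Emit (type, subject) so all resources of one type land in one group.
            String type = parts[2].replaceAll("\\s*\\.\\s*$", "");
            outKey.set(type);
            outValue.set(parts[0]);
            context.write(outKey, outValue);
        }
    }
}

Grouping resources by type in this way restricts the pairwise comparisons to resources that share a type, and the resulting groups can then be processed by the comparison job in parallel.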

2 Related Work

Duplicate detection is the process of determining resources having similar identities. This problem is a