Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model


1 Computer and Information Science Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
{ganesh_17005106,shuib_basri,abdullahi_g03618,abdullateef_16005851}@utp.edu.my
2 Ahmadu Bello University, Zaria, Nigeria
3 Department of Computer Science, University of Ilorin, Ilorin 1515, Nigeria

Abstract. Data comes from machines, transactions, and social media, and it is both massive and disparate in nature. About 80% of today's data is unstructured, while the remainder is semistructured or structured. It is a major challenge for management to make efficient decisions at run time and to store data of such a heterogeneous nature with existing tools. Data harmonization can be used to solve the heterogeneity problem: its aim is to provide a uniform representation and to remove all forms of heterogeneity from heterogeneous datasets. In recent studies, various models have been developed for integrating, mapping, and fusing structured and semistructured datasets, but no such model has been developed that covers structured, semistructured, and unstructured datasets together. Information extraction is a vital component for extracting data from textual datasets that may come in different file formats, such as Excel, JSON, and plain text. To develop a textual data harmonization model for heterogeneous datasets comprising structured, semistructured, and unstructured data, based on phrase similarity techniques, the data must first be preprocessed using Natural Language Processing techniques such as Bag of Phrases and Parts of Speech tagging. This paper therefore focuses on a conceptual data harmonization model based on a text similarity technique, which will help to blend structured, semistructured, and unstructured data. The phrases selected from the heterogeneous datasets will go through training and testing using a Recurrent Neural Network.

Keywords: Data harmonization · Text similarity · Heterogeneous dataset
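The pipeline outlined in the abstract (extract text from different file formats, build a bag of phrases, compare by similarity) can be illustrated with a minimal sketch. This is not the authors' model: the function names are hypothetical, CSV stands in for Excel so that only the Python standard library is needed, word bigrams stand in for phrase extraction, and cosine similarity is assumed as the similarity measure because the text does not name one. The Recurrent Neural Network stage is not shown.

```python
import csv
import io
import json
import math
import re
from collections import Counter


def flatten_json(value):
    """Yield every scalar value in a nested JSON structure as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from flatten_json(item)
    elif isinstance(value, list):
        for item in value:
            yield from flatten_json(item)
    elif value is not None:
        yield str(value)


def to_text(source, kind):
    """Reduce one source to plain text; kind is 'csv', 'json', or 'text'."""
    if kind == "csv":  # structured stand-in for an Excel sheet (assumption)
        rows = csv.reader(io.StringIO(source))
        return " ".join(cell for row in rows for cell in row)
    if kind == "json":  # semistructured
        return " ".join(flatten_json(json.loads(source)))
    return source  # unstructured text is used as-is


def bag_of_phrases(text, n=2):
    """Count lowercase word n-grams, a simple proxy for a bag of phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two phrase-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[phrase] for phrase, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the paper.
    sources = {
        "structured": ("patient,diagnosis\nA12,high blood pressure\n", "csv"),
        "semistructured": ('{"id": "A12", "notes": {"finding": "high blood pressure"}}', "json"),
        "unstructured": ("The patient reported high blood pressure last week.", "text"),
    }
    bags = {name: bag_of_phrases(to_text(src, kind)) for name, (src, kind) in sources.items()}
    reference = bags["structured"]
    for name, bag in bags.items():
        print(name, round(cosine_similarity(reference, bag), 3))
```

Once every source is reduced to the same phrase-count representation, records from different formats can be compared directly, which is the sense in which the representation is uniform.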

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020. R. Silhavy et al. (Eds.): CoMeSySo 2020, AISC 1294, pp. 723–734, 2020. https://doi.org/10.1007/978-3-030-63322-6_61

1 Introduction

Big Data describes the complex and dynamic growth of data. Researchers represent Big Data as a concept with structural and functional dimensions. The structural dimension comprises the elements variety, volume, verification, velocity, veracity, and value. The large heterogeneous datasets are managed by the functional dimensions [1]. Variety is the result of the growth of virtually unlimited heterogeneous datasets, in which data takes various forms: structured, semistructured, and unstructured (SSU). Structured data is stored in a predefined model of rows and columns and makes up 5–10% of all data; examples are RDBMS and Microsoft Excel [2]. Unstructured data cannot be organized in a predefined model; examples are audio, video, image, and text, whereas se
