A survey on data fusion: what for? in what form? what is next?


Gabrielle Karine Canalle¹ · Ana Carolina Salgado¹ · Bernadette Farias Loscio¹

Received: 20 November 2019 / Revised: 15 October 2020 / Accepted: 15 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Data fusion is the process of merging records from multiple sources that represent the same real-world object into a single representation. This review of the literature concerns Data Fusion in the context of data integration, i.e., the integration of structured and semistructured data from the same domain, and provides an overview of this field of research. We present why data fusion is becoming increasingly necessary and what it is used for (What for?), what methods and solutions for data fusion have been proposed in the literature (In what form?), and what research challenges remain open in the data fusion area and what directions future research could usefully take (What is next?).

Keywords Data integration · Data fusion · Truth discovery

¹ Federal University of Pernambuco, Recife, Brazil

Journal of Intelligent Information Systems

1 Introduction

The Big Data era has produced petabytes of data together with several challenges, including that of identifying and fusing data that represent the same real-world object. In general, gathering a very large amount of data leads to a substantial volume of contradictory and redundant data. In this scenario, data that describe the same object can come from multiple sources and may contain conflicting information. For example, a Google search on “What is the population of the city of Recife – Brazil?” returns different results, viz. “1,625,583 population”, “1,633,697 population” and “1,537,704 population”. Due to incomplete, erroneous, and out-of-date data, data from different sources of the same domain may conflict with each other (i.e., different values of the same attribute of an entity). The main reasons for this are an increase in the volume of conflicting data published on the Web and the fact that people are using the Web to spread false information. The concepts of data quality and trustworthiness have become more important than ever. Thus, Data Fusion, the focus of this survey, has become an important research topic that aims to detect and resolve data conflicts among multiple sources. Nowadays, most Data Fusion approaches aim to resolve conflicts based on the trustworthiness of the sources that provide the data. The guiding notion in these approaches is that the more reliable a data source is, the more accurate the data it provides will be. Data fusion is also applied in different fields, always with the same main purpose: to provide a unified view of data, thereby resolving conflicts and finding truth values. Examples of such applications include sensor data fusion (Sethi and S