Data Fusion: Resolving Conflicts from Multiple Sources


X.L. Dong
Google Inc., 1600 Amphitheater Pkwy, Mountain View, CA 94043, USA

L. Berti-Equille
IRD - Institut de Recherche pour le Développement, UMR 228 ESPACE-DEV, Maison de la Télédétection, 500 rue Jean-François Breton, 34093 MONTPELLIER Cedex 05, FRANCE

D. Srivastava
AT&T Labs-Research, 180 Park Ave., Florham Park, NJ 07932, USA

S. Sadiq (ed.), Handbook of Data Quality, DOI 10.1007/978-3-642-36257-6_13, © Springer-Verlag Berlin Heidelberg 2013

Abstract

Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the largest number of sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this chapter, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We describe a novel approach that considers copying between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are unlikely to be provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide copying between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also consider accuracy of data sources and similarity between values in fusion to further improve the results. We present a case study on real-world data showing that the described algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
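The voting baseline mentioned in the abstract, and the way copying undermines it, can be illustrated with a minimal sketch. The function name `majority_vote`, the source names, and the values below are hypothetical and are not taken from the chapter. The sketch shows only the naive count; the Bayesian copy detection that the chapter describes is not implemented here.

```python
from collections import Counter


def majority_vote(claims):
    """Return the value provided by the largest number of sources.

    claims: dict mapping a (hypothetical) source name to the value
    that source provides for a single data item.
    """
    counts = Counter(claims.values())
    value, _ = counts.most_common(1)[0]
    return value


# Three independent sources: the true value has the most support.
independent = {"S1": "v_true", "S2": "v_true", "S3": "v_false"}
print(majority_vote(independent))

# Assumption for illustration: S4 and S5 copy the false value from S3.
# Counting each copier as an independent vote now elects the false value.
with_copiers = {**independent, "S4": "v_false", "S5": "v_false"}
print(majority_vote(with_copiers))
```

In the second case the false value wins only because copied votes are counted as if they were independent, which is the failure mode that motivates detecting copying before voting.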

1 Introduction

The amount of useful information available on the Web has been growing at a dramatic pace in recent years. In a variety of domains, such as science, business, technology, arts, entertainment, politics, government, sports, and tourism, a huge number of data sources seek to provide information to a wide spectrum of users. Besides making useful information available, the Web has also made it easier to publish false information and spread it across multiple sources. For example, an obituary of Apple founder Steve Jobs was published and sent to thousands of corporate clients on August 28, 2008, before it was retracted.¹ Such false information can often result in considerable damage; for example, the recent