HiDER: Query-Driven Entity Resolution for Historical Data



1 Maastricht University, Maastricht, The Netherlands [email protected]
2 Eindhoven University of Technology, Eindhoven, The Netherlands
3 Université Libre de Bruxelles, Brussels, Belgium
4 University of Liverpool, Liverpool, UK

Abstract. Entity Resolution (ER) is the task of finding references that refer to the same entity across different data sources. Cleaning a data warehouse and applying ER to it is a computationally demanding task, particularly for large data sets that change dynamically. Therefore, a query-driven approach that analyses a small subset of the entire data set and integrates the results in real time is significantly beneficial. Here, we present an interactive tool, called HiDER, which allows for query-driven ER in large collections of uncertain, dynamic historical data. The input data includes civil registers such as birth, marriage and death certificates in the form of structured data, and notarial acts such as estate tax and property transfers in the form of free text. The outputs are family networks and event timelines visualized in an integrated way. HiDER is being used and tested at the BHIC center (Brabant Historical Information Center, https://www.bhic.nl); despite the uncertainties of the BHIC input data, the extracted entities have high certainty and are enriched by extra information.

1 Introduction

In the domain of historical research, a vast amount of historical data exists. Digitization and correction of data is an everyday process in historical centers. Additionally, some projects such as Ancestry.com¹ use crowdsourcing and volunteer efforts to improve the quality of their databases of census records and civil registers. This results in many large, dynamically changing data corpora that require efficient ER. This work develops, based on the work of [1], a query-driven tool for Historical Data Entity Resolution called HiDER. HiDER has the following advantages: (a) it allows for ER across different data sources; (b) changes in the input data and in the ER algorithms can be incorporated in generating outcomes in real time; (c) by using Lucene's inverted indexing, both structured and unstructured data are handled, and fuzzy search compensates for missing data and spelling variations; and (d) graph-based ER allows for detecting and visualizing "family networks".

¹ Ancestry.com Inc., http://www.ancestry.com

© Springer International Publishing Switzerland 2015. A. Bifet et al. (Eds.): ECML PKDD 2015, Part III, LNAI 9286, pp. 281–284, 2015. DOI: 10.1007/978-3-319-23461-8_30

B. Ranjbar-Sahraei et al.
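To illustrate advantage (c), the following is a minimal sketch of how a single inverted index can serve both structured certificates and free-text notarial acts, and how an edit-distance tolerance can absorb spelling variations in names. It is not the HiDER or Lucene implementation: the record layout, the function names (tokenize, build_index, fuzzy_search) and the sample records are hypothetical, and a plain Levenshtein distance is used here as a stand-in for Lucene's fuzzy matching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def build_index(records):
    """Map every token to the set of record ids that contain it.

    Structured fields and free text are tokenized in the same way,
    so both kinds of source end up in one index.
    """
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for value in rec.values():
            for token in tokenize(str(value)):
                index[token].add(rec_id)
    return index


def fuzzy_search(index, term, max_edits=1):
    """Return ids of records holding a token within max_edits of term."""
    term = term.lower()
    hits = set()
    for token, ids in index.items():
        if edit_distance(term, token) <= max_edits:
            hits |= ids
    return hits


# Hypothetical sample data: two structured certificates, one notarial act.
records = {
    "birth-1": {"child": "Johannes Jansen", "place": "Tilburg"},
    "marriage-7": {"groom": "Johannes Janssen", "bride": "Maria Peters"},
    "notarial-3": {"text": "Sale of a farm by Pieter Willems to the widow Jansen."},
}
index = build_index(records)
matches = fuzzy_search(index, "Jansen")
print(sorted(matches))
```

In this sketch an exact lookup (max_edits=0) would miss the marriage certificate, where the name is spelled "Janssen"; allowing one edit retrieves it together with the birth certificate and the notarial act. The linear scan over all tokens is chosen for brevity only; a production search engine would use a far more efficient term dictionary.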

2 The HiDER System

The HiDER system is developed on an Apache web server equipped with the Solr search platform. HiDER works as follows: a user submits a query that consists of at least a family name, but can also contain the names of a couple, a date, a location, and relatives' names. Subsequently, HiDER searches for relevant records in the different sources and presents them in an integrated way. To do this, HiDER uses an inverted index data structure to retrieve