Arabic real time entity resolution using inverted indexing


Marwah Alian1,3 · Ghazi Al-Naymat2,3 · Banda Ramadan4

Accepted: 3 September 2020 / Published online: 7 October 2020 © Springer Nature B.V. 2020

Marwah Alian [email protected]
Ghazi Al-Naymat [email protected]
Banda Ramadan [email protected]

1 Hashemite University, Zarqa, Jordan
2 Ajman University, Ajman, United Arab Emirates
3 Princess Sumaya University for Technology, Amman, Jordan
4 Prince Sultan University, Riyadh, Saudi Arabia

Abstract Arabic datasets that contain two or more records for the same real-world entity (e.g. a person or an object) leave institutions with low data quality and degraded performance when no mechanism exists for detecting these duplicates. The operation that identifies records referring to the same real-world entity is called Entity Resolution (ER). It is used to link records across databases and to match query records against existing databases in real time. Indexing is a major step in the ER process that aims at reducing the search space. Several indexing techniques are available for ER on English databases. However, these techniques have not been validated for other languages, such as Arabic. The Dynamic Similarity Aware Inverted Index (DySimII) is one of the indexing techniques used with dynamic databases to match query records in real time, and it has been demonstrated to work well with English. In this paper, we propose a framework, Arabic Real Time Entity Resolution (ARTER), that uses DySimII with Arabic databases to perform real-time ER. We also examine different string similarity functions for comparing records in the matching process, in order to evaluate which similarity function is more suitable for comparing Arabic strings. A real-world Arabic database is used in our experimental evaluation, in which two stemmers and three similarity functions are applied to observe their effect on DySimII with an Arabic dataset. The results show that matching accuracy improves with the Asem stemmer as the number of corrupted attributes increases. Testing the three similarity functions shows that the Winkler similarity function provides better matching accuracy, while N-gram provides better results when used with the Asem stemmer.

Keywords Arabic Entity Resolution · Similarity Aware Inverted Indexes · Similarity functions · Record pair comparison
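To make the idea of index-based real-time matching concrete, the following is a minimal Python sketch, not the DySimII structure or the ARTER framework itself. It assumes a single string attribute per record, a hypothetical blocking key (the first character of the value), and a Dice coefficient over character bigrams standing in for an N-gram similarity function. The names `SimpleInvertedIndex`, `block_key`, and `dice_similarity` are illustrative only.

```python
from collections import defaultdict


def bigrams(s):
    """Return the set of character bigrams of s (the string itself if too short)."""
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def dice_similarity(a, b):
    """Dice coefficient over character bigram sets, in the range [0, 1]."""
    if a == b:
        return 1.0
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


class SimpleInvertedIndex:
    """Toy inverted index: blocking key -> attribute value -> record ids."""

    def __init__(self):
        self.blocks = defaultdict(lambda: defaultdict(set))

    @staticmethod
    def block_key(value):
        # Assumption: block on the first character. A real system would more
        # likely use a phonetic encoding or a stemmer output as the key.
        return value[:1]

    def insert(self, record_id, value):
        """Add a record id under its attribute value (supports dynamic inserts)."""
        self.blocks[self.block_key(value)][value].add(record_id)

    def query(self, value, threshold=0.5):
        """Return (record_id, similarity) pairs from the query's block, best first."""
        results = []
        candidates = self.blocks.get(self.block_key(value), {})
        for candidate, ids in candidates.items():
            sim = dice_similarity(value, candidate)
            if sim >= threshold:
                results.extend((rid, sim) for rid in ids)
        return sorted(results, key=lambda pair: (-pair[1], pair[0]))


index = SimpleInvertedIndex()
index.insert(1, "محمد")
index.insert(2, "محمود")
index.insert(3, "احمد")
matches = index.query("محمد")  # only values in the same block are compared
```

Because a query is compared only against values that share its blocking key, the search space shrinks, which is the role the abstract assigns to indexing. The choice of blocking key and similarity function is exactly what would vary between configurations, for example when a stemmer is applied before indexing.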

1 Introduction

ER is the process of identifying and matching records, within one dataset or across different datasets, that represent the same real-world entity. It is sometimes referred to as data matching or record linkage (Elmagarmid et al. 2007). The entity may represent a person, a product or an organization. Duplicate records within the database of a business or an organization will affect its outcomes. Therefore, ER is used as a technique for identifying duplicates and cleaning databases in order to enha