Name identification and extraction with formal concept analysis



ORIGINAL ARTICLE

Kazem Taghva

Received: 31 August 2015 / Accepted: 16 February 2016 / © Springer-Verlag Berlin Heidelberg 2016

Abstract One of the applications of formal concept analysis (FCA) is the extraction of structured information from textual documents. Typically, one defines a set of attributes that characterize the objects of interest; the objects so defined are then extracted by standard FCA algorithms. In this paper, we describe how FCA identifies and extracts personal names as units of thought, similar to the decoding of text sequences by the Viterbi algorithm as used with Hidden Markov Models. We further show how FCA mimics the thought process that goes into a rule-based information extraction system. We then observe that the formal approach of FCA, combined with established computational techniques such as the bottom-up intersection algorithm, avoids the difficulties associated with hand coding and maintaining rule-based systems.

Keywords Data mining · Information extraction · Big data · Entity extraction · Data science · Hidden Markov models · Learning algorithms
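To make the idea of attributes characterizing objects concrete, the following minimal Python sketch builds a toy formal context and enumerates its formal concepts by closing the object intents under intersection. It is an illustration only, not the implementation used in this paper: the tokens, the attribute names (`capitalized`, `title`, `first_name`, `last_name`), and the helper functions are all hypothetical, and the intersection-based enumeration is a simple textbook-style approach that may differ from the bottom-up intersection algorithm referred to above.

```python
from itertools import chain

# Hypothetical formal context: tokens are the objects, and each token is
# mapped to the set of (assumed, illustrative) attributes it has.
context = {
    "Dr.":    {"capitalized", "title"},
    "John":   {"capitalized", "first_name"},
    "Smith":  {"capitalized", "last_name"},
    "report": set(),
}

def all_intents(context):
    """Return every concept intent by closing object intents under intersection."""
    attributes = frozenset(chain.from_iterable(context.values()))
    intents = {attributes}  # intent of the empty set of objects
    for obj_attrs in context.values():
        row = frozenset(obj_attrs)
        # Intersect the new object's attributes with every intent found so far.
        intents |= {row & intent for intent in intents}
    return intents

def extent(context, intent):
    """Return all objects that have every attribute in the given intent."""
    return frozenset(o for o, attrs in context.items() if intent <= attrs)

def concepts(context):
    """Return the formal concepts as (extent, intent) pairs."""
    return {(extent(context, i), i) for i in all_intents(context)}

if __name__ == "__main__":
    for ext, intent in sorted(concepts(context), key=lambda c: len(c[1])):
        print(sorted(ext), sorted(intent))
```

In this toy context, the concept whose intent is just `capitalized` groups the three capitalized tokens, while adding `first_name` narrows the extent to a single token. Real attribute sets for name extraction would need to be chosen with far more care.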

1 Introduction

The identification of names in documents is typically part of a bigger task. For example, in our setting, we were involved in the development of two classifiers to detect privacy-related and sensitive unclassified information in a large digitized collection for the U.S. Department of Energy (DOE) [10]. More specifically, in the first classifier, we were looking for Personally Identifiable Information (PII), which refers to any information that identifies or can be used to identify, contact, or locate the person to whom such information pertains. This includes information that is used in a way that is personally identifiable, including linking it with identifiable information from other sources, or from which other personally identifiable information can easily be derived, including, but not limited to, name, address, phone number, fax number, email address, financial profiles, social security number, and credit card information.

Our second classifier deals with the detection and redaction of sensitive unclassified information. Sensitive Unclassified (SU) information is defined as any unclassified information that may cause adverse consequences for government facilities. The techniques used in the development of these two classifiers are explained in [19, 22].

There are many other applications similar to ours, such as MUC [5] or the 2003 Los Alamos National Lab (LANL) project on Advanced Knowledge Integration In Assessing Terrorist Threats [16], whose task is the identification of individuals involved in terrorist activities. Almost all of these applications deal with the extraction of certain entities such as personal name, date of birth, place of event, acronyms [20], time, and/or money. In this paper, we concentrate solely on personal name extraction. Detecting names in general is difficult be