Name identification and extraction with formal concept analysis


ORIGINAL ARTICLE

Kazem Taghva

Received: 31 August 2015 / Accepted: 16 February 2016
© Springer-Verlag Berlin Heidelberg 2016

Abstract One of the applications of formal concept analysis (FCA) is the extraction of structured information from textual documents. Typically, one defines a set of attributes that characterize the objects of interest, and standard FCA algorithms then extract the objects so defined. In this paper, we describe how FCA identifies and extracts personal names as units of thought, similar to the decoding of text sequences by the Viterbi algorithm as used with Hidden Markov Models. We further exhibit how FCA mimics the thought process that goes into a rule-based information extraction system. We then observe that the formal approach of FCA, combined with established computational techniques such as the bottom-up intersection algorithm, avoids the difficulties associated with hand coding and maintaining rule-based systems.

Keywords Data mining · Information extraction · Big data · Entity extraction · Data science · Hidden Markov models · Learning algorithms
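To make the idea of attributes characterizing objects concrete, the following is a minimal sketch of a formal context and of computing its concepts by closing the object intents under intersection. The tokens, the attribute names (`capitalized`, `in_first_name_list`, `follows_first_name`) and the helper functions are hypothetical choices made for illustration; they are not the attributes or the implementation used in the paper.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy context: objects are tokens, attributes are
# illustrative features (assumptions, not taken from the paper).
CONTEXT: Dict[str, FrozenSet[str]] = {
    "Kazem": frozenset({"capitalized", "in_first_name_list"}),
    "Taghva": frozenset({"capitalized", "follows_first_name"}),
    "Energy": frozenset({"capitalized"}),
    "the": frozenset(),
}


def all_intents(context: Dict[str, FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Return every concept intent of the context.

    Start from the full attribute set, then intersect each object's
    attribute set with all intents found so far. The result is the
    closure of the object intents under intersection.
    """
    all_attributes = frozenset().union(*context.values())
    intents: Set[FrozenSet[str]] = {all_attributes}
    for obj_intent in context.values():
        intents |= {obj_intent & existing for existing in intents}
    return intents


def extent(context: Dict[str, FrozenSet[str]],
           intent: FrozenSet[str]) -> FrozenSet[str]:
    """Return the objects that have every attribute in `intent`."""
    return frozenset(obj for obj, attrs in context.items() if intent <= attrs)


def concepts(
    context: Dict[str, FrozenSet[str]],
) -> List[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Return (extent, intent) pairs, ordered from general to specific."""
    pairs = [(extent(context, i), i) for i in all_intents(context)]
    return sorted(pairs, key=lambda p: (len(p[1]), sorted(p[1])))


if __name__ == "__main__":
    for ext, intent in concepts(CONTEXT):
        print(sorted(ext), "<->", sorted(intent))
```

In this toy context, the concept whose intent is `{capitalized, in_first_name_list}` has the single token `Kazem` as its extent, which is the sense in which a concept groups the objects that share a chosen set of attributes. A real name extractor would need a much richer attribute set and a treatment of token sequences, which this sketch does not attempt.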

1 Introduction

The identification of names in documents is typically part of a bigger task. For example, in our setting, we were involved in the development of two classifiers to detect privacy and sensitive unclassified information in a large digitized collection for the U.S. Department of Energy (DOE) [10]. More specifically, in the first classifier, we were looking for Personally Identifiable Information (PII), which refers to any information that identifies or can be used to identify, contact, or locate the person to whom such information pertains. This includes information that is used in a way that is personally identifiable, including linking it with identifiable information from other sources, or from which other personally identifiable information can easily be derived, including, but not limited to, name, address, phone number, fax number, email address, financial profiles, social security number, and credit card information.

Our second classifier deals with the detection and redaction of sensitive unclassified information. Sensitive Unclassified (SU) information is defined as any unclassified information that may cause adverse consequences for government facilities. The techniques used in the development of these two classifiers are explained in [19, 22].

There are many other applications similar to ours, such as MUC [5] or the 2003 Los Alamos National Lab (LANL) project on Advanced Knowledge Integration In Assessing Terrorist Threats [16], in which the task is to identify individuals who are involved in terrorist activities. Almost all of these applications deal with the extraction of certain entities such as personal name, date of birth, place of event, acronyms [20], time, and/or money. In this paper, we concentrate solely on personal name extraction. Detecting names in general is difficult be
