Knowledge Discovery with CRF-Based Clustering of Named Entities without a Priori Classes
Abstract. Knowledge discovery aims at bringing out coherent groups of entities. It is usually based on clustering, which requires defining a notion of similarity between the relevant entities. In this paper, we propose to divert a supervised machine learning technique (namely Conditional Random Fields, widely used for supervised labeling tasks) in order to compute, indirectly and without supervision, similarities among text sequences. Our approach consists in generating artificial labeling problems on the data to reveal regularities between entities through their labeling. We describe how this framework can be implemented and evaluate it on two information extraction/discovery tasks. The results demonstrate the usefulness of this unsupervised approach and open many avenues for defining similarities for complex representations of textual data.
1 Introduction
Sequence labeling tasks are of particular interest for NLP (part-of-speech tagging, semantic annotation, information extraction, etc.). Many tools have been proposed, but in recent years Conditional Random Fields (CRF [1]) have emerged as the most effective for many applications. These models rely on supervised machine learning: example sequences with their labels are required. The work presented in this paper is set in a different context, in which the goal is to bring out information from these sequences. We thus address a knowledge discovery task in which supervision is not applicable: the aim is to discover how the data can be grouped into categories that make sense, rather than providing these categories from expert knowledge. Such discovery tasks are therefore most often based on clustering [2,3,4]; the crucial question is how to compute the similarity between two entities of interest. In this paper, we propose to divert CRF by producing artificial labeling problems in order to bring out entities that are regularly labeled the same way. From these regularities, a notion of similarity between entities is then built; it is thus defined by extension (from observed co-labelings) rather than by intension (from a hand-crafted distance function). From an application point of view, in addition to its use for knowledge discovery, the similarities obtained by our approach, or the clusters produced, can be used upstream of supervised tasks:
- they can be used to reduce the cost of data annotation, since it is easier to label a cluster than to annotate a text instance by instance;
- they can help to identify classes that are difficult to discriminate or, on the contrary, to exhibit classes whose instances are very diverse; it is then possible to adapt the supervised classification task by changing the set of labels.

In the remainder of this article, we position our work with respect to the state of the art and briefly present CRF, introducing some concepts useful for the rest of the article. We then describe in Section 3 the principle of our discovery approach, which uses a supervised ML technique in an unsupervised mode.
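To make the idea above more concrete, the following is a minimal, hypothetical sketch of how such a diversion of CRF could be set up, assuming the sklearn-crfsuite library, tokenized sentences, and a set of single-token entity strings. The feature set, the way pseudo-classes are drawn for each artificial problem, and the aggregation of co-labelings into a similarity score are illustrative choices for exposition only; they are not the exact procedure described in Section 3.

```python
import random
from collections import defaultdict
from itertools import combinations

import sklearn_crfsuite


def token_features(sent, i):
    # Deliberately small, context-oriented feature set (illustrative only).
    return {
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i + 1 < len(sent) else "<EOS>",
        "word.istitle": sent[i].istitle(),
        "word.suffix3": sent[i].lower()[-3:],
    }


def artificial_problem(sentences, entities, n_classes, rng):
    # One artificial labeling problem: a random subset of entities is assigned
    # random pseudo-classes; every other token is labeled "O".
    seeds = rng.sample(sorted(entities), k=max(1, len(entities) // 2))
    assignment = {e: "C%d" % rng.randrange(n_classes) for e in seeds}
    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    y = [[assignment.get(tok, "O") for tok in s] for s in sentences]
    return X, y


def crf_similarity(sentences, entities, n_problems=50, n_classes=3, seed=0):
    # Count how often two entities receive the same (non-"O") predicted
    # pseudo-class across many artificial problems; the normalized count
    # serves as an extensionally defined similarity.
    rng = random.Random(seed)
    same = defaultdict(int)
    X_all = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    for _ in range(n_problems):
        X, y = artificial_problem(sentences, entities, n_classes, rng)
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
        crf.fit(X, y)
        pred = crf.predict(X_all)
        # Majority predicted pseudo-class for each entity in this problem.
        votes = defaultdict(lambda: defaultdict(int))
        for sent, labels in zip(sentences, pred):
            for tok, lab in zip(sent, labels):
                if tok in entities and lab != "O":
                    votes[tok][lab] += 1
        majority = {e: max(v, key=v.get) for e, v in votes.items()}
        for a, b in combinations(sorted(majority), 2):
            if majority[a] == majority[b]:
                same[a, b] += 1
    return {pair: count / n_problems for pair, count in same.items()}
```

Under these assumptions, the resulting pairwise scores can then be fed to any standard clustering algorithm (e.g., hierarchical agglomerative clustering) to obtain groups of entities such as those discussed above.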