Using Distant Supervision and Paragraph Vector for Large Scale Relation Extraction



Abstract. Distant supervision can generate a huge amount of training data. Recently, multi-instance multi-label learning has been introduced into distant supervision to combat noisy data and improve the performance of relation extraction. However, multi-instance multi-label learning uses only hidden variables when inferring relations between entities, and therefore cannot make full use of the training data. Besides, traditional lexical and syntactic features are poor at reflecting domain knowledge and the global information of a sentence, which limits the system's performance. This paper presents a novel approach to multi-instance multi-label learning that takes the idea of fuzzy classification: we use cluster centers as training data, which lets us adequately exploit sentence-level features. Meanwhile, we extend the feature set with paragraph vectors, which carry the semantic information of sentences. We conduct an extensive empirical study to verify our contributions. The results show that our method is superior to the state-of-the-art distantly supervised baseline.

Keywords: Relation extraction · Distant supervision · Paragraph vector
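As a concrete illustration of the paragraph-vector features mentioned in the abstract, the sketch below learns fixed-length sentence vectors with gensim's Doc2Vec; the toy corpus, parameter values, and recent-gensim parameter names (e.g. vector_size) are assumptions for illustration, not the configuration used in this paper.

    # Minimal paragraph-vector sketch, assuming gensim >= 4 is available.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sentences = [
        "LeBron James was born in Akron , Ohio .",
        "LeBron James is a basketball player .",
    ]
    corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
              for i, s in enumerate(sentences)]

    # Train paragraph vectors over the toy sentence corpus.
    model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=20)

    # Infer a semantic vector for an unseen sentence; such vectors can be
    # appended to the lexical and syntactic feature set of a mention.
    vec = model.infer_vector("lebron james grew up in akron .".split())
    print(len(vec))  # 100-dimensional paragraph vector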

1 Introduction

We are living in the information era, yet we still have difficulty finding knowledge. Relation extraction (RE), the process of generating structured data from plain text, continues to gain attention now that petabytes of natural-language text are readily available. Most approaches to RE use supervised learning from relation-specific examples, which can achieve high precision and recall. Unfortunately, fully supervised methods are limited by the availability of training data and are unlikely to scale to the thousands of relations found on the web. One of the most promising approaches to RE that addresses this limitation is distant supervision, which generates training data automatically by aligning a knowledge base with text (Bunescu and Mooney [1]; Mintz [2]). For example, taking Fig. 1, we would create a datum for each of the two sentences containing LEBRON JAMES and AKRON, labeled with city of birth, and likewise with city of residence, creating four training examples overall. Similarly, both sentences involving LEBRON JAMES and PLAYER would be marked as expressing the title relation.
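To make the alignment heuristic concrete, the following is a minimal sketch in Python (toy knowledge-base triples and sentences, not the pipeline used in this paper): every sentence that mentions both entities of a knowledge-base fact is collected as a noisy training example for that fact's relation.

    from collections import defaultdict

    # Toy knowledge base of (subject, relation, object) facts.
    kb = [
        ("LeBron James", "city_of_birth", "Akron"),
        ("LeBron James", "city_of_residence", "Akron"),
    ]
    sentences = [
        "LeBron James was born in Akron, Ohio.",
        "LeBron James returned home to Akron after the season.",
    ]

    def distantly_label(kb, sentences):
        # (subject, object) -> list of (sentence, relation) training examples
        examples = defaultdict(list)
        for subj, rel, obj in kb:
            for sent in sentences:
                if subj in sent and obj in sent:
                    examples[(subj, obj)].append((sent, rel))
        return examples

    # For (LeBron James, Akron) this yields 2 sentences x 2 relations = 4 noisy
    # training examples, mirroring the Fig. 1 walk-through above.
    for pair, mentions in distantly_label(kb, sentences).items():
        print(pair, mentions)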


Distant supervision introduces two challenges. The first challenge is that some training examples obtained through this assumption are not valid: although a sentence contains both entities, it may express no relation between the entity pair. The second challenge is that the same pair of entities may have multiple labels, and it is unclear which label is instantiated by any particular textual mention of the given tuple. To address these problems, Surdeanu [3] cast distant supervision as a form of multi-instance multi-label learning. However, Surdeanu's model uses only a latent label for each entity mention when inferring relations, which loses too much useful information.
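The bag structure behind this formulation can be sketched as follows (a hedged illustration with toy values, not Surdeanu et al.'s implementation): all mentions of an entity pair share one bag with possibly several gold labels, while inference assigns a single latent relation to each mention.

    from dataclasses import dataclass, field
    from typing import List, Optional, Set, Tuple

    @dataclass
    class Mention:
        sentence: str
        features: List[str]                 # lexical/syntactic features of this mention
        latent_label: Optional[str] = None  # single hidden relation chosen at inference

    @dataclass
    class Bag:
        entity_pair: Tuple[str, str]
        labels: Set[str]                    # all knowledge-base relations for the pair
        mentions: List[Mention] = field(default_factory=list)

    bag = Bag(("LeBron James", "Akron"), {"city_of_birth", "city_of_residence"})
    bag.mentions.append(Mention("LeBron James was born in Akron, Ohio.", ["born", "in"]))
    bag.mentions.append(Mention("LeBron James returned home to Akron.", ["returned", "home"]))

    # Collapsing each mention to one hard latent label discards the rest of its
    # sentence-level evidence, which is the information loss noted above.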