Towards Keyphrase Assignment for Texts in Portuguese Language

Abstract. Keyphrase assignment has often been confounded with keyphrase extraction, since the basic hypothesis is that a keyphrase of a text must be extracted from this text. Typically, keyphrase extraction approaches use a training set restricted to textual terms, reducing the learning capabilities of any inductive algorithm. Our research investigates ways to improve the accuracy of keyphrase assignment systems for texts in Portuguese by allowing classification algorithms to learn from non-textual terms as well. Our basic assumption is that non-textual terms can be included in the training set by inference from a possible semantic relationship with textual terms. To discover the latent relationship between non-textual and textual terms, we apply deductive strategies to Portuguese common-sense bases such as Wikipedia and InferenceNet. We show that algorithms following our approach outperform those that do not use the methods introduced here.

Keywords: Keyphrase extraction · Keyphrase assignment · Semantic annotation · Information retrieval

1 Introduction

The task of assigning keyphrases to a text is important because keyphrases support text categorization [1] and advertising [2], and they summarize the content so that readers can quickly grasp the subject matter [3]. Done manually, this task is tedious and time-consuming. When there is a need to consolidate a predefined vocabulary, the activity is non-trivial and its automation becomes mandatory.

Traditionally, automatic keyphrase extraction concerns “the automatic selection of important and topical phrases from the body of a document” [4]. Its goal is to extract a set of phrases that are related to the main topics discussed in a given document [5]. In fact, the task of keyphrase assignment (the discovery of keyphrases whether or not they are contained in the text) has often been confounded with keyphrase extraction, whose basic hypothesis is that a keyphrase of a text must be extracted from this text.
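The difference between the two tasks can be made concrete by checking which human-assigned keyphrases actually occur in a document. The following is a minimal sketch, not the matching procedure used in this paper: it assumes exact matching on lowercased, punctuation-stripped word sequences, and the function names, sample sentence and keyphrases are hypothetical illustrations.

```python
import re


def normalize(s):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    s = re.sub(r"[^\w\s]", " ", s.lower())
    return " ".join(s.split())


def split_keyphrases(text, keyphrases):
    """Split keyphrases into those present in the text and those absent.

    Assumption: a keyphrase is "present" only if its normalized words
    appear as a contiguous sequence of whole words in the text.
    """
    padded = " " + normalize(text) + " "
    present, absent = [], []
    for kp in keyphrases:
        found = " " + normalize(kp) + " " in padded
        (present if found else absent).append(kp)
    return present, absent


# Hypothetical example document and gold keyphrases (not from the corpora).
text = "O governo anunciou novas medidas para reduzir a inflação no país."
gold = ["inflação", "governo", "política econômica"]

present, absent = split_keyphrases(text, gold)
absent_share = len(absent) / len(gold)
print(present, absent, round(absent_share, 2))
```

A pure extraction system can at best recover the `present` list; every keyphrase in `absent` is reachable only by an assignment approach. A matching rule that also accounted for inflection (for example, stemming) would change the counts, so the share computed this way depends on the matching assumption.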

© Springer International Publishing Switzerland 2016 J. Silva et al. (Eds.): PROPOR 2016, LNAI 9727, pp. 165–176, 2016. DOI: 10.1007/978-3-319-41552-9_17

R. Silveira et al.

Our preliminary analysis of a corpus of news in Portuguese with keyphrases assigned by humans showed that approximately 20% of those keyphrases do not occur in the text. We later reinforced this conclusion by exploring a corpus of thesis and dissertation abstracts in Portuguese, in which 55% of the keyphrases assigned by the authors are not found in the text.

The literature on automatic keyphrase extraction is dominated by inductive learning (typically, classification). This kind of learning discovers patterns from examples composed of statistical, structural and syntactic features of textual terms, such as their frequency and their position in the text, and of external resource-based features computed from information gathered from other resources, such as knowledge bases (e
