Automatic Extraction of Definitions in Portuguese: A Rule-Based Approach




Abstract. In this paper we present a rule-based system for the automatic extraction of definitions from Portuguese texts. As input, the system takes text that has previously been annotated with morpho-syntactic information, namely POS and inflection features. It handles three types of definitions, in which the connector between definiendum and definiens is the so-called copula verb “to be”, a verb other than that one, or a punctuation mark. The primary goal of this system is to act as a tool for supporting glossary construction in e-learning management systems. It was tested on a collection of texts that can be taken as learning objects, in three different domains: information society, computer science for non-experts, and e-learning. Evaluation results are presented for each of these domains and for each type of definition. On average, we obtain 14% for precision, 86% for recall and 0.33 for F2 score.

1 Introduction

The aim of this paper is to present a rule-based system for the automatic extraction of definitions from Portuguese texts, together with the results of its evaluation against test data consisting of texts from the domains of computer science, information society and e-learning. In this work, a definition is assumed to be a sentence containing an expression (the definiendum) and its definition (the definiens). In line with the Aristotelian characterization, two types of definitions can typically be considered: formal and semi-formal [1]. Formal definitions follow the schema X = Y + C, where X is the definiendum, “=” is the equivalence relation expressed by some connector, Y is the Genus, the class of which X is a subclass, and C represents the characteristics that make X distinguishable from other subclasses of Y. Semi-formal definitions present a list of characteristics without the Genus. In both types, when the equivalence relation is expressed by the verb “to be”, the definition is classified as a copula definition, as exemplified below:

FTP é um protocolo que possibilita a transferência de arquivos de um local para outro pela Internet.
FTP is a protocol that allows the transfer of files from one place to another over the Internet.

R. Del Gaudio and A. Branco
J. Neves, M. Santos, and J. Machado (Eds.): EPIA 2007, LNAI 4874, pp. 659–670, 2007. © Springer-Verlag Berlin Heidelberg 2007

Definitions are not limited to this pattern [2, 3]. It is possible to find definitions expressed by:

– punctuation clues:
  TCP/IP: protocolos utilizados na troca de informações entre computadores.
  TCP/IP: protocols used in the transfer of information between computers.

– linguistic expressions other than the copular verb:
  Uma ontologia pode ser descrita como uma definição formal de objectos.
  An ontology can be described as a formal definition of objects.

– complex syntactic patterns such as apposition, inter alia:
  Os Browsers, Navegadores da Web, podem executar som.
  Browsers, tools for navigating the Web, can also reproduce sound.
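The distinction between the copula, other-verb and punctuation connectors can be illustrated with a minimal Python sketch over POS-tagged tokens. This is not the authors' grammar: the tag names (`V`, `PNT`, and so on), the list of copula forms, the set of punctuation connectors and the function `connector_type` are all assumptions made for illustration. The sketch only labels the first connector it finds and does not attempt to handle apposition.

```python
from typing import List, Optional, Tuple

# (word form, POS tag); the tag names used here are hypothetical.
Token = Tuple[str, str]

# Assumed, non-exhaustive list of finite forms of the copula "ser".
COPULA_FORMS = {"é", "são", "era", "eram", "foi", "foram"}
# Assumed set of punctuation marks that may link definiendum and definiens.
PUNCT_CONNECTORS = {":", "–", "-"}


def connector_type(sentence: List[Token]) -> Optional[str]:
    """Label the first candidate connector in a POS-tagged sentence."""
    for word, pos in sentence:
        if pos == "PNT" and word in PUNCT_CONNECTORS:
            return "punctuation"
        if pos == "V":
            # The first verb decides: a finite form of "ser" is a copula,
            # anything else counts as another verbal connector.
            return "copula" if word.lower() in COPULA_FORMS else "other_verb"
    return None


# Shortened versions of the examples above, tagged by hand.
ftp = [("FTP", "PNM"), ("é", "V"), ("um", "ART"), ("protocolo", "CN")]
tcp = [("TCP/IP", "PNM"), (":", "PNT"), ("protocolos", "CN"), ("utilizados", "V")]
onto = [("Uma", "ART"), ("ontologia", "CN"), ("pode", "V"), ("ser", "V"),
        ("descrita", "V"), ("como", "CJ")]

for example in (ftp, tcp, onto):
    print(connector_type(example))
```

A label of this kind only marks a sentence as a candidate; deciding whether it is actually a definition would require the fuller patterns over the definiendum and definiens.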

The definitions taken into account in the present work a