An Improved NER Methodology to the Portuguese Language
Rogerio de Aquino Silva 1 & Luana da Silva 1 & Moisés Lima Dutra 1 & Gustavo Medeiros de Araujo 1
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Text mining typically applies natural language processing (NLP) techniques to obtain important information and extract insights from texts. It does so by detecting patterns that are not explicit a priori in unstructured or semi-structured data. One of the most significant NLP tasks is Named Entity Recognition (NER). The NER process seeks to extract and classify the entities mentioned in a text written in natural language. The categories are predefined and can be names of people or organizations, locations, dates, monetary values, specific codes, etc. A wide range of algorithms based on the LSTM (Long Short-Term Memory) architecture has been proposed to enhance NER accuracy. However, a key component of successful information extraction is the corpora used for NER training. Another key issue is the language being worked on, since the vast majority of algorithms were designed to work with English. According to the literature, while NER applied to the English language reaches about 90% accuracy, when applied to the Portuguese language it reaches at most 83.38%. This paper proposes a methodology to improve Portuguese-based NER that uses journalistic corpora as the basis for training. We believe journalistic writing has the best adherence to the contemporary form of any language, since it preserves features such as objectivity, simplicity, and impartiality, and is a reference for transmitting information without ambiguity. The proposed methodology provides a model to extract entities and assesses the obtained results using Recurrent Neural Network architectures. To the best of our knowledge, the proposed methodology applied to the Portuguese language not only surpasses the average accuracy found in the literature, raising it from 83.38% to 85.64%, but may also decrease the computational costs of NER processing tasks.
Keywords Natural language processing · Named entity recognition · Entity extraction model · Brazilian Portuguese corpus · Recurrent neural networks
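To make the NER task described in the abstract concrete, the following minimal sketch shows how token-level tags can be grouped into classified entities. It assumes the common BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity), which this excerpt does not state the authors use. The function name `decode_bio`, the example sentence, and the tags are hypothetical illustrations, not output of the proposed model.

```python
from typing import List, Optional, Tuple

Entity = Tuple[str, Optional[str]]  # (entity text, category)


def decode_bio(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Group BIO-tagged tokens into (text, category) entities.

    Hypothetical helper: the BIO scheme is an assumption, not taken from the paper.
    """
    entities: List[Entity] = []
    current: List[str] = []   # tokens of the entity being built
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the previous one, if any.
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the entity currently being built.
            current.append(token)
        else:
            # "O" tag or an inconsistent I- tag: close any open entity.
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


# Illustrative Portuguese sentence with hand-written (hypothetical) tags.
tokens = ["A", "Universidade", "Federal", "de", "Santa", "Catarina",
          "fica", "em", "Florianópolis", "."]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG",
        "O", "O", "B-LOC", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

In this sketch, a trained tagger (for example, an LSTM-based model) would be responsible for producing `tags`; the decoding step only turns those per-token predictions into the predefined categories mentioned above, such as organizations and locations.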
1 Engineering and Data Science Lab, Federal University of Santa Catarina, Florianopolis, Brazil
1 Introduction Information retrieval (IR) emerged from efforts to facilitate large-scale data manipulation [8]. IR development still faces major challenges when it comes to Natural Language Processing (NLP). The extraction of information from Portuguese texts is still an open field of investigation. This is a consequence of the weak results of NLP models for the Portuguese language