Development of Kazakh Named Entity Recognition Models

1 Al-Farabi Kazakh National University, Almaty, Kazakhstan
2 University of International Business, Almaty, Kazakhstan
3 Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russian Federation
4 Novosibirsk State University, Novosibirsk, Russian Federation
5 Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland

Abstract. Named entity recognition is an important task in natural language processing. It has practical applications in areas such as speech recognition, information retrieval, and filtering. A variety of methods for implementing named entity recognition are now available. In this work we experimented with three models and compared the performance of machine learning based models with a probabilistic sequence modeling method on the task of Kazakh named entity recognition. The three models are based on BERT, Bi-LSTM, and a CRF baseline. In the future these models can become parts of an ensemble learning system for named entity recognition in order to achieve better performance.

Keywords: Named entity recognition · Conditional random fields · BERT · Bi-LSTM

D. Akhmed-Zaki et al.
© Springer Nature Switzerland AG 2020. N. T. Nguyen et al. (Eds.): ICCCI 2020, LNAI 12496, pp. 697–708, 2020. https://doi.org/10.1007/978-3-030-63007-2_54

1 Introduction

In the information age, with the increasing amount of digital data, the need for automatic information extraction tools is greater than ever. While a large number of information extraction tools are now available for languages such as English or Russian, the situation with Kazakh differs. Kazakh is a low-resource language and belongs to the group of agglutinative languages. In this paper we experiment on Kazakh data using different named entity recognition methods.

Currently, there are various approaches to extracting information. They are diverse, and it is difficult to say that one is better than another, since each shows good results in different situations. Information extraction approaches can be classified into the following categories:

• Rule-based approaches. Experts manually create the rule sets needed to extract certain data.
• Knowledge-based approaches. These include models based on ontologies [1] and models based on thesauri [2].
• Statistical approaches. These include hidden Markov models [3–5], conditional Markov models [6], and conditional random fields [7].
• Machine learning based approaches [8].

One of the foundational tasks in information extraction is the recognition of named entities, i.e. spans of text that are proper names of people, organizations, locations, and other objects¹. The task consists of identifying the location of names in text and recognizing their type, as illustrated in Fig. 1.

Fig. 1. A sentence with the pro
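The two parts of the task described above, locating names and assigning their types, can be made concrete with a small sketch. It assumes the common BIO tagging scheme (B- opens an entity, I- continues it, O marks tokens outside any entity); the excerpt does not state which scheme the authors use, and the function name `decode_bio`, the label set, and the example sentence are hypothetical illustrations rather than the paper's data.

```python
from typing import List, Tuple

Span = Tuple[str, str, int, int]  # (text, entity type, start token, end token exclusive)


def decode_bio(tokens: List[str], tags: List[str]) -> List[Span]:
    """Collect entity spans from per-token BIO tags (assumed scheme)."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((" ".join(tokens[start:i]), label, start, i))
            start, label = None, None
        # Open a span on B-, or leniently on an I- that has no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence with hand-assigned tags.
tokens = ["Al-Farabi", "Kazakh", "National", "University", "is", "in", "Almaty", "."]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O"]

print(decode_bio(tokens, tags))
# [('Al-Farabi Kazakh National University', 'ORG', 0, 4), ('Almaty', 'LOC', 6, 7)]
```

Any of the model families listed earlier (CRF, Bi-LSTM, BERT) can be seen as a way of predicting the `tags` sequence; a decoding step of this kind then turns the per-token predictions into typed entity spans.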
