Development of Kazakh Named Entity Recognition Models


1 Al-Farabi Kazakh National University, Almaty, Kazakhstan
2 University of International Business, Almaty, Kazakhstan
3 Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russian Federation
4 Novosibirsk State University, Novosibirsk, Russian Federation
5 Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland

Abstract. Named entity recognition is an important task in natural language processing, with practical applications in areas such as speech recognition, information retrieval and filtering. A variety of methods for implementing named entity recognition are now available. In this work we experimented with three models and compared the performance of machine learning based models and a probabilistic sequence modeling method on the task of Kazakh named entity recognition. The three models are based on BERT, Bi-LSTM and a CRF baseline. In the future these models can serve as components of an ensemble learning system for named entity recognition in order to achieve better performance.

Keywords: Named entity recognition · Conditional random fields · BERT · Bi-LSTM

© Springer Nature Switzerland AG 2020. N. T. Nguyen et al. (Eds.): ICCCI 2020, LNAI 12496, pp. 697–708, 2020. https://doi.org/10.1007/978-3-030-63007-2_54

1 Introduction

In the information age, with the increasing amount of digital data, the need for automatic information extraction tools is greater than ever. While a large number of information extraction tools are now available for languages such as English or Russian, the situation with Kazakh differs. Kazakh is a low-resourced language and belongs to the group of agglutinative languages. In this paper we experiment on Kazakh data using different named entity recognition methods.

Currently, there are various approaches to extracting information. They are diverse, and it is difficult to say that one is better than another, since each shows good results in different situations. These approaches can be classified into the following categories:

D. Akhmed-Zaki et al.

• Rule-based approaches. Experts manually create the rule sets needed to extract certain data.
• Knowledge-based approaches. These include models based on ontologies [1] and models based on thesauri [2].
• Statistical approaches. These include hidden Markov models [3–5], conditional Markov models [6] and conditional random fields [7].
• Machine learning based approaches [8].

One of the foundational tasks in information extraction is the recognition of named entities, i.e. spans of text that are proper names of people, organizations, locations and other objects¹. The task consists of identifying the location of names in text and recognizing their type, as illustrated in Fig. 1.
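The two parts of the task, locating a name and assigning its type, can be made concrete with a small sketch. The code below assumes a BIO tagging scheme (B- opens an entity, I- continues it, O marks tokens outside any entity) and the labels PER, ORG and LOC; this section does not state the scheme or the label set, so both are assumptions made for illustration. The function name `bio_to_spans` and the example sentence are likewise hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end token index exclusive, type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert per-token BIO tags into typed entity spans.

    Assumes the BIO scheme; an I- tag that does not continue the
    current entity is treated leniently as the start of a new one.
    """
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" acts as a sentinel that closes any open entity.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence with hand-assigned tags.
tokens = ["Dina", "studies", "at", "Al-Farabi", "Kazakh", "National",
          "University", "in", "Almaty"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]

for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Each returned span carries both pieces of information the task asks for: the token range gives the location of the name, and the label gives its type. A sequence model then only has to predict one tag per token.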

Fig. 1. A sentence with the pro