Recognizing Named Entities in Specific Domain
M. M. Tikhomirov¹*, N. V. Loukachevitch¹**, and B. V. Dobrov¹***
(Submitted by E. E. Tyrtyshnikov)
¹ Moscow State University, Moscow, 119991 Russia
Received March 30, 2020; revised April 12, 2020; accepted April 18, 2020
Abstract—The paper presents the results of applying the BERT representation model to the named entity recognition (NER) task for the cybersecurity domain in Russian. We compare several approaches to domain-specific NER that combine BERT fine-tuning on a domain-specific text collection, general labeled data, domain-specific data augmentation, and a domain-specific annotated dataset. We show that a BERT model fine-tuned on a domain text collection and pretrained on the combination of a general dataset and augmented data achieves the best named entity recognition results. We also study the computational performance of the BERT model in the so-called mixed precision regime.

DOI: 10.1134/S199508022008020X

Keywords and phrases: cybersecurity, named entity recognition, pretraining, augmentation.
1. INTRODUCTION

Named entity recognition (NER) is an important first step in most information extraction systems. The current main approach to NER is based on machine learning methods, which require text data annotated with named entities of several types. The majority of well-known NER datasets consist of news documents with three types of named entities labeled: person (people's names), organization (names of organizations), and location (places, mostly geographical objects) [3, 17, 20]. For these types of named entities, state-of-the-art methods usually obtain impressive results.

However, applying a NER system to a novel domain can yield a dramatic loss in accuracy. The costs of preparing an annotated corpus for domain-specific NER categories are quite large. One has to establish principles of annotation for domain-specific entities and to ensure that these principles are applied consistently. Annotation in specific domains can require special expertise [5, 22]. Besides, processing such text genres as social network posts (for example, on Twitter) and comments can lead to further degradation of the results [18, 23].

In this paper we discuss the NER task in the cybersecurity domain [21]. Compared to general datasets, several additional types of named entities were annotated for this domain: software programs, devices, technologies, hackers, and malicious programs (vulnerabilities). The obtained results of domain-specific NER are much lower than those usually achieved on general datasets, partially due to the smaller number of annotated entities. To improve NER quality under these conditions, we utilize an additional general dataset and domain-specific augmentation of the training data, by which we mean extending the training data with sentences containing automatically labeled named entities. We use the BERT transformer architecture [7] as the named entity recognition method.
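The augmentation idea described above, extending training data with sentences whose entities are labeled automatically, can be illustrated with a minimal dictionary-based labeler. This is only a sketch of the general technique; the entity lists and tag names below are hypothetical, and the paper's actual augmentation procedure may differ.

```python
# Sketch: auto-labeling sentences in BIO format by longest match against a
# type-keyed gazetteer (hypothetical example entries, not the paper's data).

def auto_label(tokens, gazetteer):
    """Assign BIO tags to tokens by longest match against the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest spans first so multi-token names win over prefixes.
        for j in range(len(tokens), i, -1):
            span = tuple(t.lower() for t in tokens[i:j])
            if span in gazetteer:
                etype = gazetteer[span]
                tags[i] = "B-" + etype
                for k in range(i + 1, j):
                    tags[k] = "I-" + etype
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return tags

gazetteer = {
    ("windows",): "PROGRAM",
    ("remote", "code", "execution"): "TECH",
}
tokens = "The exploit enables remote code execution on Windows hosts".split()
print(auto_label(tokens, gazetteer))
# ['O', 'O', 'O', 'B-TECH', 'I-TECH', 'I-TECH', 'O', 'B-PROGRAM', 'O']
```

Sentences labeled this way can then be mixed into the training set; noisy labels are the usual trade-off of such distant supervision.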
We compare the following approaches combining BERT fine-tuning on a domain-specific text collection, general labeled data, domain-specific data augmentation, and a domain-specific annotated dataset.
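Fine-tuning BERT for token classification requires aligning the word-level BIO tags with subword pieces. A common scheme (not necessarily the authors' exact setup) keeps the real tag only on the first piece of each word and marks the rest with the ignore index -100 so the loss skips them; the toy splitter below stands in for a real WordPiece tokenizer.

```python
# Sketch: aligning word-level BIO tags to subword pieces for BERT-style
# token classification. toy_wordpiece is a hypothetical stand-in tokenizer.

IGNORE = -100  # label index conventionally skipped by the training loss

def toy_wordpiece(word):
    # Hypothetical splitter: breaks words longer than 6 chars into two pieces.
    if len(word) <= 6:
        return [word]
    return [word[:6], "##" + word[6:]]

def align_tags(words, tags, tag2id):
    """Return subword pieces and per-piece label ids for one sentence."""
    pieces, labels = [], []
    for word, tag in zip(words, tags):
        subs = toy_wordpiece(word)
        pieces.extend(subs)
        labels.append(tag2id[tag])          # real tag on the first piece
        labels.extend([IGNORE] * (len(subs) - 1))  # ignore the rest
    return pieces, labels

tag2id = {"O": 0, "B-PROGRAM": 1, "I-PROGRAM": 2}
words = ["Malware", "infected", "Thunderbird"]
tags = ["O", "O", "B-PROGRAM"]
print(align_tags(words, tags, tag2id))
```

At prediction time the inverse step is applied: only the first piece of each word contributes a tag for evaluation.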