Recognizing Named Entities in Specific Domain
M. M. Tikhomirov¹*, N. V. Loukachevitch¹**, and B. V. Dobrov¹***
(Submitted by E. E. Tyrtyshnikov)
¹ Moscow State University, Moscow, 119991 Russia
Received March 30, 2020; revised April 12, 2020; accepted April 18, 2020
Abstract—The paper presents the results of applying the BERT representation model to the named entity recognition (NER) task for the cybersecurity domain in Russian. We compare several approaches to domain-specific NER that combine BERT fine-tuning on a domain-specific text collection, general labeled data, domain-specific data augmentation, and a domain-specific annotated dataset. We show that a BERT model fine-tuned on a domain text collection and pretrained on the combination of a general dataset and augmented data achieves the best named entity recognition results. We also study the computational performance of the BERT model in the so-called mixed precision regime.

DOI: 10.1134/S199508022008020X

Keywords and phrases: cybersecurity, named entity recognition, pretraining, augmentation.
1. INTRODUCTION

Named entity recognition (NER) is an important first step in most information extraction systems. The current main approach to NER is based on machine learning methods, which require text data annotated with named entities of several types. The majority of well-known NER datasets consist of news documents with three types of named entities labeled: person (people's names), organization (names of organizations), and location (places, mostly geographical objects) [3, 17, 20]. For these types of named entities, state-of-the-art methods usually obtain impressive results.

However, applying a NER system to a novel domain can yield a dramatic loss in accuracy. The costs of preparing an annotated corpus for domain-specific NER categories are quite large. One has to establish principles of annotation for domain-specific entities and to ensure that these principles are applied consistently. Annotation in specific domains can require special expertise [5, 22]. Besides, processing such text genres as social network posts (for example, on Twitter) and comments can lead to further degradation of the results [18, 23].

In this paper we discuss the NER task in the cybersecurity domain [21]. Compared to general datasets, several additional types of named entities were annotated for this domain: software programs, devices, technologies, hackers, and malicious programs (vulnerabilities). The obtained results of domain-specific NER are much lower than those usually achieved on general datasets, partially due to the smaller number of annotated entities. To improve NER quality under these conditions, we utilize an additional general dataset and domain-specific augmentation of the training data, by which we mean extending the training data with sentences containing automatically labeled named entities. We use the BERT transformer architecture [7] as the named entity recognition method.
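The augmentation idea described above, extending training data with sentences whose entities are labeled automatically, can be illustrated with a minimal dictionary-based labeler. This is only a sketch of the general technique; the entity lists and tag names below are hypothetical, and the paper's actual augmentation procedure may differ.

```python
# Sketch: auto-labeling sentences in BIO format by longest match against a
# type-keyed gazetteer (hypothetical example entries, not the paper's data).

def auto_label(tokens, gazetteer):
    """Assign BIO tags to tokens by longest match against the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest spans first so multi-token names win over prefixes.
        for j in range(len(tokens), i, -1):
            span = tuple(t.lower() for t in tokens[i:j])
            if span in gazetteer:
                etype = gazetteer[span]
                tags[i] = "B-" + etype
                for k in range(i + 1, j):
                    tags[k] = "I-" + etype
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return tags

gazetteer = {
    ("windows",): "PROGRAM",
    ("remote", "code", "execution"): "TECH",
}
tokens = "The exploit enables remote code execution on Windows hosts".split()
print(auto_label(tokens, gazetteer))
# ['O', 'O', 'O', 'B-TECH', 'I-TECH', 'I-TECH', 'O', 'B-PROGRAM', 'O']
```

Sentences labeled this way can then be mixed into the training set; noisy labels are the usual trade-off of such distant supervision.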
We compare the following approaches combining BERT fine-tuning on a domain-specific text collection, general labeled data, domain-specific data augmentation, and a domain-specific annotated dataset.
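Fine-tuning BERT for token classification requires aligning the word-level BIO tags with subword pieces. A common scheme (not necessarily the authors' exact setup) keeps the real tag only on the first piece of each word and marks the rest with the ignore index -100 so the loss skips them; the toy splitter below stands in for a real WordPiece tokenizer.

```python
# Sketch: aligning word-level BIO tags to subword pieces for BERT-style
# token classification. toy_wordpiece is a hypothetical stand-in tokenizer.

IGNORE = -100  # label index conventionally skipped by the training loss

def toy_wordpiece(word):
    # Hypothetical splitter: breaks words longer than 6 chars into two pieces.
    if len(word) <= 6:
        return [word]
    return [word[:6], "##" + word[6:]]

def align_tags(words, tags, tag2id):
    """Return subword pieces and per-piece label ids for one sentence."""
    pieces, labels = [], []
    for word, tag in zip(words, tags):
        subs = toy_wordpiece(word)
        pieces.extend(subs)
        labels.append(tag2id[tag])          # real tag on the first piece
        labels.extend([IGNORE] * (len(subs) - 1))  # ignore the rest
    return pieces, labels

tag2id = {"O": 0, "B-PROGRAM": 1, "I-PROGRAM": 2}
words = ["Malware", "infected", "Thunderbird"]
tags = ["O", "O", "B-PROGRAM"]
print(align_tags(words, tags, tag2id))
```

At prediction time the inverse step is applied: only the first piece of each word contributes a tag for evaluation.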