ORIGINAL ARTICLE

Automatic extraction of named entities of cyber threats using a deep Bi‑LSTM‑CRF network

Gyeongmin Kim¹ · Chanhee Lee¹ · Jaechoon Jo² · Heuiseok Lim¹

Received: 15 June 2019 / Accepted: 6 April 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Countless cyber threat intelligence (CTI) reports are used by companies around the world on a daily basis for security reasons. To secure critical cybersecurity information, analysts and individuals must analyze information on threats and vulnerabilities. However, analyzing such overwhelming volumes of reports requires considerable time and effort. In this study, we propose a novel approach that automatically extracts core information from CTI reports using a named entity recognition (NER) system. In constructing the proposed NER system, we defined meaningful keywords in the security domain as entities, including malware, domain/URL, IP address, hash, and Common Vulnerabilities and Exposures (CVE). Furthermore, we linked these keywords with the words extracted from the text of the reports. To achieve higher performance, we used character-level feature vectors as input to a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). We achieved an average F1-score of 75.05%. We release the 498,000 tag datasets created during our research.

Keywords  Cybersecurity · Vulnerability · Cyber threat intelligence · Named entity recognition · Bidirectional long short-term memory conditional random field
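To make the entity types listed in the abstract concrete, the following is a minimal sketch of how tokens in a CTI sentence might be labeled with entity tags. It is a simple pattern-matching illustration, not the Bi-LSTM-CRF model proposed in the paper. The regular expressions, the `B-`/`O` label names, the `tag_tokens` helper, and the example sentence are all assumptions introduced here for illustration. Only well-structured types (IP address, hash, CVE) are covered; free-form entities such as malware names cannot be captured by fixed patterns like these, which is one reason a learned tagger is attractive.

```python
import re

# Hypothetical patterns for three of the entity types named in the abstract.
# These are illustrative assumptions, not rules taken from the paper.
PATTERNS = {
    # Dotted quad; does not check that each octet is <= 255.
    "IP": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    # Hex strings with MD5 (32), SHA-1 (40), or SHA-256 (64) lengths.
    "HASH": re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$"),
    # CVE identifier: CVE-<year>-<sequence of at least four digits>.
    "CVE": re.compile(r"^CVE-\d{4}-\d{4,}$"),
}


def tag_tokens(tokens):
    """Return one (token, label) pair per token; 'O' marks non-entities."""
    tagged = []
    for token in tokens:
        label = "O"
        for name, pattern in PATTERNS.items():
            if pattern.match(token):
                # Every entity here is a single token, so only a B- tag is used.
                label = "B-" + name
                break
        tagged.append((token, label))
    return tagged


# Made-up example sentence, split on whitespace.
sentence = "The dropper exploits CVE-2017-0144 and contacts 192.0.2.10".split()
result = tag_tokens(sentence)

for token, label in result:
    print(f"{token}\t{label}")
```

A token-per-line listing with one label per token is the usual shape of NER training data, so output like this can also serve as a quick sanity check on a labeling scheme before training a sequence model.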

* Heuiseok Lim [email protected]
Gyeongmin Kim [email protected]
Chanhee Lee [email protected]
Jaechoon Jo [email protected]

¹ Korea University, Anam‑dong, Seongbuk‑gu, Seoul 02841, Republic of Korea
² Hanshin University, 137, Hanshindae‑gil, Osan‑si 18101, Republic of Korea

1 Introduction

Cyber properties such as IP addresses, URLs, and private data are continuously under threat from malware, viruses, and malicious actors. The use of unsecured data or websites makes users vulnerable to hackers. Users are rarely capable of detecting such attacks and lack information regarding attack patterns and methods. Recent cyber threats are aimed not only at individual users but also at businesses regardless of their scale [13]. For this reason, people should always be aware of cyber threats and vulnerabilities.

Cyber threat intelligence (CTI) reports provide useful data, information, and insight into cybersecurity, including important keywords such as malware names, attack schemes, and the IP addresses of attackers and victims. Extracting such significant entities from CTI reports through a structural methodology is valuable for professional practitioners and a necessary step in cybersecurity research.

In studies applying various text mining and machine learning methods, researchers have recently attempted to extract key entities optimized within the cybersecurity domain [30, 34]. Traditional statistical-based extraction methods that rely on feature engineeri