ORIGINAL ARTICLE

Enhancing data quality in real-time threat intelligence systems using machine learning

Ariel Rodriguez¹ · Koji Okamura¹

Received: 25 May 2020 / Revised: 26 October 2020 / Accepted: 28 October 2020

© Springer-Verlag GmbH Austria, part of Springer Nature 2020

Abstract

In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing research-based cyber threat intelligence systems and production systems often use keyword filtering either to obtain training data for a classification model or as a classifier in itself. This method is known to produce false positives that degrade data quality and can therefore cause downstream issues for the security analysts who use these systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec-generated associated words list, which assists in the classification of ambiguous posts to reduce false positives while still retrieving data of broad scope. We then use k-means clustering on positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating the effects of segmenting the data by its characteristics to achieve better classification. Together, these methods create a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.

Keywords  Data mining · Social media · Cyber threat intelligence · Machine learning
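The first stage described in the abstract, keyword filtering backed by an associated words list, can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: the keyword set, the hand-written `ASSOCIATED` dictionary (a stand-in for a list that would be generated with word2vec), the function name `is_threat_related`, and the decision rule itself (a keyword match is accepted only when an associated word also appears in the post). The k-means clustering stage is not shown.

```python
import re

# Hypothetical seed keywords; the paper's actual keyword list is not reproduced here.
KEYWORDS = {"virus", "trojan", "exploit", "breach"}

# Hand-written stand-in for a word2vec-generated associated words list.
# In practice these would be words close to each keyword in the embedding space.
ASSOCIATED = {
    "virus": {"malware", "antivirus", "payload", "ransomware"},
    "trojan": {"malware", "payload", "infected", "backdoor"},
    "exploit": {"vulnerability", "cve", "patch", "payload"},
    "breach": {"leaked", "credentials", "database", "hackers"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_threat_related(post):
    """Assumed rule: accept a post only if it contains a keyword AND
    at least one word associated with that keyword."""
    tokens = tokenize(post)
    for keyword in tokens & KEYWORDS:
        if tokens & ASSOCIATED[keyword]:
            return True
    return False


if __name__ == "__main__":
    posts = [
        "New trojan payload found in infected email attachments",
        "The flu virus kept me home all week",
        "Lovely weather today",
    ]
    for post in posts:
        print(is_threat_related(post), "-", post)
```

Under this rule, a plain keyword filter would accept the second post because it contains "virus", whereas the associated-words check rejects it because no security-related context word is present. The rule is deliberately simple and would still misclassify posts that share vocabulary with the security domain.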

¹ Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan

1 Introduction

The task of protecting the computer systems that run our society often falls to security operations centers (SOCs) and the analysts and engineers who work within them. Security analysts are tasked with performing a range of functions such as log monitoring, testing, threat detection, and investigation. With the increase in data from companies' growing technological footprint and migration to data-driven systems, security teams are struggling to investigate the large number of alerts they receive. Among organizations that receive daily alerts, 44% of alerts cannot be investigated, and of those investigated, only 34% are found to be legitimate (Cisco 2018 Annual Cybersecurity Report 2018). Furthermore, 78% of analysts state that it takes 10+ minutes to investigate an alert, leaving analysts with what is referred to as alert overload (The impact of security alert overload 2019). Given these statistics, it is not surprising that the most important activities for SOCs are considered to be the minimization of false positives, threat intelligence