A comparative evaluation of novelty detection algorithms for discrete sequences

  • PDF / 2,476,566 Bytes
  • 26 Pages / 439.37 x 666.142 pts Page_size
  • 52 Downloads / 200 Views

DOWNLOAD

REPORT


A comparative evaluation of novelty detection algorithms for discrete sequences Rémi Domingues1

· Pietro Michiardi1 · Jérémie Barlet2 · Maurizio Filippone1

© Springer Nature B.V. 2019

Abstract The identification of anomalies in temporal data is a core component of numerous research areas such as intrusion detection, fault prevention, genomics and fraud detection. This article provides an experimental comparison of candidate methods for the novelty detection problem applied to discrete sequences. The objective of this study is to identify which state-of-the-art methods are efficient and appropriate candidates for a given use case. These recommendations rely on extensive novelty detection experiments based on a variety of public datasets in addition to novel industrial datasets. We also perform thorough scalability and memory usage tests resulting in new supplementary insights of the methods’ performance, key selection criteria to solve problems relying on large volumes of data and to meet the expectations of applications subject to strict response time constraints. Keywords Novelty detection · Discrete sequences · Temporal data · Fraud detection · Outlier detection · Anomaly detection

1 Introduction Novelty detection is an unsupervised learning problem and an active research area (Aggarwal 2015; Hodge and Austin 2004). Given a set of training samples, novelty detection is the task of classifying new test samples as nominal when the test data relates to the training set, or as anomalous when they significantly differ. Anomalous data is called novelties or anomalies and is assumed to be generated by a different generative process. Since novelty detection

B

Rémi Domingues [email protected] Pietro Michiardi [email protected] Jérémie Barlet [email protected] Maurizio Filippone [email protected]

1

Department of Data Science, EURECOM, 450 Route des Chappes, Sophia Antipolis, France

2

Amadeus, 485 Route du Pin Montard, Sophia Antipolis, France

123

R. Domingues et al.

can be considered a one-class classification problem, it has also been described as a semisupervised problem (Chandola et al. 2012) when the training set is exempt of outliers. While most anomaly detection problems deal with numerical data (Emmott et al. 2016; Breunig et al. 2000; Ramaswamy et al. 2000), novelty detection methods have been successfully applied to categorical data (Hodge and Austin 2004), time-series (Marchi et al. 2015; Kundzewicz and Robson 2004; Taylor and Letham 2018), discrete sequences (Chandola et al. 2008; Warrender et al. 1999; Cohen 1995) and mixed data types (Domingues et al. 2018). This paper surveys the problem of detecting anomalies in temporal data, specifically in discrete sequences of events which have a temporal order. Such a problem can be divided into two categories. The first one is change point detection. In this problem, the dataset is a single and often long sequence in which we seek anomalous subsequences composed of contiguous events. Anomalies denote a sudden change of beh