Randomized outlier detection with trees
- PDF / 488,483 Bytes
- 14 Pages / 595.276 x 790.866 pts Page_size
- 58 Downloads / 213 Views
REGULAR PAPER
Randomized outlier detection with trees Sebastian Buschjäger1
· Philipp-Jan Honysz1
· Katharina Morik1
Received: 3 July 2020 / Accepted: 31 October 2020 © The Author(s) 2020
Abstract Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature including the extended isolation forest (EIF) as well as the SCiForest. However, we find a lack of theoretical explanation on why IF, EIF, and SCiForest offer such good practical performance. In this paper, we present a theoretical framework that views these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then secondly approximate the coefficients of mixture components using the average path length. Using this framework, we derive the generalized isolation forest (GIF) that also trains random isolation trees, but combining them moves beyond using the average path length. That is, GIF splits the data into multiple sub-spaces by sampling random splits as do the original IF variants do and directly estimates the mixture coefficients of a mixture distribution to score the outlierness on entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and has a competitive performance to other nearest-neighbor approaches while having a lower runtime. Last, we highlight a use-case study that uses GIF to detect transaction fraud in financial data. Keywords Outlier detection · Isolation forest · Density estimation · Ensemble · Tree
1 Introduction Outlier detection is an important data mining task and often one of the first steps when acquiring new data. In the financial sector, it is used to detect transactional fraud [24], money laundering [12,13], and to solve many other related problems [4]. Due to its simplicity and speed, isolation forest (IF) is one of the most popular outlier detection algorithms [16]. It operates on the key observation that decision trees tend to Sebastian Buschjäger and Philipp-Jan Honysz have contributed equally to this work.
B
Sebastian Buschjäger [email protected] Philipp-Jan Honysz [email protected] Katharina Morik [email protected]
1
Artificial Intelligence Unit, TU Dortmund, Dortmund, Germany
isolate outlier examples relatively early in the tree. Thus, the path length of an example when sorted into a tree gives a (somewhat crude) indication of the outlierness of the observation. IF leverages this empirical insight into an ensemble algorithm that trains multiple isolation trees on bootstrap samples and scores observations based on the
Data Loading...