A New Algorithm for High-Dimensional Outlier Detection Based on Constrained Particle Swarm Intelligence

In this paper we present an algorithm for outlier detection in high-dimensional spaces based on constrained particle swarm optimization techniques. The concept of outliers is defined as sparsely populated patterns in lower dimensional subspaces. The searc

  • PDF / 320,591 Bytes
  • 8 Pages / 430 x 660 pts Page_size
  • 46 Downloads / 181 Views

DOWNLOAD

REPORT


Abstract. In this paper we present an algorithm for outlier detection in high-dimensional spaces based on constrained particle swarm optimization techniques. The concept of outliers is defined as sparsely populated patterns in lower dimensional subspaces. The search for best abnormally sparse subspaces is done by an innovative use of particle swarm optimization methods with a specifically designed particle coding and conversion strategy as well as some dimensionality-preserving updating techniques. Experimental results show that the proposed algorithm is feasible and effective for high-dimensional outlier detection problems.

1

Introduction

Outlier detection has now become a hot issue in the area of data mining with numerous applications. Most of the existing algorithms for outlier detection use concepts of proximity to define and detect outliers, such as the notion of distancebased or density-based outliers [1]-[6]. However, as explained in [7], these algorithms are not appropriate for high dimensional cases because the data are sparse and the notion of proximity fails to retain effectiveness. Actually, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions [6][7]. Given the fact that most applications as mentioned above are high dimensional problems, it is of significance to reasonably define outliers in a high dimensional space and to design specific algorithms for their detection. Recent research results indicate that mining outliers in lower-dimensional projections is feasible for high dimensional data. On one hand, it is in practice not qualitatively meaningful to detect outliers in full dimensional space. On the other hand, only the subsets of the attributes are affected in some applications, such as credit card fraud, and so on. Along this line, Aggarwal and Yu [7] presented a new way of defining outliers in a high dimensional space as well as a new technique for their detection. The point is to observe the density distributions of projections from the data that could be measured by the so called sparsity coefficients and a data point is considered an outlier, if it is located in some abnormally low density subspace. Hence, the outlier detection in this context boils down to finding those combinations of dimensions with most abnormally sparse data. This turns out to G. Wang et al. (Eds.): RSKT 2008, LNAI 5009, pp. 516–523, 2008. c Springer-Verlag Berlin Heidelberg 2008 

A New Algorithm for High-Dimensional Outlier Detection

517

be a very difficult combinatorial optimization problem since the combinations of dimensions exponentially increase with increasing dimensionality and it is hard to examine all possible subsets of dimensions. Actually, a brute-force algorithm will become computationally untenable even for a problem with moderate dimensionality. To solve this problem, Aggarwal and Yu [7] developed an effective evolutionary computation based algorithm, denoted hereafter as Geno , by making an innovative use of genetic al