Uncertain distance-based outlier detection with arbitrarily shaped data objects


Fabrizio Angiulli1 · Fabio Fassetti1

Received: 7 February 2020 / Revised: 24 September 2020 / Accepted: 24 September 2020 / © The Author(s) 2020

Abstract Enabling information systems to face anomalies in the presence of uncertainty is a compelling and challenging task. In this work we consider the problem of unsupervised outlier detection in large collections of data objects modeled by means of arbitrary multidimensional probability density functions. We present a novel definition of uncertain distance-based outlier under the attribute-level uncertainty model, according to which an uncertain object always exists but its actual value is modeled by a multivariate pdf. According to this definition, an uncertain object is declared to be an outlier on the basis of the expected number of its neighbors in the dataset. To the best of our knowledge, this is the first work that considers the unsupervised outlier detection problem on data objects modeled by means of arbitrarily shaped multidimensional distribution functions. We present the UDBOD algorithm, which efficiently detects the outliers in an input uncertain dataset by taking advantage of three optimized phases: parameter estimation, candidate selection, and candidate filtering. We report an experimental campaign that includes a sensitivity analysis, a study of the effectiveness of the technique, a comparison with related algorithms, also in the presence of high-dimensional data, and a discussion of the behavior of our technique in real-case scenarios.

Keywords Nearest neighbors · Outlier detection · Uncertain data · Unsupervised learning
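The criterion stated in the abstract, that an uncertain object is judged by the expected number of its neighbors, can be illustrated with a minimal Monte Carlo sketch. This is not the UDBOD algorithm and makes no use of its three phases. The radius `R`, the threshold `K`, the rule "outlier if the expected count is below `K`", and the helper names `gaussian`, `uniform_box` and `expected_neighbors` are all assumptions introduced here for illustration. Each uncertain object is represented only by a function that draws one realization from its pdf, so differently shaped distributions can be mixed.

```python
import math
import random


def gaussian(mean, sigma):
    """Hypothetical helper: isotropic Gaussian pdf, returned as a sampler."""
    def draw(rng):
        return tuple(rng.gauss(m, sigma) for m in mean)
    return draw


def uniform_box(center, half_width):
    """Hypothetical helper: uniform pdf on an axis-aligned box, as a sampler."""
    def draw(rng):
        return tuple(c + rng.uniform(-half_width, half_width) for c in center)
    return draw


def expected_neighbors(samplers, i, radius, n_draws=2000, rng=None):
    """Monte Carlo estimate of the expected number of objects lying within
    `radius` of object i: the sum over j != i of Pr(dist(X_i, X_j) <= radius)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    total = 0.0
    for j, sample_j in enumerate(samplers):
        if j == i:
            continue
        hits = 0
        for _ in range(n_draws):
            # Draw one realization of each object and compare their distance.
            if math.dist(samplers[i](rng), sample_j(rng)) <= radius:
                hits += 1
        total += hits / n_draws
    return total


# Toy dataset: three objects close together and one isolated object.
samplers = [
    gaussian((0.0, 0.0), 0.1),
    gaussian((0.2, 0.1), 0.1),
    uniform_box((0.1, -0.1), 0.1),
    gaussian((5.0, 5.0), 0.1),
]

R, K = 1.0, 1.0  # assumed radius and neighbor threshold
scores = [expected_neighbors(samplers, i, R) for i in range(len(samplers))]
outliers = [i for i, s in enumerate(scores) if s < K]  # assumed decision rule
```

On this toy data the isolated object is the only one whose estimated expected neighbor count falls below the threshold. Plain sampling of this kind costs a number of draws proportional to the number of object pairs, which is the kind of cost an optimized algorithm would aim to reduce.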

A preliminary version of this work appears in Angiulli and Fassetti (2013).

1 DIMES, University of Calabria, 87036, Rende, CS, Italy

Journal of Intelligent Information Systems

1 Introduction

Traditional data analysis techniques deal with feature vectors having deterministic values, so data uncertainty is usually ignored in the problem formulation. However, uncertainty arises in real data in many ways, since the data may contain errors or may be only partially complete (Lindley 2006). Uncertainty may result from the limitations of the equipment: physical devices are often imprecise due to measurement errors. Another source of uncertainty is repeated measurements, e.g. sea surface temperature could be recorded multiple times during a day. Also, in some applications data values change continuously, such as the positions of devices or observations associated with natural phenomena, and these quantities can be represented by using an uncertain model. Simply disregarding uncertainty may lead to less accurate or even incorrect conclusions. This has created a need for uncertain data management techniques (Aggarwal and Yu 2009) managing data records typically represented by probability distributions (Mohri 2003; Kriegel and