Mixture Models in Forward Search Methods for Outlier Detection

Daniela G. Calò

Forward search (FS) methods have been shown to be useful for detecting multiple outliers in continuous multivariate data (Hadi, 1994; Atkinson et al., 2004). Starting from an outlier-free subset of observations, they iteratively enlarge this subset …



1 Introduction

Mixtures of multivariate normal densities are widely used in cluster analysis, density estimation and discriminant analysis, usually resorting to maximum likelihood (ML) estimation via the EM algorithm (for an overview, see McLachlan and Peel, 2000). When the number of components K is treated as fixed, ML estimation is not robust against outlying data: a single extreme point can make the parameter estimation of at least one of the mixture components break down. Among the solutions presented in the literature, the main computable approaches in the multivariate setting are: the addition of a noise component, modelled as a uniform distribution on the convex hull of the data, implemented in the software MCLUST (Fraley and Raftery, 1998); and a mixture of t-distributions instead of normal distributions, implemented in the software EMMIX (McLachlan and Peel, 2000). According to Hennig (2004), both alternatives “... do not possess a substantially better breakdown behavior than estimation based on normal mixtures”.

An alternative approach to the problem is based on the idea that a good outlier detection method defines a robust estimation method, which works by omitting the observations nominated as outliers and computing a standard non-robust estimate on the remaining observations. Here, attention is focussed on the so-called forward search (FS) methods, which have been usefully employed for detecting multiple outliers in continuous multivariate data. These methods are based on the assumption that
non-outlying data stem from a multivariate normal distribution, or are at least roughly elliptically symmetric. In this paper, an alternative formulation of the FS algorithm is proposed, specifically designed for situations where non-outlying data stem from a mixture of a known number of normal components. This could not only enlarge the applicability of FS outlier detection methods, but could also provide a possible strategy for robust fitting of multivariate normal mixture models.
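The non-robustness of ML estimation discussed in this section can be seen already in the single-component case (K = 1). The following numpy sketch (my own illustration, not taken from the paper) shows how one extreme point drags the sample mean, the ML estimate of location, far from the truth:

```python
import numpy as np

# Illustration of the zero breakdown point of ML estimation (K = 1 case):
# a single observation can drag the sample mean arbitrarily far.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 2))        # well-behaved N(0, I) sample
outlier = np.array([[1000.0, 1000.0]])   # one extreme point
contaminated = np.vstack([clean, outlier])

print(clean.mean(axis=0))         # close to (0, 0)
print(contaminated.mean(axis=0))  # roughly (9.9, 9.9): pulled toward the outlier
```

Replacing 1000 by any larger value moves the estimate arbitrarily far, which is what a breakdown point of zero means; a mixture with fixed K inherits the same weakness through the component-wise ML updates of EM.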

2 The Forward Search

The forward search (FS) is a powerful general method for detecting multiple masked outliers in continuous multivariate data (Hadi, 1994; Atkinson, 1993). The search starts by fitting the multivariate normal model to a small subset Sm, consisting of m = m0 observations, that can safely be presumed free of outliers; this subset can be specified by the data analyst or obtained by an algorithm. All n observations are then ordered by their Mahalanobis distance from the fitted model, and Sm is updated as the set of the m + 1 observations with the smallest distances. The number m is thus increased by 1 and the search continues, fitting the normal model to the current subset Sm and updating Sm as stated above – so that its size grows by one unit at a time – until Sm includes all n observations (that is, m = n). By ordering the data according to their closeness to the fitted model (by means of the Mahalanobis distance), the various steps of the search provide subsets which are designed to be
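The iterative scheme described above can be sketched in a few lines of numpy. This is an illustrative implementation, not the authors' code; in particular, the function name `forward_search` and the median-based choice of the starting subset are my own assumptions:

```python
import numpy as np

def forward_search(X, m0=None):
    """Run a basic forward search on data X (n x p).

    Returns the list of subsets Sm visited, one per step, as sorted index lists.
    """
    n, p = X.shape
    if m0 is None:
        m0 = p + 1  # smallest subset giving a nonsingular covariance (assumption)
    # Crude outlier-free start: the m0 points closest to the coordinate-wise median.
    center = np.median(X, axis=0)
    subset = np.argsort(np.linalg.norm(X - center, axis=1))[:m0]
    snapshots = [sorted(subset.tolist())]
    for m in range(m0, n):
        # Fit the multivariate normal model to the current subset Sm.
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False) + 1e-8 * np.eye(p)  # small ridge for stability
        inv = np.linalg.inv(cov)
        # Squared Mahalanobis distances of all n observations from the fit.
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)
        # Update Sm as the m + 1 observations with the smallest distances.
        subset = np.argsort(d2)[:m + 1]
        snapshots.append(sorted(subset.tolist()))
    return snapshots

# Usage: 50 clean bivariate normal points plus 3 planted outliers at (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), np.full((3, 2), 10.0)])
snaps = forward_search(X)
```

With this setup the planted outliers enter the subset only in the very last steps; in practice, detection is based on monitoring quantities such as the minimum Mahalanobis distance of the observations outside Sm along the search.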