Study on Statistical Outlier Detection and Labelling


Paweł D. Domański
Institute of Control and Computation Engineering, Warsaw University of Technology, Warsaw 00-665, Poland

Abstract: Outliers accompany control engineers in their everyday practice. Industrial reality is much richer than elementary linear, quadratic, Gaussian assumptions. Outliers appear for various and varying, often unknown, reasons. They attract research interest in statistical and regression analysis and in data mining. There are many interesting algorithms and approaches to outlier detection, labelling, filtering and, finally, interpretation. Unfortunately, their impact on control systems has not received sufficient attention in research. Their influence frequently goes unnoticed, is ignored or is not mentioned. This work focuses on outlier detection and labelling in the context of control system performance analysis. Selected statistical data-driven approaches are analyzed, as they can be easily implemented with limited a priori knowledge. The work consists of a simulation study followed by the analysis of real control data. Different generation mechanisms are simulated, such as overlapping Gaussian processes, symmetric and asymmetric, artificially shifted points and fat-tailed distributions. Simulation observations are compared with datasets from industrial control loops. The work concludes with a practical procedure, which should help practitioners deal with outliers in control engineering temporal data.

Keywords: Outlier detection, control loop quality, statistical analysis, robust estimation, heavy tails.

 

1 Introduction

An outlier is a strange phenomenon, and different perspectives may lead to different interpretations. Simple definitions describe outliers as values "dubious in the eyes of the researcher" (Dixon[1]) or as "contaminants" (Weiner[2]). One of the most popular definitions was formulated by Hawkins[3], who calls an outlier "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism". Johnson and Wichern[4] define an outlier as "an observation in a data set which appears to be inconsistent with the remainder of that set of data". Barnett and Lewis[5] say that "an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs". As one can see, outliers go by various other names, for instance anomalies, contaminants or fringeliers, the last reflecting "unusual events which occur more often than seldom"[2].

These strange phenomena may have disastrous effects on any further data analysis[6]. They may increase signal variance and reduce the power of statistical tests performed during the analysis[7]. They destroy signal normality and introduce fat tails[8]. Finally, Rousseeuw and Leroy[9] point out that they significantly bias regression analysis. Following the presented definitions, we may try to investigate their or