Data Preprocessing


Abstract

In almost all real applications, data contain errors and noise, need to be scaled and transformed, or need to be collected from different and possibly heterogeneous information sources. We distinguish deterministic and stochastic errors. Deterministic errors can sometimes be easily corrected. Outliers need to be identified and removed or corrected. Outliers or noise can be reduced by filtering. We distinguish several filtering methods that differ in effectiveness and computational complexity: moving statistical measures, discrete linear filters, finite impulse response filters, and infinite impulse response filters. Data features with different ranges often need to be standardized or transformed.

3.1 Error Types

Data often contain errors that may cause incorrect data analysis results. We distinguish stochastic and deterministic errors. Examples of stochastic errors are measurement or transmission errors, which can be modeled by additive noise. The left view of Fig. 3.1 shows again the data set from the right view of Fig. 2.5. To mimic a corresponding data set containing stochastic errors, we generate Gaussian noise data using a random generator that produces data following a Gaussian distribution with mean zero and standard deviation 0.1, a so-called $N(0, 0.1)$ distribution. The middle and right views of Fig. 3.1 show the Gaussian noise data and the data obtained by adding the noise to the original data, respectively. The original (left) and noisy (right) data look very similar, and in fact low noise often has only little impact on data analysis results. Another type of problematic data are outliers, which are defined as individual data with large deviations from normal. Outliers may be caused by stochastic or deterministic effects, for example by extreme individual measurement errors or by packet losses in data transmission. In manual data assessment outliers may be caused when individual data are

© Springer Fachmedien Wiesbaden 2016, T.A. Runkler, Data Analytics, DOI 10.1007/978-3-658-14075-5_3

Fig. 3.1 Original data, Gaussian noise, and noisy data
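The construction behind Fig. 3.1 can be sketched in a few lines of Python. This is a minimal sketch and not the code used to produce the figure: the data set from Fig. 2.5 is not reproduced here, so the series `original` is a hypothetical stand-in, and the function name `add_gaussian_noise` and the seed are assumptions made for illustration only.

```python
import math
import random
import statistics

def add_gaussian_noise(data, sigma=0.1, seed=0):
    """Return (noise, noisy): zero-mean Gaussian noise with standard
    deviation sigma, and the data with that noise added element-wise."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    noise = [rng.gauss(0.0, sigma) for _ in data]
    noisy = [x + e for x, e in zip(data, noise)]
    return noise, noisy

# Hypothetical stand-in for the original data (not the series of Fig. 2.5).
original = [0.5 * math.sin(0.1 * k) for k in range(301)]

noise, noisy = add_gaussian_noise(original, sigma=0.1, seed=42)

# The sample statistics of the noise should be close to mean 0 and
# standard deviation 0.1, but will not match them exactly.
print(statistics.mean(noise), statistics.stdev(noise))
```

Printing the sample mean and standard deviation of `noise` is a quick check that the generator behaves as intended before the noisy data are used further.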

Fig. 3.2 Original data, outliers, and drift
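The two error types named in the caption of Fig. 3.2 can be imitated on a synthetic series as follows. This is only a sketch under assumptions: the series `original`, the outlier positions and value, the linear form of the drift, and the helper names `inject_outliers` and `add_linear_drift` are all hypothetical and are not taken from the figure.

```python
import math

def inject_outliers(data, positions, value):
    """Copy data and overwrite the samples at the given positions
    with a single extreme value."""
    out = list(data)
    for k in positions:
        out[k] = value
    return out

def add_linear_drift(data, slope):
    """Copy data and add an offset that grows linearly with the
    sample index (one simple, assumed model of drift)."""
    return [x + slope * k for k, x in enumerate(data)]

# Hypothetical stand-in for the original data.
original = [0.5 * math.sin(0.1 * k) for k in range(301)]

# A few isolated samples deviate strongly from their neighbours.
with_outliers = inject_outliers(original, positions=[50, 150, 250], value=1.0)

# Every sample is shifted, and the shift grows over the series.
with_drift = add_linear_drift(original, slope=0.001)
```

The sketch makes the structural difference visible: outliers affect a few individual samples, whereas drift changes all samples by a systematically growing amount.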

stored in the wrong data fields, or by typos, for example when the decimal point is put at the wrong position. Decimal point errors may also be caused by a deterministic effect, for example when data are exchanged between systems using different meanings for the . and , characters, so 1.234 might be transformed to 1,234, which may refer to either $1.234 \cdot 10^0$ or $1.234 \cdot 10^3$, depending on country-specific notation; the two readings differ by a factor of 1000. Other types of deterministic errors include the use of wrong formulas for the computation of derived data, or measurement errors caused by wrong calibration, wrong scaling, or sensor drift. Data with such deterministic error