Data Preprocessing
Abstract
In almost all real applications, data contain errors and noise, need to be scaled and transformed, or need to be collected from different and possibly heterogeneous information sources. We distinguish deterministic and stochastic errors. Deterministic errors can sometimes be easily corrected. Outliers need to be identified and removed or corrected. Outliers or noise can be reduced by filtering. We distinguish several filtering methods with different effectiveness and computational complexity: moving statistical measures, discrete linear filters, finite impulse response filters, and infinite impulse response filters. Data features with different ranges often need to be standardized or transformed.
3.1 Error Types
Data often contain errors that may cause incorrect data analysis results. We distinguish stochastic and deterministic errors. Examples of stochastic errors are measurement or transmission errors, which can be modeled by additive noise. The left view of Fig. 3.1 shows again the data set from the right view of Fig. 2.5. To mimic a corresponding data set containing stochastic errors, we generate Gaussian noise data using a random generator that produces data following a Gaussian distribution with mean zero and standard deviation 0.1, a so-called N(0, 0.1) distribution. The middle and right views of Fig. 3.1 show the Gaussian noise data and the data obtained by adding the noise to the original data, respectively. The original (left) and noisy (right) data look very similar, and in fact low noise often has only little impact on data analysis results. Another type of problematic data are outliers, which are defined as individual data with large deviations from normal. Outliers may be caused by stochastic or deterministic effects, for example by extreme individual measurement errors, or by packet losses in data transmission.
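To make the additive noise model concrete, the following is a minimal sketch in Python with NumPy (an assumption, since the chapter itself prescribes no particular tool); the sinusoidal series x merely stands in for the original data of Fig. 3.1:

    import numpy as np

    rng = np.random.default_rng(42)                      # reproducible random generator
    x = np.sin(np.linspace(0, 4 * np.pi, 30))            # stand-in for the original data
    noise = rng.normal(loc=0.0, scale=0.1, size=x.size)  # N(0, 0.1) Gaussian noise
    x_noisy = x + noise                                  # additive stochastic error model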
Fig. 3.1 Original data, Gaussian noise, and noisy data
Fig. 3.2 Original data, outliers, and drift
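A value that deviates strongly from the rest of the data, as in the middle view of Fig. 3.2, can be flagged with a simple threshold rule. The following minimal sketch (the rule, the factor k = 2, and the sample data are illustrative assumptions, not the chapter's method) marks values whose deviation from the sample mean exceeds k standard deviations:

    import numpy as np

    def flag_outliers(x, k=2.0):
        # Flag values deviating from the mean by more than k standard deviations.
        x = np.asarray(x, dtype=float)
        return np.abs(x - x.mean()) > k * x.std()

    data = np.array([0.1, 0.2, 0.15, 5.0, 0.18, 0.12])  # one obvious outlier
    print(flag_outliers(data))  # [False False False  True False False]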
In manual data assessment, outliers may be caused when individual data are stored in the wrong data fields, or by typos, for example when the decimal point is put at the wrong position. Decimal point errors may also be caused by a deterministic effect, for example when data are exchanged between systems that use different meanings for the "." and "," characters: 1.234 might be transformed to 1,234, which may refer to 1.234 · 10^0 or 1.234 · 10^3, depending on country-specific notation, and therefore differ by a factor of 1000. Other types of deterministic errors include the use of wrong formulas for the computation of derived data, or measurement errors caused by wrong calibration, wrong scaling, or sensor drift. Data with such deterministic errors can sometimes be easily corrected.
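The factor-of-1000 ambiguity is easy to reproduce. A minimal sketch (the parsing rules are deliberately simplified stand-ins for full locale handling):

    s = "1,234"

    # English convention: "," is a thousands separator -> 1234.0 = 1.234 * 10^3
    value_en = float(s.replace(",", ""))

    # German convention: "," is the decimal separator -> 1.234 = 1.234 * 10^0
    value_de = float(s.replace(",", "."))

    print(value_en / value_de)  # 1000.0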