Data Quality Visualization for Preprocessing
Abstract. Preprocessing is often the most time-consuming phase in data analysis, and interdependent data quality issues are a cause of suboptimal modelling results. The design problem addressed in this paper is: what kind of framework can support visualization of data quality issue interdependencies for faster and more effective preprocessing? An object framework was designed that uses constructed features as the basis of visualizations. Six real datasets from the business performance measurement system domain were acquired to demonstrate the implementation. The framework was found to be a viable preprocessing analysis supplement to both the industry practice of exploratory data analysis and the research benchmark of preprocessing combinations.

Keywords: Preprocessing · Data quality visualization · Feature construction
1 Introduction

Preprocessing can take as much as 85 % of the duration of a knowledge discovery project [35]. Although the importance of good data quality practices has long been recognized in business [46], preprocessing a given dataset and its possible data quality issues, such as missing values, outliers, noise, insufficient variance, duplicates, and class imbalance, often requires laborious manual work. Preprocessing can also be a cause of suboptimal results in knowledge discovery tasks. Several papers [12, 24, 45, 48, 49] envision that data preprocessing should have tools specifically designed for it, which would report data quality observations, guide the preprocessing steps needed, and evaluate preprocessing outcomes.

From the knowledge discovery research point of view, unrecognized or unresolved data quality problems add to the uncertainty of research findings. The complexity of data production, acquisition, and integration processes demands recognition of multiple data quality dimensions simultaneously. Only a few of the dimensions, such as incompleteness, have computational operationalizations (e.g. labeling missing values, see [3]), and there is no solid theoretical guidance for operationalizing data quality dimensions for preprocessing. Most importantly, data quality issues can be interdependent, and a method for empirical identification of these interdependencies is a gap in current knowledge.

The design problem addressed in this paper is: what kind of framework can support visualization of data quality issue interdependencies for faster and more effective preprocessing? The design problem is operationalized through its elements. First, the focus is on visualization. Secondly, data quality is understood either as compliance to basic

© Springer International Publishing Switzerland 2016
P. Perner (Ed.): ICDM 2016, LNAI 9728, pp. 428–437, 2016. DOI: 10.1007/978-3-319-41561-1_32
data quality requirements (e.g. regarding the acceptable amount of missing values) or fitness for purpose in knowledge discovery tasks (e.g. classification accuracy achieved with the data). Data quality measurement is operationalized exclusively in the context of preprocessing as constructed d
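To make the idea of operationalizing data quality dimensions as constructed features concrete, the following is a minimal sketch in plain Python, not the framework implemented in the paper; the function name, the chosen indicator set, and the 3-sigma outlier rule are assumptions made for this example only:

```python
import math
import statistics

def quality_features(column):
    """Construct simple data-quality features for one column.

    Illustrative only: the indicator set and thresholds are
    assumptions for this sketch, not the paper's framework.
    """
    n = len(column)
    present = [v for v in column if v is not None]
    # incompleteness: share of missing values (operationalized as None)
    missing_ratio = 1 - len(present) / n
    # insufficient variance: population variance of the observed values
    variance = statistics.pvariance(present) if len(present) > 1 else 0.0
    # outliers: plain 3-sigma rule, one possible operationalization
    outliers = 0
    if variance > 0:
        mean = statistics.fmean(present)
        sd = math.sqrt(variance)
        outliers = sum(1 for v in present if abs(v - mean) > 3 * sd)
    return {"missing_ratio": missing_ratio,
            "variance": variance,
            "outliers": outliers}

# toy column: one missing value and one gross outlier
col = [1.0, 1.0, 1.0, 1.0, 1.0, None,
       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 100.0]
features = quality_features(col)
```

Per-column indicators of this kind could serve as the constructed features that feed visualizations of issue interdependencies; an IQR or 2-sigma rule would be an equally valid outlier operationalization.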