Data Quality Visualization for Preprocessing
Abstract. Preprocessing is often the most time-consuming phase in data analysis, and interdependent data quality issues are a cause of suboptimal modelling results. The design problem addressed in this paper is: what kind of framework can support visualization of data quality issue interdependencies for faster and more effective preprocessing? An object framework was designed that uses constructed features as the basis of visualizations. Six real datasets from the business performance measurement system domain were acquired to demonstrate the implementation. The framework was found to be a viable preprocessing analysis supplement to both the industry practice of exploratory data analysis and the research benchmark of preprocessing combinations.

Keywords: Preprocessing · Data quality visualization · Feature construction
1 Introduction

Preprocessing can take as much as 85 % of the duration of a knowledge discovery project [35]. Although the importance of good data quality practices has long been recognized in business [46], preprocessing a given dataset and its possible data quality issues, such as missing values, outliers, noise, insufficient variance, duplicates, and class imbalance, often requires laborious manual work. Preprocessing can also be a cause of suboptimal results in knowledge discovery tasks. Several papers [12, 24, 45, 48, 49] envision that data preprocessing should have tools specifically designed for it, which would report data quality observations, guide the preprocessing steps needed, and evaluate preprocessing outcomes.

From the knowledge discovery research point of view, unrecognized or unresolved data quality problems add to the uncertainty of research findings. The complexity of data production, acquisition, and integration processes demands recognition of multiple data quality dimensions simultaneously. Only a few of the dimensions, such as incompleteness, have computational operationalizations (e.g. labeling missing values, see [3]), and there is no solid theoretical guidance for operationalizing data quality dimensions for preprocessing. Most importantly, data quality issues can be interdependent, and a method for empirical identification of these interdependencies is a gap in current knowledge.

The design problem addressed in this paper is: what kind of framework can support visualization of data quality issue interdependencies for faster and more effective preprocessing? The design problem is operationalized through its elements. First, the focus is on visualization. Secondly, data quality is understood either as compliance to basic

© Springer International Publishing Switzerland 2016
P. Perner (Ed.): ICDM 2016, LNAI 9728, pp. 428–437, 2016. DOI: 10.1007/978-3-319-41561-1_32
data quality requirements (e.g. regarding the acceptable amount of missing values) or fitness for purpose in knowledge discovery tasks (e.g. classification accuracy achieved with the data). Data quality measurement is operationalized exclusively in the context of preprocessing as constructed d
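To make the idea of operationalizing data quality dimensions as constructed features concrete, the following is a minimal sketch in plain Python, not the framework implemented in the paper; the function name, the chosen indicator set, and the 3-sigma outlier rule are assumptions made for this example only:

```python
import math
import statistics

def quality_features(column):
    """Construct simple data-quality features for one column.

    Illustrative only: the indicator set and thresholds are
    assumptions for this sketch, not the paper's framework.
    """
    n = len(column)
    present = [v for v in column if v is not None]
    # incompleteness: share of missing values (operationalized as None)
    missing_ratio = 1 - len(present) / n
    # insufficient variance: population variance of the observed values
    variance = statistics.pvariance(present) if len(present) > 1 else 0.0
    # outliers: plain 3-sigma rule, one possible operationalization
    outliers = 0
    if variance > 0:
        mean = statistics.fmean(present)
        sd = math.sqrt(variance)
        outliers = sum(1 for v in present if abs(v - mean) > 3 * sd)
    return {"missing_ratio": missing_ratio,
            "variance": variance,
            "outliers": outliers}

# toy column: one missing value and one gross outlier
col = [1.0, 1.0, 1.0, 1.0, 1.0, None,
       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 100.0]
features = quality_features(col)
```

Per-column indicators of this kind could serve as the constructed features that feed visualizations of issue interdependencies; an IQR or 2-sigma rule would be an equally valid outlier operationalization.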