SICE: an improved missing data imputation technique


Open Access

RESEARCH

Shahidul Islam Khan1,2* and Abu Sayed Md Latiful Hoque1

*Correspondence: [email protected]
1 Department of CSE, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Full list of author information is available at the end of the article

Abstract  In data analytics, missing data degrades performance, and incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and making use of these data is a major concern for stakeholders, handling missing values efficiently becomes even more important. In this paper, we propose a new technique for missing data imputation that is a hybrid of single and multiple imputation techniques. It extends the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations, one to impute categorical data and one to impute numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh while maintaining the privacy of the data, and three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with the existing algorithms on these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.

Keywords:  Missing Data Imputation, Single Imputation, Multiple Imputation, MICE, Data Analytics
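To make the chained-equation idea behind MICE concrete, the following is a minimal sketch and not the SICE algorithm proposed in this paper. It assumes a toy dataset of two numeric columns, a plain least-squares line as the per-column model, and deterministic predictions; full MICE instead draws imputations from a predictive distribution and produces several imputed datasets. The names `fit_line` and `chained_impute` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def chained_impute(rows, n_iter=10):
    """Fill None cells in a two-column numeric table by chained regression.

    Assumption: exactly two columns, each with at least one observed value.
    The input is not modified; a filled copy is returned.
    """
    n_cols = 2
    missing = [[v is None for v in row] for row in rows]

    # Step 1: start every missing cell at the mean of its column.
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    data = [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

    # Step 2: cycle through the columns. Each column in turn is regressed
    # on the other one, using only rows where the target was observed,
    # and its missing cells are overwritten with the model's predictions.
    for _ in range(n_iter):
        for target in range(n_cols):
            pred = 1 - target
            train = [
                (r[pred], r[target])
                for r, m in zip(data, missing)
                if not m[target]
            ]
            a, b = fit_line([p for p, _ in train], [t for _, t in train])
            for r, m in zip(data, missing):
                if m[target]:
                    r[target] = a + b * r[pred]
    return data


if __name__ == "__main__":
    table = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [None, 8.0], [6.0, 12.0]]
    for row in chained_impute(table):
        print(row)
```

In this toy table the complete rows satisfy y = 2x exactly, so the two filled cells settle close to y = 6 and x = 4 after a few cycles. The mean-fill starting point is a single-imputation step, and the repeated cycling over columns is the chained part.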

Introduction

In the past few years, the generation of digital data has increased swiftly, along with the rapid development of computational power. Together, these make it possible to extract novel insights from massive datasets, known as big data. In disciplines such as healthcare, banking, e-commerce, and finance, data analysts are working to discover hidden knowledge in vast volumes of data [1, 2]. Data quality is a significant concern for them, because fruitful analytics depends on it. Although the outcome of a data analysis task depends on several factors, such as attribute selection, algorithm selection, and sampling technique, a key dependency is the efficient handling of missing values [3, 4]. Machine learning and data mining algorithms are widely used to predict outcomes from large datasets. These algorithms usually make sound predictions unless the data used to train them are flawed. An essential step of

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Common