A novel discretization algorithm based on multi-scale and information entropy


Yaling Xun1 · Qingxia Yin1 · Jifu Zhang1 · Haifeng Yang1 · Xiaohui Cui1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Discretization is one of the data preprocessing topics in the field of data mining, and it is critical to improving the efficiency and quality of data mining. Multi-scale analysis can reveal the structure and hierarchical characteristics of data objects: a reasonable hierarchical division of a research object yields representations of the data at different granularities. This paper introduces multi-scale theory into the process of data discretization and proposes MSE, a data discretization method based on multi-scale and information entropy. MSE first conducts scale partition on the domain attribute to obtain candidate cut point sets with different granularities. Information entropy is then applied to the candidate cut point set: the candidate cut point with the minimum information entropy is selected and tested in turn, and the MDLPC criterion determines the final cut point set. In this way, MSE avoids a limitation of traditional discretization methods, in which candidate cut points are restricted to a limited set of attribute values because only the statistical attribute values are considered, and it reduces the number of candidates by controlling the data division hierarchy to an optimal range. Extensive experiments show that MSE achieves high performance in terms of discretization efficiency and classification accuracy, especially when it is applied to support vector machines, random forests, and decision trees.

Keywords Data mining · Discretization · Information entropy · Multi-scale · MDLPC criterion
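The pipeline summarized in the abstract (candidate cut points from a scale partition, minimum-entropy selection, MDLPC acceptance) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the scale partition is performed, so `multiscale_candidates` is a hypothetical stand-in that uses dyadic equal-width splits, and the recursive `discretize` follows the standard Fayyad-Irani use of the MDLPC criterion, which may differ from the exact procedure in MSE. All function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon class entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def multiscale_candidates(lo, hi, levels):
    """ASSUMPTION: hypothetical scale partition. Split [lo, hi] into
    2, 4, ..., 2**levels equal-width intervals and collect the interior
    boundaries, so finer levels add finer-grained candidates."""
    cuts = set()
    for level in range(1, levels + 1):
        parts = 2 ** level
        for i in range(1, parts):
            cuts.add(lo + (hi - lo) * i / parts)
    return sorted(cuts)


def split(values, labels, cut):
    """Class labels of the samples on each side of the cut."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    return left, right


def best_cut(values, labels, candidates):
    """Candidate with minimum weighted class entropy, or None if no
    candidate separates the samples into two non-empty parts."""
    best = None
    for cut in candidates:
        left, right = split(values, labels, cut)
        if not left or not right:
            continue
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or e < best[1]:
            best = (cut, e)
    return None if best is None else best[0]


def mdlpc_accepts(values, labels, cut):
    """Fayyad-Irani MDLPC test: accept the cut only if its information
    gain exceeds (log2(N - 1) + delta) / N."""
    left, right = split(values, labels, cut)
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    ent, ent1, ent2 = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) * ent1 + len(right) * ent2) / n
    delta = math.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    return gain > (math.log2(n - 1) + delta) / n


def discretize(values, labels, candidates):
    """Recursively pick minimum-entropy cuts until MDLPC rejects one."""
    cut = best_cut(values, labels, candidates)
    if cut is None or not mdlpc_accepts(values, labels, cut):
        return []
    lo = [(x, y) for x, y in zip(values, labels) if x <= cut]
    hi = [(x, y) for x, y in zip(values, labels) if x > cut]
    lo_cuts = discretize([x for x, _ in lo], [y for _, y in lo],
                         [c for c in candidates if c < cut])
    hi_cuts = discretize([x for x, _ in hi], [y for _, y in hi],
                         [c for c in candidates if c > cut])
    return lo_cuts + [cut] + hi_cuts
```

For example, `discretize(values, labels, multiscale_candidates(min(values), max(values), 3))` returns the accepted cut points in ascending order. Because the candidates come from the partition grid rather than from the observed values, a cut can fall between or away from the attribute values present in the data, which is the property the abstract highlights.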

1 Taiyuan University of Science and Technology (TYUST), Taiyuan, Shanxi, 030024, China

1 Introduction

Data discretization is one of the data preprocessing methods in the field of data mining and knowledge discovery; it transforms quantitative data into qualitative data by dividing continuous domains [35]. For data mining and machine learning, discretizing continuous attributes can effectively reduce the granularity of the information system, improve the performance and learning accuracy of data mining and machine learning algorithms, and enhance their classification, clustering, and noise-resistance abilities. In addition, many machine learning and data mining algorithms can only deal with discrete attributes, for example, C4.5/C5.0 decision trees [26], association rules [32, 33], Naive Bayes [34], and rough sets [31]. In essence, data discretization is a data reduction mechanism. Continuous data is grouped into discrete intervals, while the correspondence between each discrete value and a certain interval is preserved. Therefore, data discretization can effectively hide the defects in origi
