A novel discretization algorithm based on multi-scale and information entropy
Yaling Xun1 · Qingxia Yin1 · Jifu Zhang1 · Haifeng Yang1 · Xiaohui Cui1
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Discretization is a data preprocessing topic in the field of data mining and is critical to improving the efficiency and quality of data mining. Multi-scale analysis can reveal the structure and hierarchical characteristics of data objects: a reasonable hierarchical division of a research object yields representations of the data at different granularities. The multi-scale theory is introduced into the process of data discretization, and a data discretization method based on multi-scale and information entropy, called MSE, is proposed. MSE first conducts scale partition on the domain attribute to obtain candidate cut point sets of different granularity. Information entropy is then applied to the candidate cut point set: the candidate cut point with the minimum information entropy is selected and tested in turn, using the MDLPC criterion, to determine the final cut point set. In this way, MSE avoids a limitation of traditional discretization methods, in which candidate cut points are restricted to a limited set of attribute values because only the statistical attribute values are considered, and it reduces the number of candidates by controlling the data division hierarchy to an optimal range. Extensive experiments show that MSE achieves high performance in terms of discretization efficiency and classification accuracy, especially when it is applied to support vector machines, random forests, and decision trees.
Keywords Data mining · Discretization · Information entropy · Multi-scale · MDLPC criterion
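The entropy-plus-MDLPC step described in the abstract can be illustrated with a minimal sketch. It is not the authors' MSE implementation: the multi-scale partition that produces the candidates is not reproduced here, so the candidate cut points are simply passed in as a list. The acceptance test is the classic Fayyad and Irani MDLPC criterion, and the function names (`entropy`, `split_entropy`, `mdlpc_accepts`, `select_cuts`) are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(values, labels, cut):
    """Weighted class entropy of the two-way partition induced by `cut`."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return weighted, left, right


def mdlpc_accepts(labels, left, right, weighted):
    """Fayyad-Irani MDLPC test: accept a cut only if its gain exceeds its coding cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    gain = entropy(labels) - weighted
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def select_cuts(values, labels, candidates):
    """Recursively keep minimum-entropy candidate cuts that pass the MDLPC test.

    Assumption: `candidates` is supplied by the caller (in MSE it would come
    from the multi-scale partition, which this sketch does not implement).
    """
    best = None
    for cut in candidates:
        weighted, left, right = split_entropy(values, labels, cut)
        if not left or not right:
            continue  # this candidate does not split the current interval
        if best is None or weighted < best[0]:
            best = (weighted, cut, left, right)
    if best is None:
        return []
    weighted, cut, left, right = best
    if not mdlpc_accepts(labels, left, right, weighted):
        return []
    lo = [(x, y) for x, y in zip(values, labels) if x <= cut]
    hi = [(x, y) for x, y in zip(values, labels) if x > cut]
    lo_cuts = select_cuts([x for x, _ in lo], [y for _, y in lo], candidates)
    hi_cuts = select_cuts([x for x, _ in hi], [y for _, y in hi], candidates)
    return sorted(lo_cuts + [cut] + hi_cuts)


if __name__ == "__main__":
    values = list(range(1, 13))
    labels = ["a"] * 6 + ["b"] * 6
    print(select_cuts(values, labels, [3.5, 6.5, 9.5]))
```

In this toy data the two classes separate cleanly at 6.5, so only that candidate survives; the other candidates yield no information gain inside the already pure intervals and are rejected by the criterion.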
1 Taiyuan University of Science and Technology (TYUST), Taiyuan, Shanxi, 030024, China

1 Introduction
Data discretization is a data preprocessing method in the field of data mining and knowledge discovery that transforms quantitative data into qualitative data by dividing continuous domains [35]. For data mining and machine learning, discretizing continuous attributes can effectively reduce the granularity of the information system, which improves the performance and learning accuracy of data mining and machine learning algorithms and enhances their classification, clustering, and anti-noise abilities. In addition, many machine learning and data mining algorithms can only deal with discrete attributes, for example C4.5/C5.0 decision trees [26], association rules [32, 33], Naive Bayes [34], and rough sets [31]. In essence, data discretization is a data reduction mechanism: continuous data is grouped into discrete intervals, while the correlation between each discrete value and a certain interval is preserved. Therefore, data discretization can effectively hide the defects in origi