A New Decision Tree Construction Using the Cloud Transform and Rough Sets




1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, P.R. China
2 Research Center for Secure Application in Networks and Communications, Southwest Jiaotong University, Chengdu 610031, P.R. China
3 Belgian Nuclear Research Centre (SCK•CEN), 2400 Mol, Belgium
4 Transportation Research Institute, Hasselt University, 3590 Diepenbeek, Belgium

Abstract. Many existing methods for handling continuous data and missing values when constructing decision trees from information systems do not perform well in practical applications. In this paper, a new algorithm, Decision Tree Construction based on the Cloud Transform and Rough Set Theory under Characteristic Relation (DTCCRSCR), is proposed for mining classification knowledge from data sets. The cloud transform is applied to discretize continuous data, and the attribute whose weighted mean roughness under the characteristic relation is the smallest is selected as the current splitting node. Experimental results show that, in most cases, the decision trees constructed by DTCCRSCR tend to have a simpler structure, much higher classification accuracy, and more understandable rules than those built by C5.0.

Keywords: Rough sets, Cloud transform, Decision trees, Weighted mean roughness, Characteristic relation.

1 Introduction

Decision trees are considered one of the most popular data-mining techniques for knowledge discovery. They systematically analyze the information contained in large data sources to extract valuable rules and relationships [1]. Many approaches for constructing decision trees have been presented. One representative method is the ID3 algorithm, which is based on information theory and attempts to minimize the expected number of comparisons [2]. The basic idea of this induction algorithm is that the attribute with the maximum information gain, that is, the greatest reduction in information entropy, is chosen as the current splitting node. C4.5 [3] and C5.0 [4], based on ID3, allow the use of missing data, continuous data and

This work is partially supported by NSFC (No.60074014), the Research Fund for the Doctoral Program of Higher Education (No.20060613007) and the Basic Science Foundation of Southwest Jiaotong University (No.2007B13).

G. Wang et al. (Eds.): RSKT 2008, LNAI 5009, pp. 524–531, 2008. © Springer-Verlag Berlin Heidelberg 2008


improved techniques for splitting. For example, when C4.5 builds a decision tree, continuous data are divided into ranges based on the attribute values, while missing data are simply ignored. To classify a record with a missing attribute value, the value for that item can be predicted from the attribute values of other records [1]. However, the existing algorithms for handling continuous data and missing values in information systems do not perform well in real applications. Classical rough set theory (RST), proposed by Pawlak in 1982, is a mathematical tool to deal with vag