Transactions on Rough Sets XI


Introduction

For knowledge acquisition (or data mining) from data with numerical attributes, special techniques are applied [13]. Most frequently, an additional step called discretization is taken before the main step of rule induction or decision tree generation. In this preliminary step, numerical data are converted into symbolic data or, more precisely, the domain of a numerical attribute is partitioned into intervals. Many discretization techniques, based on principles such as equal interval width, equal interval frequency, minimal class entropy, minimum description length, clustering, etc., were explored, e.g., in [1,2,3,5,6,8,9,10,20,23,24,25,26], and [29]. Discretization algorithms that operate on the set of all attributes and do not use information about the decision (concept membership) are called unsupervised, as opposed to supervised algorithms, where the decision is taken into account [9]. Methods processing the entire attribute set are called global, while methods working on one attribute at a time are called local [8]. In all of these methods, discretization is a preprocessing step undertaken before the main process of knowledge acquisition. Another possibility is to discretize numerical attributes during the process of knowledge acquisition. Examples of such methods are MLEM2 [14] and MODLEM [21,31,32] for rule induction, and C4.5 [30] and CART [4] for decision tree generation.

J.W. Grzymala-Busse. In: J.F. Peters and A. Skowron (Eds.): Transactions on Rough Sets XI, LNCS 5946, pp. 1–13, 2010. © Springer-Verlag Berlin Heidelberg 2010

These algorithms deal with original, numerical data, and the processes of knowledge acquisition and discretization are conducted at the same time. The MLEM2 algorithm produces better rule sets, in terms of both simplicity and accuracy, than clustering methods [15]. However, discretization is an art rather than a science, and for a specific data set it is advantageous to use as many discretization algorithms as possible and then select the best approach. In this paper we present the MLEM2 algorithm, one of the most successful approaches to mining numerical data. This algorithm uses rough set theory and the calculus of attribute-value pair blocks. A similar approach is represented by MODLEM. Both the MLEM2 and MODLEM algorithms are outgrowths of the LEM2 algorithm. However, in MODLEM the most essential part, selecting the best attribute-value pair, is conducted using entropy or Laplacian conditions, while in MLEM2 this selection is based on the most relevant attribute-value pair, just as in the original LEM2. Additionally, we present experimental results comparing three commonly used discretization techniques, equal interval width, equal interval frequency, and minimal class entropy (all three combined with the LEM2 rule induction algorithm), with MLEM2. Our conclusion is that even though MLEM2 was most frequently the winner, the differences between all four data mining methods are statistically insignificant. A preliminary version of this paper was presented at the International Conference
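The three preprocessing-style discretization techniques entering the comparison above, equal interval width, equal interval frequency, and minimal class entropy, can be sketched as follows. This is a minimal illustration only: the function names are ours, and the minimal class entropy variant is simplified to choosing a single best cutpoint rather than a full recursive partition as in the cited implementations.

```python
import math
from collections import Counter

def equal_width_cutpoints(values, k):
    """Split the attribute's range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cutpoints(values, k):
    """Choose cutpoints so each interval holds roughly the same number of cases.
    (Ties in the data may make some intervals unequal; ignored in this sketch.)"""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        j = i * n // k
        # place the cutpoint halfway between two consecutive sorted values
        cuts.append((ordered[j - 1] + ordered[j]) / 2)
    return cuts

def class_entropy(decisions):
    """Shannon entropy of the class (decision) distribution."""
    n = len(decisions)
    return -sum((c / n) * math.log2(c / n) for c in Counter(decisions).values())

def min_entropy_cutpoint(values, decisions):
    """Supervised: pick the single cutpoint minimizing the weighted class
    entropy of the two resulting intervals (minimal class entropy principle)."""
    pairs = sorted(zip(values, decisions))
    xs = [v for v, _ in pairs]
    best_cut, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        if xs[i] == xs[i - 1]:
            continue  # candidate cutpoints lie between distinct attribute values
        cut = (xs[i] + xs[i - 1]) / 2
        left = [d for v, d in pairs if v < cut]
        right = [d for v, d in pairs if v >= cut]
        h = (len(left) * class_entropy(left)
             + len(right) * class_entropy(right)) / len(pairs)
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut

def discretize(value, cutpoints):
    """Map a numerical value to the index of its interval (a symbolic label)."""
    for i, c in enumerate(cutpoints):
        if value < c:
            return i
    return len(cutpoints)

# Hypothetical data: a numerical attribute and a binary decision.
temps = [35.2, 36.6, 37.0, 38.5, 39.4, 40.1]
flu = ["no", "no", "no", "yes", "yes", "yes"]
cuts_width = equal_width_cutpoints(temps, 3)
cuts_freq = equal_frequency_cutpoints(temps, 3)
cut_entropy = min_entropy_cutpoint(temps, flu)
```

The first two functions are unsupervised (they never look at the decision), while `min_entropy_cutpoint` is supervised, mirroring the distinction drawn in the introduction.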