Data Preprocessing and Data Mining as Generalization


Department of Computer Science, State University of New York, Stony Brook, NY, USA
Departamento de Lenguajes y Sistemas Informaticos, Facultad de Informatica, U.P.M, Madrid, Spain

Summary. We present an abstract model in which the data preprocessing and data mining proper stages of the Data Mining process are described as two different types of generalization. In the model, data mining and data preprocessing algorithms are defined as certain generalization operators. We use our framework to show that only three Data Mining operators (the classification, clustering, and association operators) are needed to express all Data Mining algorithms for classification, clustering, and association, respectively. We are also able to show formally that the generalization that occurs in the preprocessing stage is different from the generalization inherent in the data mining proper stage.

1 Introduction

We build models in order to address formally notions that are expressed intuitively, or to answer questions that are formulated intuitively. We say, for example, that Data Mining generalizes data by transforming them into more general information. But what, in fact, is a generalization? When is a transformation of data a generalization, and when is it not? How does one kind of generalization differ from another? The model presented here addresses and answers these questions, even if only partially.

There are many data mining algorithms and thousands of implementations. Natural questions arise: why are very different algorithms all called, for example, classification algorithms? What do they have in common? How do they differ from other algorithms? We hence build our model in such a way as to be able to define characteristics that are common to one type of algorithm and not to the other types.

We present here three models: the generalization model (Definition 1) and, as its particular cases, the data mining model (Definition 12) and the preprocessing model (Definition 26).

A. Wasilewska and E. Menasalvas: Data Preprocessing and Data Mining as Generalization, Studies in Computational Intelligence (SCI) 118, 469–484 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com

In the data mining model, each class of data mining algorithms is represented by an operator. These operators are also generalization operators of the generalization model, i.e. they capture formally the intuitive notion of information generalization. Moreover, we show (Theorem 4) that the operators belonging to one category are distinct from those of the other categories. The generalization model presented here is an extension of the model presented in [16]; preliminary versions of the data mining and preprocessing models were presented in [13, 14], respectively. We usually view Data Mining results and present them to the user in their descriptive form, as it is the most natural form of communication. But the Data Mining process is deeply semantic in its nature. The algorithms process records (semantics) finding similarities