Efficient Clustering of Databases Induced by Local Patterns


To answer queries posed over multiple large databases, it may be necessary to mine the relevant databases en bloc. In this chapter, we present an efficient solution to clustering multiple large databases. We present two measures of similarity between a pair of databases and study their main properties. We then design an algorithm for clustering multiple databases based on an introduced similarity measure. We also present a coding, referred to as IS coding, that represents itemsets space-efficiently. A coding of this nature enables a larger number of frequent itemsets to participate in determining the similarity between two databases, so the resulting clustering process becomes more accurate. We also show that IS coding attains maximum efficiency in most cases of the mining process. The clustering algorithm improves on existing clustering algorithms in terms of time complexity. Overall, the efficiency of the clustering process is improved through three strategies: reducing the execution time of the clustering algorithm, using a more suitable similarity measure, and storing frequent itemsets space-efficiently.

A. Adhikari et al., Developing Multi-database Mining Applications, Advanced Information and Knowledge Processing, DOI 10.1007/978-1-84996-044-1_6, © Springer-Verlag London Limited 2010

6.1 Introduction

Effective data analysis using traditional data mining techniques on multi-gigabyte repositories has proven difficult. For many decision support applications, a quick discovery of approximate knowledge from large databases would be adequate. As before, let us consider a company that deals with multiple large databases. The company might need to carry out an association analysis involving non-profit-making items (products). The ultimate objective is to identify the items that neither make much profit nor help promote other products. The company could then stop dealing with such items. An analysis of this nature might require identifying similar databases. Two databases are deemed similar if they contain many similar transactions, and two transactions are similar if they have many items in common. We observe later that two databases containing many common items are not necessarily very similar. First, let us define a few concepts used frequently in this chapter.
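The remark that two databases sharing many items need not be similar can be illustrated with a toy sketch. The databases `d1` and `d2`, the helper names, and the use of the Jaccard ratio are assumptions made only for this illustration; they are not the similarity measures developed in this chapter.

```python
def items(db):
    """Return the set of all items appearing in database db."""
    return set().union(*db) if db else set()


def jaccard(a, b):
    """Shared elements divided by all elements (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical databases, each a list of transactions (sets of items).
d1 = [{"a", "b"}, {"c", "d"}]
d2 = [{"a", "c"}, {"b", "d"}]

# Comparing only the item sets: both databases use exactly the same items.
item_overlap = jaccard(items(d1), items(d2))

# Comparing whole transactions: no transaction occurs in both databases.
t1 = {frozenset(t) for t in d1}
t2 = {frozenset(t) for t in d2}
transaction_overlap = jaccard(t1, t2)

print(item_overlap, transaction_overlap)
```

Here the two databases have identical item sets, yet they share no transaction, so a comparison based on items alone would overstate how alike they are.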


Let I(D) be the set of items in database D. An itemset is a set of items in a database. An itemset X in D is associated with a statistical measure called support (Agrawal et al. 1993), denoted by supp(X, D), for X ⊆ I(D). Support of an itemset X in D is the fraction of transactions in D containing X. The importance of an itemset could be judged by quantifying its support. X is called a frequent itemset (FIS) in D if supp(X, D) ≥ α, where α is the user-defined minim