Combining Apriori Approach with Support-Based Count Technique to Cluster the Web Documents
Abstract The dynamic Web, where thousands of pages are updated every second, is growing at lightning speed. Hence, retrieving the required Web documents in a short time is becoming a challenging task for present-day search engines. Clustering, an important data mining technique, can shed light on this problem, and the association technique of data mining plays a vital role in clustering Web documents. This paper is an effort in that direction and proposes the following: (1) a new feature selection technique, named term-term correlation, which reduces the size of the corpus by eliminating noisy and redundant features; and (2) a novel technique, named Support Based Count (SBC), which is combined with the traditional Apriori approach to cluster Web documents. Empirical results on two benchmark datasets show that the proposed approach is more promising than traditional clustering approaches.
Keywords Apriori ⋅ Cluster ⋅ Fuzzy ⋅ K-means ⋅ Support based count
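As a rough illustration of the frequent-itemset idea behind the Apriori approach mentioned in the abstract, the sketch below treats each document as a set of terms and finds the term sets that occur in at least `min_support` documents. It is a generic textbook-style Apriori, not the SBC technique proposed in the paper; the function name `apriori`, the toy `docs` corpus, and the threshold value are assumptions made only for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Generic Apriori sketch (hypothetical helper, not the paper's SBC method).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate itemset,
        # then keep only candidates that meet the support threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Frequent 1-itemsets.
    current = count({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join step: union pairs of frequent (k-1)-itemsets that yield a k-itemset.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Toy corpus (assumed data): each document is reduced to a set of terms.
docs = [
    {"web", "search", "cluster"},
    {"web", "search", "engine"},
    {"web", "cluster", "apriori"},
    {"search", "engine", "query"},
]
frequent = apriori(docs, min_support=2)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Documents that share a frequent term set could then be treated as candidates for the same cluster; how the paper actually combines such term sets with SBC is not described in this excerpt.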
R.K. Roul (✉) ⋅ S.K. Sahay
BITS-Pilani, K.K. Birla Goa Campus, Sancoale, India
e-mail: [email protected]
S.K. Sahay
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2017
H.S. Behera and D.P. Mohapatra (eds.), Computational Intelligence in Data Mining, Advances in Intelligent Systems and Computing 556, DOI 10.1007/978-981-10-3874-7_12

1 Introduction
The World Wide Web (WWW) is the most important place for Information Retrieval (IR). The tremendous, exponential growth of the WWW makes it difficult for end users to find the desired results through a search engine. Since the inception of the WWW, the amount of data on the Web has expanded manyfold, and its size is doubling every 6–10 months. Hundreds of millions of users submit queries to Web search engines each day. According to Spin et al. [1], queries of length one (monograms) are submitted by 48.4% of all users, queries of length two (bigrams) by 20.8%, and queries of length three or more by only 31%. The authors also report that 50% of all Internet users never look beyond the first two pages of the returned results; the first page is seen by 65–70% of users, the second page by 20–25%, and the remaining results by only 3–4%. A similar survey was carried out by W.B. Croft [2]. The main challenge for a search engine is how to satisfy user requests efficiently. Clustering is a powerful data mining technique that can help in this direction by grouping similar documents together, and it has attracted many research ideas [3–9]. In general, the results returned by a search engine are not clustered automatically. To retrieve the desired information efficiently, the search results need to be grouped into clusters. Association-based techniques such as the Apriori algorithm [10] can be used for Web document clustering. Similarity between the documents is decided based on the association of the ke