Combining Apriori Approach with Support-Based Count Technique to Cluster the Web Documents


Abstract The dynamic Web, where thousands of pages are updated every second, is growing at lightning speed. Hence, retrieving the required Web documents in very little time is becoming a challenging task for present-day search engines. Clustering, an important data mining technique, can shed light on this problem, and the association technique of data mining plays a vital role in clustering Web documents. This paper is an effort in that direction and proposes two techniques: (1) a new feature selection technique, named term-term correlation, which reduces the size of the corpus by eliminating noisy and redundant features; and (2) a novel technique, named Support Based Count (SBC), which is combined with the traditional Apriori approach to cluster Web documents. Empirical results on two benchmark datasets show that the proposed approach is more promising than traditional clustering approaches.

Keywords Apriori ⋅ Cluster ⋅ Fuzzy ⋅ K-means ⋅ Support based count

R.K. Roul (✉) ⋅ S.K. Sahay
BITS-Pilani, K.K. Birla Goa Campus, Sancoale, India
e-mail: [email protected]
S.K. Sahay
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2017
H.S. Behera and D.P. Mohapatra (eds.), Computational Intelligence in Data Mining, Advances in Intelligent Systems and Computing 556, DOI 10.1007/978-981-10-3874-7_12

1 Introduction The World Wide Web (WWW) is the most important place for Information Retrieval (IR). The tremendous exponential growth of the WWW makes it difficult for the end user to find the desired results through a search engine. Since the inception of the WWW, the amount of data on the Web has expanded manyfold, and its size doubles every 6–10 months. Hundreds of millions of users submit queries to Web search engines each day. According to Spin et al. [1], queries of length one (monograms) are submitted by 48.4% of all users, queries of length two (bigrams) by 20.8%, and queries of length three or more by only 31%. The authors also mention that 50% of all Internet users never look beyond the first two pages of the returned results: the first page is seen by 65–70% of users, the second page by 20–25%, and the remaining results by only 3–4%. A similar survey was carried out by W.B. Croft [2]. The main challenge for a search engine is how to satisfy the user's request efficiently. Clustering is a powerful data mining technique that can help in this direction by grouping similar documents together, and it has therefore attracted many research ideas [3–9]. In general, the results returned by a search engine are not clustered automatically. To deliver the desired information efficiently, the search results need to be grouped into different clusters. Association-based techniques such as the Apriori algorithm [10] can be used for Web document clustering. Similarity between the documents is decided based on the association of the ke
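To make the idea of association-based clustering concrete, here is a minimal sketch of the classical Apriori procedure applied to documents, where each document is treated as a transaction of terms. It is only an illustration of standard Apriori, not the Support Based Count technique proposed in the paper. The function name `apriori`, the toy documents in `docs`, and the threshold `min_support=2` are all hypothetical choices made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(terms): support_count} for every frequent term set.

    Hypothetical helper for illustration: plain level-wise Apriori,
    not the SBC variant described in the paper.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many documents contain each single term.
    counts = {}
    for t in transactions:
        for term in t:
            key = frozenset([term])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unite pairs of frequent (k-1)-sets into k-set candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: keep candidates contained in enough documents.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        k += 1
    return frequent


# Toy corpus (assumed data): document id -> set of terms.
docs = {
    "d1": {"web", "search", "engine"},
    "d2": {"web", "search", "cluster"},
    "d3": {"web", "cluster", "apriori"},
    "d4": {"search", "engine", "web"},
}

frequent = apriori(docs.values(), min_support=2)

# Documents that share a frequent term set can be grouped together.
largest = max(frequent, key=len)
group = sorted(d for d, terms in docs.items() if largest <= terms)
```

In this toy corpus the largest frequent term set is {web, search, engine}, so `group` collects the documents that contain all three terms. A real pipeline would first extract and filter terms from the Web documents, a step this sketch leaves out.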
