Semantic Clustering of Website Based on Its Hypertext Structure



1 Saratov State Technical University, 410054 Saratov, Russia
[email protected], [email protected], [email protected]
2 Universität Leipzig, AKSW/BIS, PO BOX 100920, 04009 Leipzig, Germany
{ivan.ermilov,speck}@informatik.uni-leipzig.de
3 Universität Bonn, CS/EIS, Römerstraße 164, 53117 Bonn, Germany
[email protected]

Abstract. The volume of unstructured information presented on the Internet is constantly increasing, together with the total number of websites and their contents. To process this vast amount of information, it is important to distinguish different clusters of related webpages. Such clusters are used, for example, for knowledge extraction, named entity recognition, and recommendation algorithms. A variety of applications (such as semantic analysis systems, crawlers, and search engines) utilize semantic clustering algorithms to recognize thematically connected webpages. The majority of them rely on text analysis of the web documents' content, which leads to certain limitations, such as long processing time, the need for representative text content, and the vagueness of natural language. In this article, we present a framework for unsupervised, domain- and language-independent semantic clustering of a website, which utilizes the website's internal hypertext structure and does not require text analysis. As a basis, we represent the hypertext structure as a graph and apply known flow simulation clustering algorithms to the graph to produce a set of webpage clusters. We assume these clusters contain thematically connected webpages. We evaluate our clustering approach on a corpus of real-world webpages and compare it with well-known text document clustering algorithms.
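The idea of clustering a link graph by flow simulation can be illustrated with a minimal sketch. It assumes the Markov Cluster (MCL) algorithm as one well-known flow simulation method; the text above does not name the specific algorithm, so this choice is an assumption. Treating hyperlinks as undirected edges, adding self-loops, the parameter values, and the names `markov_cluster` and `links` are likewise illustrative assumptions and not the authors' implementation.

```python
from typing import Dict, FrozenSet, List, Set


def normalize_columns(m: List[List[float]]) -> None:
    """Scale every column in place so that it sums to 1 (column-stochastic)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total


def matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Plain square-matrix product; adequate for a small illustrative graph."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(links: Dict[str, List[str]], inflation: float = 2.0,
                   max_iterations: int = 50,
                   tolerance: float = 1e-6) -> Set[FrozenSet[str]]:
    """Cluster pages using only their link graph (no text analysis)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)

    # Adjacency matrix with self-loops; links are treated as undirected (assumption).
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for source, targets in links.items():
        for target in targets:
            m[index[source]][index[target]] = 1.0
            m[index[target]][index[source]] = 1.0
    normalize_columns(m)

    for _ in range(max_iterations):
        expanded = matmul(m, m)  # expansion: flow spreads along two-step walks
        inflated = [[v ** inflation for v in row] for row in expanded]
        normalize_columns(inflated)  # inflation: strong flows grow, weak ones fade
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tolerance:
            break

    # Each non-empty row belongs to an attractor; its non-zero columns form a cluster.
    clusters: Set[FrozenSet[str]] = set()
    for row in m:
        members = frozenset(pages[j] for j, v in enumerate(row) if v > 1e-3)
        if members:
            clusters.add(members)
    return clusters


# Hypothetical site: two densely interlinked groups of pages joined by one link.
links = {
    "news/a": ["news/b", "news/c"],
    "news/b": ["news/c"],
    "news/c": ["shop/d"],
    "shop/d": ["shop/e", "shop/f"],
    "shop/e": ["shop/f"],
}
clusters = markov_cluster(links)
```

In this sketch the `inflation` parameter controls granularity: higher values tend to produce more, smaller clusters. The dense matrix product is only suitable for small graphs; a real website would call for a sparse representation.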

1 Introduction

The volume of unstructured information presented on the Internet in human-readable form is constantly increasing, together with the total number of websites and their contents. Technologies for extracting, analyzing, automatically accessing, and processing data are becoming increasingly important as the Web continues to grow. However, finding and analyzing the information relevant to a problem at hand within such a vast amount of data is still a major challenge on the Web. When processing such volumes of information, it is important to distinguish collections of thematically connected webpages. Webpage clustering is widely used in a variety of web data extraction applications, for example, for knowledge extraction, search results representation, or recommendation algorithms (e.g. [5,10,11,13]). The majority of such applications use clustering methods based on text analysis, treating webpages as ordinary text documents and setting aside their hypertext attributes. This leads to well-known limitations of text analysis techniques (which vary between algorithms), e.g. the polysemy-capturing problem, limitations

© Springer International Publishing Switzerland 2015. P. Klinov and D. Mouromtsev (Eds.): KESW 2015, CCIS 518, pp. 182–194, 2015. DOI: 10.1007/978-3-319-24543-0_14
