Automatic Summarization of Web Page Based on Statistics and Structure


Engineering College of Management, Huazhong University of Science & Technology, Wuhan, Hubei 430074, China
2 Net Center of Henan University, Kaifeng 475001, China
[email protected]

Abstract. This paper discusses the automatic extraction of text information and presents an automatic summarization method based on the analysis of HTML tags and statistics. The method combines summary extraction with Web page structure. We obtain the structural levels of a document from its HTML tags and calculate sentence weights from the text structure and word-frequency statistics in order to extract the summary. Our experimental results indicate that this method captures the main content of the document accurately and completely, with high accuracy and recall rates.

Keywords: Automatic summarization, Web page summary, Web page structure, Statistics.

1 Introduction

Summarization provides a broad outline of information without comments or additional explanation, stating the important content concisely and accurately. Automatic summarization is the process of compiling and generating a summary automatically by computer. Existing automatic text summarization systems fall into two categories: statistics-based text summarization and knowledge-based text summarization [1]. Statistics-based summarization is simple and easy to implement, but its results are often unsatisfactory [2]. Knowledge-based summarization relies on understanding the text and gives better results, but it is much more difficult to realize. For Web pages, document summarization should serve all kinds of users and should not be limited to particular fields. The summary should cover enough of the content to state the outline of the Web page accurately and completely, and it must be generated quickly enough to process a large number of Web documents.

HTML tags divide a Web page into different structures, which represent different degrees of importance. For example, <title> is the headline of the Web page; <body> is the main content; <p> is the paragraph tag; <meta> contains keywords or the abstract. Words set in a normal typeface are less important than words set in a special typeface. Based on the analysis of the HTML tags in a Web page, combined with the calculation of keyword weights, we design a method to extract the summary of the Web page.

H. Tan (Ed.): Knowledge Discovery and Data Mining, AISC 135, pp. 643–649. springerlink.com © Springer-Verlag Berlin Heidelberg 2012

S. Zheng and J. Yu

This paper summarizes Web documents automatically by combining HTML tag analysis with statistics. First, we analyze the HTML tags of the document to obtain the paragraph information and all levels of subheadings; then we extract the document's keywords and key sentences using statistical methods and heuristic rules; finally, we generate the summary of the document after eliminating redundancy among the key sentences.
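The three steps above (tag analysis, statistical weighting, redundancy elimination) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The boost values, the stopword list, the length-normalized sentence score, the Jaccard-overlap redundancy test, and all names (`PageStructure`, `summarize`, `SAMPLE_HTML`) are assumptions chosen only for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed structure weights: words that also occur in a heading count more.
HEADING_BOOST = {"title": 3.0, "h1": 2.5, "h2": 2.0, "h3": 1.5}
EMPHASIS_TAGS = {"b", "strong", "em", "i"}   # "special type" words
EMPHASIS_BOOST = 1.5                          # assumed value
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "for", "on", "was", "this", "that", "with"}


def words(text):
    """Lowercased content words of a text, stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


class PageStructure(HTMLParser):
    """Step 1: collect headings, paragraphs and emphasized words from the tags."""

    def __init__(self):
        super().__init__()
        self.block, self.buffer, self.emphasis_depth = None, [], 0
        self.headings, self.paragraphs, self.emphasized = [], [], set()

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_BOOST or tag == "p":
            self._flush()                     # tolerate unclosed blocks
            self.block = tag
        elif tag in EMPHASIS_TAGS:
            self.emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag == self.block:
            self._flush()
        elif tag in EMPHASIS_TAGS and self.emphasis_depth:
            self.emphasis_depth -= 1

    def handle_data(self, data):
        if self.block:
            self.buffer.append(data)
            if self.emphasis_depth:
                self.emphasized.update(words(data))

    def _flush(self):
        text = " ".join("".join(self.buffer).split())
        if text and self.block == "p":
            self.paragraphs.append(text)
        elif text and self.block:
            self.headings.append((self.block, text))
        self.block, self.buffer = None, []


def jaccard(a, b):
    return len(a & b) / max(1, len(a | b))


def summarize(html, max_sentences=2, max_overlap=0.5):
    page = PageStructure()
    page.feed(html)
    page.close()

    # Step 2: word frequency (statistics) scaled by tag-based boosts (structure).
    freq = Counter(w for p in page.paragraphs for w in words(p))
    boost = {w: EMPHASIS_BOOST for w in page.emphasized}
    for tag, text in page.headings:
        for w in words(text):
            boost[w] = max(boost.get(w, 1.0), HEADING_BOOST[tag])

    sentences = [s for p in page.paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p) if s]
    bags = [set(words(s)) for s in sentences]

    def score(i):
        ws = words(sentences[i])
        return sum(freq[w] * boost.get(w, 1.0) for w in ws) / len(ws) if ws else 0.0

    # Step 3: take the best sentences, skipping near-duplicates of chosen ones.
    chosen = []
    for i in sorted(range(len(sentences)), key=score, reverse=True):
        if bags[i] and all(jaccard(bags[i], bags[j]) <= max_overlap for j in chosen):
            chosen.append(i)
        if len(chosen) == max_sentences:
            break
    return [sentences[i] for i in sorted(chosen)]   # restore document order


SAMPLE_HTML = """<html><head><title>Web page summarization</title></head><body>
<h1>Summarization method</h1>
<p>Summarization of a Web page uses <b>HTML tags</b>. The weather was nice today.</p>
<p>Summarization of a Web page uses HTML tags.
Word frequency gives each sentence a weight for summarization.</p>
</body></html>"""

if __name__ == "__main__":
    for sentence in summarize(SAMPLE_HTML):
        print(sentence)
```

In this sketch a sentence scores higher when its words are frequent in the body text and also appear in the title, a subheading, or an emphasized span. The repeated sentence in the sample page is selected only once, because its second occurrence overlaps completely with the first.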