Finding and Extracting Academic Information from Conference Web Pages


School of Computer Science and Engineering, Southeast University, Nanjing, China
{[email protected], x.zhang}@seu.edu.cn
2 Focus Technology Co., Ltd, Nanjing, China
[email protected]

Abstract. This paper proposes a method for finding and extracting academic information from conference Web pages. The main contributions include: (1) A lightweight topic crawling method based on a search engine is used to crawl academic conference Web pages. (2) A new vision-based page segmentation algorithm is proposed to improve the result of the classical VIPS algorithm by introducing a complete tree. This algorithm can divide Web pages into text blocks. (3) Using a Bayesian network classifier, all text blocks are classified into 10 categories according to their vision features, keyword features, and text content features. The initial classification results have 75% precision and 67% recall. (4) The context information of text blocks is employed to repair and refine the initial classification results, which are improved to 96% precision and 98% recall. Finally, academic information is easily extracted from the classified text blocks. Experimental results on real-world datasets show that our method is effective and efficient for finding and extracting academic information from conference Web pages.

Keywords: Topic crawler · Web information extraction · Page segmentation
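To make the precision and recall figures above concrete, the following minimal sketch computes both measures for one category before and after a context-based repair step. The repair rule used here (relabel a block when its two neighbors agree with each other but not with it) is an assumption made purely for illustration, not the rules used in this paper. The category names, label sequences, and function names are likewise hypothetical.

```python
def precision_recall(gold, predicted, category):
    """Return (precision, recall) for one category over aligned label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def repair_with_context(labels):
    """Hypothetical repair rule: relabel a block whose two neighbors
    share a label that differs from the block's own label."""
    repaired = list(labels)
    for i in range(1, len(labels) - 1):
        before, after = labels[i - 1], labels[i + 1]
        # Decisions are based on the initial labels, so repairs do not cascade.
        if before == after and labels[i] != before:
            repaired[i] = before
    return repaired


# Hypothetical labels for seven consecutive text blocks on one page.
gold    = ["date", "date", "date", "date", "topic", "topic", "topic"]
initial = ["date", "topic", "date", "date", "topic", "topic", "topic"]
refined = repair_with_context(initial)

initial_pr = precision_recall(gold, initial, "date")
refined_pr = precision_recall(gold, refined, "date")
print("initial:", initial_pr)
print("refined:", refined_pr)
```

In this toy sequence the single mislabeled block is surrounded by two "date" blocks, so the neighbor rule restores it and recall for "date" rises while precision is unchanged. Real pages would need richer context features than adjacent labels alone.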

1 Introduction

Current structured or semantic academic data, such as the ArnetMiner academic researcher social network [1], are based on databases such as DBLP and the ACM library. These academic data mainly describe the paper publication information of researchers. However, knowledge about academic activities is not included in current academic data. Academic conference websites contain not only paper information but also much academic activity information, including research topics, conference time, location, participants, academic awards, and so on. Obtaining such information is not only useful for predicting research trends and analyzing academic social networks, but is also an important supplement to current academic linked data. In order to automatically and efficiently obtain clean, high-quality academic data, it is necessary to extract useful academic information from these conference Web pages. Since academic conference Web pages are usually semi-structured and have diverse content, there is no unified way to automatically find conference sites and extract the academic information.

S. Zhou and Z. Wu (Eds.): ADMA 2012 Workshops, CCIS 387, pp. 65–79, 2013. DOI: 10.1007/978-3-642-41629-3_6, © Springer-Verlag Berlin Heidelberg 2013

P. Wang et al.

Finding academic information requires crawling thousands of conference websites automatically, which is a topic crawling problem. To avoid checking all Web pages one by one, we should design a lightweight topic crawler. Web information extraction is a classical problem [2, 3], which aims at identifying information of interest in unstructured or semi-structured data in pages and translating it into a semantic