An effective approach to enhancing a focused crawler using Google



Jae‑Gil Lee1 · Donghwan Bae1 · Sansung Kim1 · Jungeun Kim1 · Mun Yong Yi1

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract In this paper, we share our experience in augmenting the focused crawler of our vertical search engine, which is designed to work with academic slides. The goal of the focused crawler is to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. To remedy these problems, we propose a combinatory approach in which our query generator uses the indexing information maintained by a general web search engine such as Google to generate a list of target URLs, and this list is then complemented by our URL extractor and file downloader. Because Google has already crawled billions of web pages, it is more cost-efficient and potentially more effective to systematically retrieve the desired information from Google than to redo the crawling from scratch ourselves. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. The capability of SlideCrawler was verified on the top 500 universities worldwide, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.

Keywords Web crawler · Focused crawler · Google · Vertical search engine
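To make the query-generation step concrete, the following is a minimal sketch and not SlideCrawler's actual implementation, which is not shown in this section. It assumes that the query generator restricts a search to one institution's domain and to PowerPoint files by combining the `site:` and `filetype:` search operators. The function names `build_queries` and `to_search_url`, the default file types, and the example domain are all hypothetical.

```python
from urllib.parse import urlencode


def build_queries(domains, filetypes=("ppt", "pptx"), keywords=("",)):
    """Return one query string per (domain, filetype, keyword) combination.

    Assumption: queries use the `site:` and `filetype:` operators; the
    paper's real query generator may be built differently.
    """
    queries = []
    for domain in domains:
        for ext in filetypes:
            for keyword in keywords:
                parts = [f"site:{domain}", f"filetype:{ext}"]
                if keyword:  # an empty keyword means "no extra term"
                    parts.append(keyword)
                queries.append(" ".join(parts))
    return queries


def to_search_url(query, base="https://www.google.com/search"):
    """Percent-encode a query string into a search URL (illustrative only)."""
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Hypothetical input: a single university domain.
    for q in build_queries(["kaist.ac.kr"]):
        print(q, "->", to_search_url(q))
```

Each generated query would then be submitted to the search engine, and the result pages would be handed to a URL extractor and file downloader, as the abstract describes. Adding optional keywords is one plausible way to split a large result set into smaller ones, but that is a design choice of this sketch rather than something stated in the text.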

* Jae‑Gil Lee [email protected]
Donghwan Bae [email protected]
Sansung Kim [email protected]
Jungeun Kim [email protected]
Mun Yong Yi [email protected]

1 Graduate School of Knowledge Service Engineering, KAIST, Daejeon, Republic of Korea





1 Introduction

Entering the era of Big Data, we are experiencing unprecedented growth of data and resources on the Web. It has been predicted that the amount of data produced annually will be 44 times greater in 2020 than it was in 2009 [12]. Search engines such as Google, Bing, and Yahoo are widely used to find and access the information users need. These general search engines return satisfactory results in most cases, but the precision of the results is known to improve when the scope of a query is limited to a specific domain [6]. A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of online content. The vertical content area may be based on the topicality, media type, or genre of the content [28]. For example, for topicality, WebMD (http://www.webmd.com/) is a vertical search engine for medical issues; for media type, SlideShare (http://www.slideshare.net/) and SlideFinder (http://www.slidefinder.net/) are vertical search engines for PowerPoint files; for genre, Microsoft Academic Search (http://academic.microsoft.com/) is a vertical search engine for academic content. As the