An effective approach to enhancing a focused crawler using Google


Jae-Gil Lee1 · Donghwan Bae1 · Sansung Kim1 · Jungeun Kim1 · Mun Yong Yi1

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract In this paper, we share our experience in augmenting the focused crawler of our vertical search engine, which is designed to work with academic slides. The goal of the focused crawler was to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. As a remedy to these problems, we propose a combinatory approach in which the indexing information maintained by a general web search engine such as Google is used to generate a list of target URLs through our query generator, which is then complemented by our URL extractor and file downloader. Because Google has already crawled billions of web pages, it is more cost-efficient and potentially more effective to systematically retrieve the desired information from Google than to redo the crawling from scratch. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. The capability of SlideCrawler was verified on the top 500 universities worldwide, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.

Keywords Web crawler · Focused crawler · Google · Vertical search engine
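The query generator and URL extractor described in the abstract can be illustrated with a minimal sketch. The abstract does not give the query syntax that SlideCrawler uses, so the sketch assumes Google's public `site:` and `filetype:` search operators. The function names `build_queries` and `extract_slide_urls`, the extension list, and the domain `example.edu` are hypothetical and chosen only for illustration. The step that submits a query to the search engine is deliberately left out; the extractor simply filters a list of result URLs that is assumed to have been obtained already.

```python
from urllib.parse import urlparse

# Assumed set of slide file extensions; the paper only says "Microsoft PowerPoint files".
SLIDE_EXTENSIONS = ("ppt", "pptx")


def build_queries(domain, extensions=SLIDE_EXTENSIONS):
    """Return one search query per extension, restricted to a single institution's domain."""
    return [f"site:{domain} filetype:{ext}" for ext in extensions]


def extract_slide_urls(result_urls, domain, extensions=SLIDE_EXTENSIONS):
    """Keep result URLs that are hosted under `domain` and end with a slide extension.

    Duplicates are dropped while the original order is preserved.
    """
    suffixes = tuple("." + ext for ext in extensions)
    seen = set()
    kept = []
    for url in result_urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        path = parsed.path.lower()
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and path.endswith(suffixes) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

Under these assumptions, the URLs returned by `extract_slide_urls` would be handed to a separate file downloader, which corresponds to the third component named in the abstract.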

1 Graduate School of Knowledge Service Engineering, KAIST, Daejeon, Republic of Korea


1 Introduction

Entering the era of Big Data, we are experiencing unprecedented growth of data and resources on the Web. It has been predicted that the amount of data being produced annually will be 44 times greater in 2020 than it was in 2009 [12]. Search engines such as Google, Bing, and Yahoo are widely used to find and access the information users need. These general search engines return satisfactory results in most cases, but the precision of the results is known to improve when the scope of a query is limited to specific domains [6]. A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of online content. The vertical content area may be based on the topicality, media type, or genre of the content [28]. For example, in terms of topicality, WebMD (http://www.webmd.com/) is a vertical search engine for medical issues; in terms of media type, SlideShare (http://www.slideshare.net/) and SlideFinder (http://www.slidefinder.net/) target PowerPoint files; in terms of genre, Microsoft Academic Search (http://academic.microsoft.com/) covers academic content. As the
