The Crawl and Analysis of Recruitment Data Based on the Distributed Crawler




Harbin 150022, China
[email protected]
2 School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150022, China

Abstract. With the rapid development of the Internet, how to obtain useful data quickly and efficiently has become an important problem. In this paper, a distributed crawler system is designed and implemented to capture the recruitment data of online recruitment websites. The architecture and operating workflow of the Scrapy crawler framework are combined with Python, the composition and functions of Scrapy-Redis, and the concept of data visualization; Echarts is applied to the crawled data to describe the characteristics of the web pages where employers publish recruitment information. On the basis of the Scrapy framework, middleware, proxy IPs, and dynamic User-Agents are used to prevent the crawler from being blocked by websites. Data cleaning and encoding conversion are used for data processing.

Keywords: Distributed crawler · Scrapy framework · Data processing
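The anti-blocking measures mentioned in the abstract (middleware with proxy IPs and dynamic User-Agents) can be sketched as a Scrapy-style downloader middleware. The sketch below is illustrative, not the paper's actual implementation: the class name, the User-Agent pool, and the proxy endpoint are all placeholder assumptions, and in a real Scrapy project the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py.

```python
import random

# Hypothetical pools; a real deployment would load these from configuration
# or a proxy provider. All values below are illustrative placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]
PROXIES = ["http://127.0.0.1:8080"]  # placeholder proxy endpoint


class RandomUserAgentProxyMiddleware:
    """Sketch of a downloader middleware: before each request is sent,
    pick a random User-Agent header and a proxy endpoint so the crawler
    does not present one fixed fingerprint to the target website."""

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request; mutating the
        # request in place is the idiomatic way to rotate UA and proxy.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # returning None lets Scrapy continue processing
```

Rotating both the header and the exit IP per request means the target site sees traffic spread across many apparent clients, which is the mechanism the paper relies on to avoid bans.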

1 Introduction

With the widespread use of modern networks, especially after the application of 5G, people spend more and more time searching for useful information through piles of data. A distributed web crawler can therefore be adopted to search for and obtain Internet data, which greatly improves search efficiency. The Internet has been thriving rapidly and changing people's lives for decades. According to the China Internet Development Report 2019, issued at the sixth World Internet Conference held in Wuzhen, Zhejiang Province on October 20, 2019, China has 0.89 billion netizens and an Internet penetration rate of 59.6% [1–4]. There are 5.23 million websites and 281.6 billion web pages. Moreover, with the advancing commercialization of 5G, big data, cloud computing, the IoT, and data volumes will grow dramatically.

The traditional way to find relevant information is to use an Internet search engine, but its efficiency is low when searching big data, and it is not conducive to further data processing and analysis. The web crawler, also called a web spider, was originally developed by Matthew Gray at MIT in 1993 [5]. It is a contemporary of the World Wide Web and by nature cannot exist without the Internet: if the Internet is compared to a spider web, the web crawler is the spider crawling on that web.

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2020. Published by Springer Nature Switzerland AG 2020. All Rights Reserved. X. Jiang and P. Li (Eds.): GreeNets 2020, LNICST 333, pp. 162–168, 2020. https://doi.org/10.1007/978-3-030-62483-5_18


Requesting a URL address, the crawler collects and analyzes the data in the response content. For example, if the response content is HTML, the DOM structure is analyzed, parsed, and matched with regular expressions; if the response content is XML/JSON, it is converted into data objects for further analysis. A distributed crawler uses many computers and many crawlers to coincide with m
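The branch on response type described above can be sketched with the Python standard library alone. The HTML snippet, the `<h3 class="job">` markup, and the JSON field name are illustrative assumptions, not taken from the paper's actual dataset:

```python
import json
import re


def parse_response(body: str, content_type: str):
    """Dispatch on the response type: HTML is matched with a regular
    expression, while JSON is deserialized into Python data objects."""
    if content_type == "html":
        # Pull every job title out of hypothetical <h3 class="job"> nodes;
        # a real spider would typically walk the parsed DOM instead.
        return re.findall(r'<h3 class="job">(.*?)</h3>', body)
    if content_type == "json":
        # JSON responses convert directly into dicts/lists for analysis.
        return json.loads(body)
    raise ValueError(f"unsupported content type: {content_type}")


html = '<div><h3 class="job">Data Engineer</h3><h3 class="job">Tester</h3></div>'
print(parse_response(html, "html"))                 # ['Data Engineer', 'Tester']
print(parse_response('{"salary": "10k"}', "json"))  # {'salary': '10k'}
```

In practice Scrapy wraps this dispatch for the crawler: its response objects expose CSS/XPath selectors for HTML, while JSON bodies are handed to the standard `json` module as shown.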