Optimal bandwidth allocation for web crawler systems with time constraints



ORIGINAL RESEARCH

Weiping Zhu1 · Yaodong Li1 · Shu Li2 · Yi Xu3 · Xiaohui Cui4

Received: 2 November 2019 / Accepted: 21 July 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

A web crawler is an important tool for obtaining information from the Internet in a timely manner. In a typical web crawler system with limited bandwidth, many websites are crawled under different time constraints. Existing studies of web crawler systems do not consider bandwidth allocation in such a complex environment; hence, the time constraints may not be satisfied. In this study, we investigate bandwidth allocation approaches for such a web crawler system. The approaches are designed for two scenarios: the number of websites either exceeds or does not exceed the maximum number of web crawlers that the system can execute simultaneously. For the latter scenario, we propose approaches that control the bandwidth of the web crawlers to minimize either the maximum completion time or the sum of the execution times of all web crawlers, under both sufficient and insufficient bandwidth. For the former scenario, we propose a round-based reallocation approach that schedules both the execution sequence and the bandwidth allocation of the web crawlers. Extensive simulations are conducted to validate the proposed approaches, and the results show that they satisfy the time constraints well and achieve desirable execution performance in various scenarios.

Keywords  Bandwidth allocation · Web crawler · Time constraint · Optimization

* Xiaohui Cui [email protected]
Weiping Zhu [email protected]
Yaodong Li [email protected]
Shu Li whu‑[email protected]
Yi Xu [email protected]

1 School of Computer Science, Wuhan University, Wuhan, People’s Republic of China
2 School of Mathematics and Statistics, Wuhan University, Wuhan, People’s Republic of China
3 Department of Mathematics, Southeast University, Nanjing, People’s Republic of China
4 School of Cyber Science and Engineering, Wuhan University, Wuhan, People’s Republic of China

1 Introduction



In the last decade, the amount of data on the Internet has grown significantly (Ding and Wang 2018). These data contain much useful information; however, it is difficult to obtain that information in a timely manner (Kumar et al. 2017). For example, graduates seeking jobs often browse several dozen websites multiple times a day to obtain job-related information. In doing so, they may read redundant or useless content and still struggle to find the latest information. In another example, people browsing the Internet may overlook food safety information scattered across multiple data sources, which may affect their health. A web crawler is a program that automatically downloads web pages from the Internet and extracts the required information from them (Wang et al. 2018b; Thelwall 2001). A web crawler starts from one or several initial web pages, a