Content-aware web robot detection


Athanasios Lagopoulos¹ · Grigorios Tsoumakas¹

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Web crawlers account for more than a third of total web traffic, and they threaten the security, privacy and veracity of web applications and their users. Businesses in finance, ticketing and publishing, as well as websites with rich and unique content, are the ones most affected by their actions. To deal with this problem, we present a novel web robot detection approach that takes advantage of the content of a website, based on the assumption that human web users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical user session representation of log-based features with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, towards alleviating the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outperforms state-of-the-art methods for web robot detection.

Keywords Web robot · Crawler · Semantics · Supervised learning · Latent Dirichlet allocation

This research is co-financed by Greece and the European Union (European Social Fund, ESF) through the Operational Programme "Human Resources Development, Education and Lifelong Learning" in the context of the project "Strengthening Human Resources Research Potential via Doctorate Research" (MIS-5000432), implemented by the State Scholarships Foundation (IKY).

¹ Aristotle University of Thessaloniki, Thessaloniki, Greece

1 Introduction

Web (ro)bots constantly request resources from web servers across the Internet without human intervention, indexing and scraping content with the aim of making information reachable and available on demand. Recent industry reports show that web robots generated 37.9% of all web traffic in 2018 (42.2% in 2017), affecting every industry all over the world [11, 22]. Bots may access web applications for beneficial reasons, such as indexing and health monitoring [10]. However, around half of the bot traffic is considered malicious, threatening the security and privacy of a web application and its users. With the ultimate goal of monetizing the requested information, malicious bots perform actions such as price and content scraping, account takeover and creation, credit card fraud and denial-of-service attacks [13]. Businesses in the finance, ticketing and education sectors are the ones most affected by these actions and need to deal not only with security issues but also with the unfair competition deriving from such fraudulent practices. Furthermore, another common threat that web applications need to deflect is analytics skewing, which is caused by otherwise benign bots. Websites with unique and rich content, like data repositories, marketplaces and