Website Privacy Preservation for Query Log Publishing


1 Web Research Group, University Pompeu Fabra, Barcelona, Spain
2 Otto-von-Guericke-University Magdeburg, Germany
3 Yahoo! Research, Barcelona, Spain

Abstract. In this paper we study privacy preservation for the publication of search engine query logs. We introduce a new privacy concern, website privacy, as a special case of business privacy. We define the possible adversaries who could be interested in disclosing website information and the vulnerabilities in the query log that they could exploit. We elaborate on anonymization techniques to protect website information, discuss different types of attacks that an adversary could use, and propose an anonymization strategy for one of these attacks. We then present a graph-based heuristic to validate the effectiveness of our anonymization method and perform an experimental evaluation of this approach. Our experimental results show that the query log can be appropriately anonymized against the specific attack while retaining a significant volume of useful data.

1 Introduction

F. Bonchi et al. (Eds.): PinKDD 2007, LNCS 4890, pp. 80–96, 2008. © Springer-Verlag Berlin Heidelberg 2008

Query logs are very rich sources of information from which the scientific community can benefit immensely. Among other things, these logs allow the discovery of interesting behavior patterns and rules, which in turn can be used for sophisticated user models, improvements in ranking, spam detection and other useful applications. However, the publication of query logs raises serious and well-justified privacy concerns: it has been demonstrated that naively anonymized query logs pose too great a risk of disclosing private information. Awareness of privacy threats increased with the publication of the America Online (AOL) query log in 2006 [1]. This dataset, which contained 20 million Web queries from 650,000 AOL users, was subjected to only rudimentary anonymization before being published. After its release, it turned out that users appearing in the log had issued queries that disclosed their identity, either directly or in combination with other searches [2]. Some users even had their identities published along with their queries [3]. This raised awareness of the fact that query logs can be manipulated to reveal private information if published without proper anonymization.

Privacy preservation in query logs is a very current scientific challenge, and some solutions have been proposed recently [4,5]. Like the general research advances in privacy-preserving data mining, they address the privacy of persons. Little attention has been paid to another type of privacy concern, which we consider no less important: website privacy or, more generally, business privacy. In this work we argue that important and confidential information about websites and their owners can be discovered from query logs and that naive forms of URL an