Thompson Sampling for Bayesian Bandits with Resets

Abstract. Multi-armed bandit problems are challenging sequential decision problems that have been widely studied, as they constitute a mathematical framework that abstracts many different decision problems in fields such as machine learning, logistics, industrial optimization, management of clinical trials, etc. In this paper we address a non-stationary environment with expected rewards that are dynamically evolving, considering a particular type of drift, which we call resets, in which the arm qualities are re-initialized from time to time. We compare different arm selection strategies with simulations, focusing on a Bayesian method based on Thompson sampling (a simple, yet effective, technique for trading off between exploration and exploitation).

1 Introduction

The multi-armed bandit problem is a general framework that can represent many different sequential decision problems. Essentially, a multi-armed bandit is a slot machine with n arms (representing possible decisions), each associated with a different and unknown expected payoff (reward). The problem is to select the optimal (or near-optimal) sequence of arms to pull in order to maximize reward (in expectation). Rewards obtained at earlier steps are taken into account in order to identify arms associated with high payoff; but since rewards are uncertain, several pulls of the same arm are usually necessary to assess its quality with some confidence. A key concept in bandit problems is the trade-off between exploitation (pulling the arm with the highest estimated expected payoff) and exploration (gathering more information about the expected payoffs of the other arms).

A large number of works have addressed bandit problems [12,11,2,9,14,10]. In particular, considerable attention has been given to methods (including heuristics) that are computationally fast and produce a decision about the next arm to pull in very short time. These include action-value strategies, but also the family of UCB methods [1].

Thompson sampling is a simple and effective strategy for multi-armed bandits. It can be implemented very efficiently and is based on principled Bayesian reasoning. Essentially, this strategy maintains a probabilistic estimate of the value of each arm and selects arms according to their probability of being the optimal arm (it is a randomized selection strategy). This idea was first described eighty years ago [13]. However, it was surprisingly neglected until its recent rediscovery [7,8,5,9], which showed its effectiveness in a number of different settings.

Most works on bandits assume a stationary distribution of rewards. In many situations, however, one cannot expect the quality of the different arms to be constant.
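To make the idea concrete, the following is a minimal sketch of Thompson sampling for a stationary Bernoulli bandit with independent Beta priors on the arms' success probabilities. It illustrates the general principle described above rather than the paper's exact experimental setup; the horizon, number of arms, and reward probabilities in the example call are hypothetical.

```python
import numpy as np

def thompson_sampling(true_means, horizon, rng=None):
    """Beta-Bernoulli Thompson sampling (illustrative sketch).

    Maintains a Beta(alpha, beta) posterior over each arm's success
    probability; at every step it draws one sample per arm from its
    posterior and pulls the arm with the largest sampled value, so each
    arm is selected with (approximately) its probability of being optimal.
    """
    rng = rng or np.random.default_rng()
    n_arms = len(true_means)
    alpha = np.ones(n_arms)   # posterior "successes" (uniform Beta(1,1) prior)
    beta = np.ones(n_arms)    # posterior "failures"
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its current posterior.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))
        # Observe a Bernoulli reward and update that arm's posterior.
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Hypothetical example: three arms with unknown success probabilities.
print(thompson_sampling([0.2, 0.5, 0.8], horizon=1000))
```

Randomizing the arm choice through posterior samples is what balances exploration and exploitation: arms with uncertain posteriors occasionally produce large samples and get pulled, while clearly inferior arms are sampled less and less often.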


If the values of the different arms change with time (non-stationarity), the situation is that of dynamic bandits (also called restless bandits).
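As a rough illustration of the kind of non-stationarity considered here, the sketch below simulates an environment in which the arm qualities are re-initialized from time to time. The reset mechanism (a per-step reset probability and uniformly re-drawn means) is an illustrative assumption, not the paper's specification of resets.

```python
import numpy as np

class ResettingBernoulliBandit:
    """Hypothetical sketch of a non-stationary Bernoulli bandit with resets:
    with some small probability at each step, every arm's success
    probability is re-drawn uniformly at random."""

    def __init__(self, n_arms, reset_prob, rng=None):
        self.rng = rng or np.random.default_rng()
        self.n_arms = n_arms
        self.reset_prob = reset_prob              # per-step probability of a reset
        self.means = self.rng.uniform(size=n_arms)

    def pull(self, arm):
        # Occasionally re-initialize all arm qualities ("reset"),
        # then return a Bernoulli reward for the pulled arm.
        if self.rng.random() < self.reset_prob:
            self.means = self.rng.uniform(size=self.n_arms)
        return self.rng.binomial(1, self.means[arm])
```

Under such an environment, a learner that keeps accumulating evidence forever becomes overconfident in stale estimates, which is why selection strategies need to be adapted to drift of this kind.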