Exploration from Generalization Mediated by Multiple Controllers

Abstract  Intrinsic motivation involves internally governed drives for exploration, curiosity, and play. These drives shape subjects, over the course of development and beyond, to explore in order to learn, to expand the actions they are capable of performing, and to acquire skills that can be useful in future domains. We adopt a utilitarian view of this learning process, treating it in terms of exploration bonuses that arise from distributions over the structure of the world that imply potential benefits from generalizing knowledge and skills to subsequent environments. We discuss how functionally and architecturally different controllers may realize these bonuses in different ways.

1 Introduction

The Gittins index (Berry and Fristedt 1985; Gittins 1989) is a famous pinnacle of the analytical treatment of the trade-off between exploration and exploitation. Although it applies somewhat more generally, it is most often treated in the most straightforward case of an infinite-horizon, exponentially discounted, multiarmed bandit problem with appropriate known prior distributions for the payoffs of the arms. Under these circumstances, the index precisely quantifies an exploration bonus (Dayan and Sejnowski 1996; Kakade and Dayan 2002; Ng et al. 1999; Sutton 1990) for choosing (i.e., exploring) an arm whose payoff is incompletely known. This bonus arises because, if the arm is found on exploration to be better than expected, it can then be exploited in all future choices. An exactly equivalent way of thinking about the Gittins index is in terms of generalization. Under the conditions above, there is perfect generalization over time
for each arm of the bandit. That is, an arm does not change its character: what is learned about it at one time is exactly appropriate at subsequent times too. Thus, what might typically be considered intrinsically motivated actions, such as playing with, exploring, and engaging with an uncertain arm in order to gain the skill of valuing it, are useful, since this skill can be directly generalized to future choices. The exploration bonus (formally, the difference between the Gittins index for an arm and its certainty-equivalent, mean worth) translates the uncertainty into units of value (a sort of information value; Howard 1966) and thus quantifies the potential motivational worth of the actions concerned. This worth is intrinsic to the extent that it depends on intrinsic expectations about the possibilities for generalization. This account applies equally to more complex Markov decision problems, as in nonmyopic, Bayesian reinforcement learning (RL; e.g., Duff 2000; Poupart et al. 2006; Şimşek and Barto 2006; Wang et al. 2005). This redescription emphasizes two well-understood points. First, skills
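
To make the exploration bonus described above concrete, here is a minimal Python sketch (not from the chapter) of one standard way of approximating the Gittins index for a single Bernoulli arm with a Beta posterior: the retirement (calibration) formulation, in which the index is the constant per-step retirement reward at which retiring immediately and pulling the arm once more, while keeping the option to retire later, are equally attractive. The function name, the truncation horizon, and the numerical tolerances are illustrative choices, not anything specified in the text.

import numpy as np

def bernoulli_gittins_index(a, b, gamma=0.9, horizon=60, tol=1e-5):
    # Approximate Gittins index of a Bernoulli arm whose success probability
    # has a Beta(a, b) posterior (an illustrative sketch; the exact index is
    # recovered only as the truncation horizon goes to infinity).
    def play_value_at_root(lam):
        retire = lam / (1.0 - gamma)                 # value of retiring now on reward lam forever
        # V[i] is the value of the state reached after i successes and
        # (k - i) failures, i.e. posterior Beta(a + i, b + k - i), with the
        # option to retire; at stage `horizon` retirement is forced (truncation).
        V = np.full(horizon + 1, retire)
        for k in range(horizon - 1, 0, -1):          # backward induction over stages horizon-1 .. 1
            V_new = np.empty(k + 1)
            for i in range(k + 1):
                p = (a + i) / (a + b + k)            # posterior mean of the arm at this state
                play = p * (1.0 + gamma * V[i + 1]) + (1.0 - p) * gamma * V[i]
                V_new[i] = max(retire, play)
            V = V_new
        p0 = a / (a + b)
        return p0 * (1.0 + gamma * V[1]) + (1.0 - p0) * gamma * V[0]

    lo, hi = 0.0, 1.0                                # Bernoulli rewards lie in [0, 1]
    while hi - lo > tol:                             # bisect on the retirement reward
        lam = 0.5 * (lo + hi)
        if play_value_at_root(lam) > lam / (1.0 - gamma):
            lo = lam                                 # playing still beats retiring: the index is larger
        else:
            hi = lam
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    a, b = 1, 1                                      # uninformative Beta(1, 1) prior over the arm's payoff
    index = bernoulli_gittins_index(a, b, gamma=0.9)
    bonus = index - a / (a + b)                      # exploration bonus: index minus certainty-equivalent mean
    print(f"Gittins index ~ {index:.3f}; exploration bonus ~ {bonus:.3f}")

Because the arm's payoff is uncertain, the index returned exceeds the posterior mean a / (a + b); the gap is the exploration bonus discussed above, and it shrinks as the posterior sharpens (larger a + b) or as the discount factor gamma falls.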