Fast and slow curiosity for high-level exploration in reinforcement learning



Nicolas Bougie1,2 · Ryutaro Ichise1,2

 Nicolas Bougie
[email protected]

Ryutaro Ichise
[email protected]

1 National Institute of Informatics, Tokyo, Japan

2 The Graduate University for Advanced Studies, Sokendai, Tokyo, Japan

© The Author(s) 2020

Abstract
Deep reinforcement learning (DRL) algorithms rely on carefully designed environment rewards that are extrinsic to the agent. However, in many real-world scenarios rewards are sparse or delayed, motivating the need to discover efficient exploration strategies. While intrinsically motivated agents hold promise for better local exploration, solving problems that require coordinated decisions over long time horizons remains an open problem. We postulate that to discover such strategies, a DRL agent should be able to combine local and high-level exploration behaviors. To this end, we introduce the concept of fast and slow curiosity, which aims to incentivize long-horizon exploration. Our method decomposes the curiosity bonus into a fast reward that deals with local exploration and a slow reward that encourages global exploration. We formulate this bonus as the error in an agent’s ability to reconstruct observations given their contexts. We further propose to dynamically weight local and high-level strategies by measuring state diversity. We evaluate our method on a variety of benchmark environments, including Minigrid, Super Mario Bros, and Atari games. Experimental results show that our agent outperforms prior approaches on most tasks in terms of exploration efficiency and mean scores.

Keywords Reinforcement learning · Exploration · Autonomous exploration · Curiosity in reinforcement learning
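To make the high-level idea concrete, the following purely illustrative Python sketch shows one way such a combination could be wired together: a fast bonus computed from a short local context, a slow bonus computed from a long-horizon context, and a diversity score that shifts weight between them. The placeholder reconstruction models, the tanh-squashed pairwise-distance diversity measure, and the mixing rule are assumptions made for illustration only; they are not the formulation introduced in this paper.

```python
# Illustrative sketch (not the authors' implementation): blend a "fast"
# local curiosity bonus with a "slow" global curiosity bonus, with the
# mixing weight driven by a simple measure of state diversity.
import numpy as np


def reconstruction_error(model, observation, context):
    """Curiosity signal: how poorly a model reconstructs an observation
    from its context (placeholder for any learned reconstruction model)."""
    reconstruction = model(context)
    return float(np.mean((observation - reconstruction) ** 2))


def state_diversity(recent_states):
    """Toy diversity score in [0, 1]: mean pairwise L2 distance among
    recent state embeddings, squashed with tanh (an assumption)."""
    states = np.asarray(recent_states)
    if len(states) < 2:
        return 0.0
    diffs = states[:, None, :] - states[None, :, :]
    mean_dist = np.mean(np.linalg.norm(diffs, axis=-1))
    return float(np.tanh(mean_dist))


def intrinsic_reward(fast_model, slow_model, obs, local_ctx, global_ctx,
                     recent_states, scale=1.0):
    """Blend fast (local) and slow (global) curiosity bonuses.

    When recent states are already diverse, local exploration is working,
    so more weight is given to the slow, long-horizon bonus, and vice versa.
    """
    r_fast = reconstruction_error(fast_model, obs, local_ctx)
    r_slow = reconstruction_error(slow_model, obs, global_ctx)
    w = state_diversity(recent_states)          # dynamic weighting
    return scale * (w * r_slow + (1.0 - w) * r_fast)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dummy_model = lambda ctx: ctx.mean(axis=0)   # stand-in reconstruction model
    obs = rng.normal(size=8)
    local_ctx = rng.normal(size=(4, 8))          # short window of observations
    global_ctx = rng.normal(size=(64, 8))        # long-horizon context
    recent = [rng.normal(size=8) for _ in range(16)]
    print(intrinsic_reward(dummy_model, dummy_model, obs,
                           local_ctx, global_ctx, recent))
```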

1 Introduction

In recent years, deep reinforcement learning (DRL) has achieved many accomplishments in a wide range of application domains, such as game playing [40, 51], robot control [35], and autonomous vehicles [33]. DRL algorithms rely on maximizing the cumulative rewards provided by the environment. However, most DRL algorithms depend on well-designed, dense rewards to guide the behavior of the agent. Hand-crafting such reward functions is a challenging engineering problem. In order to deploy DRL in real-world settings wherein rewards are often sparse or poorly defined, DRL agents will have to discover efficient exploration strategies. Multiple heuristics such as entropy regularization [41] were introduced but did not yield significant improvements in sparse reward tasks. Several works attempt to tackle this challenge by providing a new intrinsic exploration bonus (i.e., curiosity) to the agent. For example, count-based exploration [55] keeps visit counts for states and favors the exploration of rarely visited states. Another class of methods relies on predicting the dynamics of the environment [2]. For instance, ICM [44] predicts the feature representation of the next state based on the current state and the action taken by the agent. Nevertheless, maximizing the pre