Hindsight-Combined and Hindsight-Prioritized Experience Replay



1 Division of Information Science, Nara Institute of Science and Technology, Takayama Town, Ikoma, Nara 630-0192, Japan
{tan.renzo roel perez.tp7,kazushi}@is.naist.jp
2 School of Science and Engineering, Ateneo de Manila University, Katipunan Avenue, National Capital Region, 1108 Quezon City, Philippines
{rrtan,jpvergara}@ateneo.edu

Supported by the Japan Society for the Promotion of Science through the Grants-in-Aid for Scientific Research Program (KAKENHI 18K19821).

Abstract. Reinforcement learning has proved to be of great utility; execution, however, may be costly due to sampling inefficiency. An efficient method for training is experience replay, which recalls past experiences. Several experience replay techniques, namely, combined experience replay, hindsight experience replay, and prioritized experience replay, have been crafted, while their relative merits remain unclear. In the study, one proposes hybrid algorithms – hindsight-combined and hindsight-prioritized experience replay – and evaluates their performance against published baselines. Experimental results demonstrate the superior performance of hindsight-combined experience replay on an OpenAI Gym benchmark. Further, insight into the nonconvergence of hindsight-prioritized experience replay is presented towards the improvement of the approach.

Keywords: Experience replay · Deep Q-Network · Reinforcement learning · Sample efficiency · Hybrid algorithm

1 Introduction

Reinforcement learning [20] has been the subject of considerable research. Its uncomplicated formulation is capable of capturing a vast number of problems in artificial intelligence. Fields such as resource management [13], traffic signal control [2], and robotics [8] abound with practical applications. Generally, the learning problem is to control a system so as to maximize a numerical value representing a long-term objective [7]. One calls the learner the agent, and the agent is situated in an environment. The standard reinforcement learning formalism therefore corresponds to a decision-making framework consisting of an agent that interacts with an environment and improves its performance based on feedback. At each time step, the agent is given a state and it selects an action; the environment then presents a reward and a new state. By and large, the goal is to maximize the cumulative reward.
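
To make the interaction loop concrete, the following minimal sketch pairs a Gym-style episode with a bare-bones replay memory of the kind recalled by experience replay. It assumes the classic OpenAI Gym interface (reset/step returning an observation and a four-tuple); the CartPole-v1 environment, the random action choice, and the buffer and batch sizes are illustrative placeholders, not the configuration evaluated in the paper.

```python
import random
from collections import deque

import gym  # classic Gym API (pre-0.26) is assumed here

# Minimal sketch of the agent-environment loop described above, together with
# a bare-bones replay memory; all names and sizes are illustrative only.
env = gym.make("CartPole-v1")
replay_buffer = deque(maxlen=10000)  # stores past transitions for reuse

state = env.reset()
cumulative_reward = 0.0
done = False

while not done:
    # The agent observes the current state and selects an action
    # (a random action stands in for a learned policy).
    action = env.action_space.sample()

    # The environment presents a reward and a new state.
    next_state, reward, done, _ = env.step(action)

    # Experience replay keeps the transition so it can be reused for training.
    replay_buffer.append((state, action, reward, next_state, done))

    cumulative_reward += reward
    state = next_state

# A learner would repeatedly sample minibatches of past experiences like this.
if len(replay_buffer) >= 32:
    minibatch = random.sample(replay_buffer, 32)

print("Episode return:", cumulative_reward)
```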


While reinforcement learning shows promise, implementation in real-world contexts can be costly because of sampling inefficiency. This means that a multitude of runs is needed for the algorithm to achieve success. A way to address such a complication is through the utilization of experience replay [11], where previous experiences are stored and reused. As an aside, there are other methods through which one may grapple with the problem; recent alternatives include using Gaussian processes [5] and using babbling [9,14] to speed up learning. The paper,