Modeling-Learning-Based Actor-Critic Algorithm with Gaussian Process Approximator



Shan Zhong · Jack Tan · Husheng Dong · Xuemei Chen · Shengrong Gong · Zhenjiang Qian

Received: 27 February 2019 / Accepted: 6 February 2020 © Springer Nature B.V. 2020

S. Zhong (✉) · S. Gong · Z. Qian
School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, China
e-mail: [email protected]

S. Zhong · S. Gong · Z. Qian
CIT-KIT-UWEC-MU Joint Laboratory of International Cooperation in Information Science, Changshu, China

S. Zhong · J. Tan
School of Computer Science, University of Wisconsin, Eau Claire, WI, USA

S. Zhong
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China

H. Dong
Suzhou Institute of Trade and Commerce, Suzhou, China

X. Chen (✉)
School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China
e-mail: [email protected]

Abstract  Tasks with continuous state and action spaces are difficult to solve with high sample efficiency. Model learning and planning, a well-known approach to improving sample efficiency, first learns a model of the system dynamics and then uses the model for planning. However, if the dynamics model is not captured accurately, convergence slows and sample efficiency drops. Therefore, to solve problems with continuous state and action spaces, a model-learning-based actor-critic algorithm with a Gaussian process approximator, named MLAC-GPA, is proposed, where the Gaussian process is selected as the modeling method for its ability to capture the noise and uncertainty of the underlying system. The model in MLAC-GPA is first represented by linear function approximation and then modeled by the Gaussian process. Afterward, the expectation vector and the covariance matrix of the model parameters are estimated by Bayesian reasoning. Once learned, the model is used for planning, to accelerate the convergence of the value function and the policy. Experimentally, MLAC-GPA is implemented and compared with five representative methods on three classic benchmarks: Pole Balancing, Inverted Pendulum, and Mountain Car. The results show that MLAC-GPA outperforms the other methods in both learning rate and sample efficiency.
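The modeling step summarized above, estimating the expectation vector and covariance matrix of the linear model's parameters by Bayesian reasoning, corresponds to Bayesian linear regression, the weight-space view of a Gaussian process. Below is a minimal sketch of that step under this reading; the feature construction, the hyperparameters `alpha` and `beta`, and all function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def fit_model_posterior(Phi, y, alpha=1.0, beta=25.0):
    """Posterior over linear dynamics-model weights (Bayesian linear regression).

    Phi   : (N, d) feature matrix built from observed (state, action) pairs
    y     : (N,)   one output dimension of the observed next states
    alpha : assumed prior precision of the weights
    beta  : assumed precision (inverse variance) of the transition noise
    """
    d = Phi.shape[1]
    # Covariance matrix of the weight posterior: S = (alpha*I + beta*Phi^T Phi)^{-1}
    S = np.linalg.inv(alpha * np.eye(d) + beta * Phi.T @ Phi)
    # Expectation vector of the weight posterior: m = beta * S * Phi^T * y
    m = beta * S @ (Phi.T @ y)
    return m, S

def predict_next_state(phi, m, S, beta=25.0):
    """Predictive mean and variance for a new feature vector phi."""
    mean = phi @ m
    var = 1.0 / beta + phi @ S @ phi  # observation noise plus model uncertainty
    return mean, var
```

In MLAC-GPA, the predictive mean and variance of such a learned model would then feed the planning phase; the sketch only illustrates the Bayesian parameter estimate itself.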

Keywords  Gaussian process · Actor-critic · Model learning · Planning · Linear function approximation

1 Introduction

Reinforcement learning (RL) targets learning an optimal policy by maximizing the long-term cumulative reward [1]. Compared with supervised and unsupervised methods, RL can solve interactive problems by approximating the Bellman equation [2] (a standard form is shown below), so it is widely applied in areas [3–9] such as robotic control, game theory, combinatorial optimization and scheduling, signal processing, multi-agent systems, autonomous driving, and grid computing.
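For reference, the state-value Bellman equation referred to here can be written in its standard form (reconstructed from the common definition, not reproduced from the paper):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, r_{t+1} + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_t = s \,\right],$$

where $\gamma \in [0, 1)$ is the discount factor and the expectation is taken over the policy $\pi$ and the transition dynamics.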

The value function, taking the expectation form, provides an iterative estim