Multikernel Recursive Least-Squares Temporal Difference Learning
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
[email protected], {qxzhu,xinzhengniu}@uestc.edu.cn
2 College of Information Science and Technology, Hainan University, Haikou, China
Abstract. Traditional least-squares temporal difference (LSTD) algorithms provide an efficient way for policy evaluation, but their performance is greatly influenced by the manual selection of state features, and their approximation ability is often limited. To overcome these problems, we propose a multikernel recursive LSTD algorithm in this paper. Unlike previous kernel-based LSTD algorithms, the proposed algorithm uses the Bellman operator along with a projection operator, and constructs the sparse dictionary online. To avoid caching all history samples and to reduce the computational cost, it uses the sliding-window technique. To avoid overfitting and to reduce the bias caused by the sliding window, it also employs L2 regularization. In particular, to improve the approximation ability, it uses the multikernel technique, which, to the best of our knowledge, is applied to value-function prediction for the first time. Experimental results on a 50-state chain problem show the good performance of the proposed algorithm in terms of convergence speed and prediction accuracy.

Keywords: Reinforcement learning · Temporal difference · Recursive least squares · Multiple kernels · Online sparsification · L2 regularization
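To make the multikernel technique mentioned above concrete, the following is a minimal sketch of one common way to build a multikernel: a convex combination of Gaussian base kernels with different widths. The function names, base kernels, and weights here are illustrative assumptions, not the construction used in the paper.

```python
import numpy as np

def gaussian_kernel(width):
    """Return a Gaussian (RBF) kernel function with bandwidth `width`."""
    return lambda x, y: np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2)
                               / (2.0 * width ** 2))

def multikernel(base_kernels, weights):
    """Combine several base kernels into one kernel via a convex combination.

    Illustrative only: the choice of base kernels and weights is an
    assumption, not the paper's specific multikernel construction.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalize so the weights sum to one
    def combined(x, y):
        return float(sum(wi * k(x, y) for wi, k in zip(w, base_kernels)))
    return combined

# Example: mix two Gaussian kernels with different widths so the combined
# kernel can capture both fine and coarse structure of the state space.
k = multikernel([gaussian_kernel(0.5), gaussian_kernel(2.0)], weights=[0.5, 0.5])
print(k(np.array([0.0]), np.array([0.3])))
```

Combining kernels of several widths is one way a single fixed kernel's limited approximation ability can be relaxed.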
1 Introduction

Reinforcement learning (RL) is a promising methodology for solving sequential decision problems. By interacting with an initially unknown environment, the agent aims to learn an optimal policy for maximizing its cumulative rewards [1]. In RL, a popular way to compute the optimal policy is through value functions, which are often estimated by temporal difference (TD) algorithms. For small discrete state-space problems, value functions can be stored in a table. But for many real-world problems, which have large or continuous state spaces, tabular TD algorithms will suffer from the "curse of dimensionality". In such cases, a common strategy is to employ linear function approximators, which are better understood and have more convergence guarantees than nonlinear function approximators [2]. One dominant linear TD formulation is the least-squares TD (LSTD) [3–5]. Compared with the standard TD algorithm, LSTD algorithms use samples more efficiently and eliminate all step-size parameters [6, 7]. However, their performance is greatly influenced by the manual
selection of state features, and their approximation ability is often limited [8]. In this paper, we focus on how to overcome these two drawbacks.

In the last two decades, kernel methods have been extensively studied in supervised and unsupervised learning [9, 10]. One appeal of kernel methods is the ability to handle nonlinear operations on or
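To make the drawback of manually chosen features concrete, the following is a minimal sketch of standard batch LSTD(0) with hand-crafted features, not the recursive multikernel algorithm proposed in this paper. The function name `lstd`, the `transitions` format, and the `ridge` term are illustrative assumptions.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.95, ridge=1e-3):
    """Batch LSTD(0) with hand-crafted state features.

    transitions : list of (s, r, s_next) tuples collected under a fixed policy
    phi         : callable mapping a state to a feature vector (np.ndarray);
                  choosing phi by hand is the drawback discussed above
    ridge       : small diagonal term that keeps the matrix A invertible
    """
    k = phi(transitions[0][0]).shape[0]
    A = ridge * np.eye(k)
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)   # A <- A + phi(s) (phi(s) - gamma phi(s'))^T
        b += r * f                             # b <- b + r phi(s)
    theta = np.linalg.solve(A, b)              # value estimate: V(s) ~= phi(s)^T theta
    return theta
```

The quality of the resulting value estimate hinges entirely on the choice of `phi`; kernel-based and multikernel variants of LSTD aim to remove exactly this manual feature-engineering step.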