Multikernel Recursive Least-Squares Temporal Difference Learning
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
[email protected], {qxzhu,xinzhengniu}@uestc.edu.cn
2 College of Information Science and Technology, Hainan University, Haikou, China
Abstract. Traditional least-squares temporal difference (LSTD) algorithms provide an efficient way for policy evaluation, but their performance is greatly influenced by the manual selection of state features, and their approximation ability is often limited. To overcome these problems, we propose a multikernel recursive LSTD algorithm in this paper. Unlike previous kernel-based LSTD algorithms, the proposed algorithm uses the Bellman operator along with a projection operator, and constructs the sparse dictionary online. To avoid caching all history samples and to reduce the computational cost, it uses the sliding-window technique. To avoid overfitting and to reduce the bias caused by the sliding window, it also employs L2 regularization. In particular, to improve the approximation ability, it uses the multikernel technique, which, to the best of our knowledge, is applied to value-function prediction for the first time. Experimental results on a 50-state chain problem show the good performance of the proposed algorithm in terms of convergence speed and prediction accuracy.

Keywords: Reinforcement learning · Temporal difference · Recursive least squares · Multiple kernels · Online sparsification · L2 regularization
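To make the multikernel technique mentioned above concrete, the following is a minimal sketch of one common way to build a multikernel: a convex combination of Gaussian base kernels with different widths. The function names, base kernels, and weights here are illustrative assumptions, not the construction used in the paper.

```python
import numpy as np

def gaussian_kernel(width):
    """Return a Gaussian (RBF) kernel function with bandwidth `width`."""
    return lambda x, y: np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2)
                               / (2.0 * width ** 2))

def multikernel(base_kernels, weights):
    """Combine several base kernels into one kernel via a convex combination.

    Illustrative only: the choice of base kernels and weights is an
    assumption, not the paper's specific multikernel construction.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalize so the weights sum to one
    def combined(x, y):
        return float(sum(wi * k(x, y) for wi, k in zip(w, base_kernels)))
    return combined

# Example: mix two Gaussian kernels with different widths so the combined
# kernel can capture both fine and coarse structure of the state space.
k = multikernel([gaussian_kernel(0.5), gaussian_kernel(2.0)], weights=[0.5, 0.5])
print(k(np.array([0.0]), np.array([0.3])))
```

Combining kernels of several widths is one way a single fixed kernel's limited approximation ability can be relaxed.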
1 Introduction

Reinforcement learning (RL) is a promising methodology for solving sequential decision problems. By interacting with an initially unknown environment, the agent aims to learn an optimal policy for maximizing its cumulative rewards [1]. In RL, a popular way to compute the optimal policy is through value functions, which are often estimated by temporal difference (TD) algorithms. For small discrete state-space problems, value functions can be stored in a table. But for many real-world problems, which have large or continuous state spaces, tabular TD algorithms will suffer from the "curse of dimensionality". In such cases, a common strategy is to employ linear function approximators, which are better understood and have more convergence guarantees than nonlinear function approximators [2]. One dominant linear TD formulation is the least-squares TD (LSTD) [3–5]. Compared with the standard TD algorithm, LSTD algorithms use samples more efficiently and eliminate all step-size parameters [6, 7]. However, their performance is greatly influenced by the manual
selection of state features, and their approximation ability is often limited [8]. In this paper, we focus on how to overcome these two drawbacks.

In the last two decades, kernel methods have been extensively studied in supervised and unsupervised learning [9, 10]. One appeal of kernel methods is the ability to handle nonlinear operations on or
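To make the drawback of manually chosen features concrete, the following is a minimal sketch of standard batch LSTD(0) with hand-crafted features, not the recursive multikernel algorithm proposed in this paper. The function name `lstd`, the `transitions` format, and the `ridge` term are illustrative assumptions.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.95, ridge=1e-3):
    """Batch LSTD(0) with hand-crafted state features.

    transitions : list of (s, r, s_next) tuples collected under a fixed policy
    phi         : callable mapping a state to a feature vector (np.ndarray);
                  choosing phi by hand is the drawback discussed above
    ridge       : small diagonal term that keeps the matrix A invertible
    """
    k = phi(transitions[0][0]).shape[0]
    A = ridge * np.eye(k)
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)   # A <- A + phi(s) (phi(s) - gamma phi(s'))^T
        b += r * f                             # b <- b + r phi(s)
    theta = np.linalg.solve(A, b)              # value estimate: V(s) ~= phi(s)^T theta
    return theta
```

The quality of the resulting value estimate hinges entirely on the choice of `phi`; kernel-based and multikernel variants of LSTD aim to remove exactly this manual feature-engineering step.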