RESEARCH ARTICLE

Distributed multi-agent temporal-difference learning with full neighbor information

Zhinan Peng¹ · Jiangping Hu¹ · Rui Luo¹ · Bijoy K. Ghosh¹,²

Received: 4 July 2020 / Revised: 15 September 2020 / Accepted: 15 September 2020
© South China University of Technology, Academy of Mathematics and Systems Science, CAS and Springer-Verlag GmbH Germany, part of Springer Nature 2020

* Corresponding author: Jiangping Hu ([email protected])

¹ School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
² Department of Mathematics and Statistics, Texas Tech University, Lubbock, TX 79409-1042, USA

Abstract  This paper presents a novel distributed multi-agent temporal-difference learning framework for value function approximation that allows agents to use the information from all of their neighbors rather than from a single neighbor. With full neighbor information, the proposed framework (1) achieves a faster convergence rate and (2) is more robust than state-of-the-art approaches. Based on this framework, we then propose a distributed multi-agent discounted temporal-difference algorithm and a distributed multi-agent average-cost temporal-difference learning algorithm, and we provide theoretical convergence proofs for both algorithms. Numerical simulation results show that the proposed algorithms are superior to the gossip-based algorithm in convergence speed and in robustness to noise and time-varying network topology.

Keywords  Distributed algorithm · Reinforcement learning · Temporal-difference learning · Multi-agent systems
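To make the contrast between a gossip-based update and the use of full neighbor information concrete, the sketch below compares the two parameter-combination steps in schematic form. It is an illustrative assumption-laden sketch, not the paper's actual update rules: the uniform weights, function names and data layout are placeholders introduced here for exposition.

```python
import numpy as np

def gossip_combine(theta, neighbors, i, rng):
    """Gossip-style step: agent i mixes its parameter vector with ONE
    randomly chosen neighbor (the baseline compared against in the paper).
    Uniform 1/2-1/2 mixing is assumed for illustration."""
    j = rng.choice(neighbors[i])
    return 0.5 * (theta[i] + theta[j])

def full_neighbor_combine(theta, neighbors, i):
    """Full-neighbor step: agent i combines the parameter vectors of ALL
    its neighbors and itself (uniform weights assumed for illustration)."""
    idx = [i] + list(neighbors[i])
    return np.mean([theta[k] for k in idx], axis=0)
```

Here `theta` is a list of per-agent parameter vectors and `neighbors[i]` lists the indices of agent i's neighbors; in either scheme, each agent would follow the combination step with its own local TD update.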

1 Introduction

Reinforcement learning (RL) has been widely applied to sequential decision making, intelligent control and automation engineering [1–7]. In 2016, AlphaGo defeated the top professional Go player, an achievement that invigorated research in artificial intelligence (AI). In the meantime, RL, one of the key technologies behind AlphaGo, has drawn increasing attention from both academic researchers and industrial practitioners [8, 9]. As one of the most popular RL methods, temporal-difference (TD) learning has been shown to be effective for model-free sequential decision-making problems in Markov decision processes (MDPs) [10, 11]. Stochastic approximation methods [12, 13] and gradient-based TD methods [14, 15] have been proposed to estimate the mean square projected Bellman error (MSPBE); these methods scale linearly with the number of features in terms of computation time and memory usage.
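As a reference point for the single-agent setting discussed above, the following sketch shows classical TD(0) with linear value function approximation. The environment interface (`reset`, `sample_action`, `step`), the feature map `phi`, and the step size are placeholders introduced for illustration; this is the standard single-agent routine, not the distributed multi-agent method developed in this paper.

```python
import numpy as np

def td0_linear(env, phi, num_features, num_episodes=100, alpha=0.05, gamma=0.99):
    """Classical single-agent TD(0) with a linear value function V(s) ~ phi(s)·w.

    env : placeholder environment with reset() -> s, sample_action() -> a,
          and step(a) -> (s_next, reward, done)
    phi : feature map, phi(s) -> np.ndarray of length num_features
    """
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = env.sample_action()                      # fixed behavior policy
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else phi(s_next) @ w    # bootstrapped value of s'
            delta = r + gamma * v_next - phi(s) @ w      # TD error
            w += alpha * delta * phi(s)                  # semi-gradient update
            s = s_next
    return w
```

Both the computation and the memory of each update are linear in `num_features`, which is the scaling property noted above.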

Making TD learning distributed is therefore of vital importance: distributed TD learning with linear value function approximation has remained one of the most important open problems in this research field for more than a decade. A more general and practical challenge is how to parameterize and approximate the value function in a high-dimensional or infinite state space. Macua appl