Efficient policy detecting and reusing for non-stationarity in Markov games


Autonomous Agents and Multi-Agent Systems (2021) 35:2

Yan Zheng1,2 · Jianye Hao1 · Zongzhang Zhang3 · Zhaopeng Meng1 · Tianpei Yang1 · Yanran Li4 · Changjie Fan5

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

One challenging problem in multiagent systems is cooperating or competing with non-stationary agents that change their behavior from time to time. An agent in such a non-stationary environment is expected to quickly detect the other agents' policies during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. For accurate policy detection, we propose a rectified belief model that leverages an opponent model to infer the other agents' policies from both reward signals and observed behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.

Keywords Non-stationary agents · Deep reinforcement learning · Opponent modeling · Bayesian policy reuse
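The abstract describes the detection-and-reuse loop only informally. The sketch below illustrates the kind of Bayesian belief update over a library of known opponent policies that BPR-style methods rely on; all names, the Gaussian performance model, and the use of episode return as the detection signal are assumptions made for illustration, not the exact DPN-BPR+ formulation.

```python
import numpy as np

# A minimal sketch of BPR-style policy detection and reuse (hypothetical names;
# the Gaussian performance model below is an assumption, not the paper's model).

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def update_belief(belief, reward, perf_models, played_policy):
    """Re-weight the belief over known opponent policies after one episode.

    belief        : np.ndarray, prior probability of each opponent policy
    reward        : float, episode return observed while using `played_policy`
    perf_models   : dict (own_policy, opponent_policy) -> (mean, std) of the
                    expected return, assumed to be learned offline
    played_policy : index of the policy our agent just used
    """
    likelihoods = np.array([
        gaussian_pdf(reward, *perf_models[(played_policy, opp)])
        for opp in range(len(belief))
    ])
    posterior = likelihoods * belief
    return posterior / posterior.sum()

def select_policy(belief, perf_models, n_own_policies):
    """Pick the own policy with the highest expected return under the belief."""
    expected = [
        sum(belief[opp] * perf_models[(pi, opp)][0] for opp in range(len(belief)))
        for pi in range(n_own_policies)
    ]
    return int(np.argmax(expected))
```

In this reading, the rectified belief model refines the reward-based likelihood with predictions from an opponent model, and the distilled policy network replaces the explicit per-opponent policy library with a single network from which the selected response policy is reused.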

1 Introduction

As deep reinforcement learning (DRL) achieves tremendous success, a variety of advanced DRL techniques [26,27,31,34,37] have been proposed to solve complex problems, including game playing [27], robotics [22] and recommendation [40].

This is an extended version of the paper [41] presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018. Extended author information is available on the last page of the article.


However, this thread focuses on single-agent domains, without explicitly considering coexisting agents in the environment. Another thread focuses on scenarios involving multiagent interactions, commonly known as multiagent systems (MAS), where agents need to cooperate or compete with each other to maximize their respective long-term rewards. To this end, agents need to leverage as much useful information as possible. Many prior works have shown that, in MAS, it is essential for agents to learn by taking the other coexisting agents' behaviors into account [4,11,21,23,25]. However, some existing multiagent DRL algorithms