Efficient policy detecting and reusing for non-stationarity in Markov games


Autonomous Agents and Multi-Agent Systems (2021) 35:2

Yan Zheng1,2 · Jianye Hao1 · Zongzhang Zhang3 · Zhaopeng Meng1 · Tianpei Yang1 · Yanran Li4 · Changjie Fan5

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

One challenging problem in multiagent systems is cooperating or competing with non-stationary agents that change their behavior from time to time. An agent in such a non-stationary environment is expected to quickly detect the other agents' policies during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. For accurate policy detection, we propose a rectified belief model that leverages an opponent model to infer the other agents' policies from both reward signals and observed behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.

Keywords Non-stationary agents · Deep reinforcement learning · Opponent modeling · Bayesian policy reuse
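The abstract describes the detection-and-reuse loop only informally. The sketch below illustrates the kind of Bayesian belief update over a library of known opponent policies that BPR-style methods rely on; all names, the Gaussian performance model, and the use of episode return as the detection signal are assumptions made for illustration, not the exact DPN-BPR+ formulation.

```python
import numpy as np

# A minimal sketch of BPR-style policy detection and reuse (hypothetical names;
# the Gaussian performance model below is an assumption, not the paper's model).

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def update_belief(belief, reward, perf_models, played_policy):
    """Re-weight the belief over known opponent policies after one episode.

    belief        : np.ndarray, prior probability of each opponent policy
    reward        : float, episode return observed while using `played_policy`
    perf_models   : dict (own_policy, opponent_policy) -> (mean, std) of the
                    expected return, assumed to be learned offline
    played_policy : index of the policy our agent just used
    """
    likelihoods = np.array([
        gaussian_pdf(reward, *perf_models[(played_policy, opp)])
        for opp in range(len(belief))
    ])
    posterior = likelihoods * belief
    return posterior / posterior.sum()

def select_policy(belief, perf_models, n_own_policies):
    """Pick the own policy with the highest expected return under the belief."""
    expected = [
        sum(belief[opp] * perf_models[(pi, opp)][0] for opp in range(len(belief)))
        for pi in range(n_own_policies)
    ]
    return int(np.argmax(expected))
```

In this reading, the rectified belief model refines the reward-based likelihood with predictions from an opponent model, and the distilled policy network replaces the explicit per-opponent policy library with a single network from which the selected response policy is reused.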

1 Introduction

As deep reinforcement learning (DRL) achieves tremendous success, a variety of advanced DRL techniques [26,27,31,34,37] have been proposed to solve complex problems, including game playing [27], robotics [22] and recommendation [40].

This is an extended version of the paper [41] presented at the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018. Extended author information is available on the last page of the article.


However, this thread focuses on single-agent domains, without explicitly considering coexisting agents in the environment. Another thread focuses on scenarios involving multiagent interactions, commonly known as multiagent systems (MAS), where agents need to cooperate or compete with each other to maximize their respective long-term rewards. To this end, agents need to leverage as much useful information as possible. Many prior works have shown that, in MAS, it is essential for agents to learn by taking the other coexisting agents' behaviors into account [4,11,21,23,25]. However, some existing multiagent DRL algorithms