Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning

Diqi Chen · Yizhou Wang · Wen Gao

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies and value functions because of their fast convergence, but pure gradient-based methods cannot guarantee a uniformly approximated Pareto frontier. Evolution strategies, on the other hand, operate directly in the solution space and can achieve a well-distributed Pareto frontier, yet applying them to optimize deep networks remains challenging. To leverage the advantages of both kinds of methods, we propose a two-stage MORL framework that combines a gradient-based method and an evolution strategy. First, an efficient multi-policy soft actor-critic algorithm is proposed to learn multiple policies collaboratively; the lower layers of all policy networks are shared, so this first stage can be regarded as representation learning. Second, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) is applied to fine-tune the policy-independent parameters and approach a dense, uniform estimate of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) show the superiority of the proposed method.

Keywords Multi-objective reinforcement learning · Multi-policy reinforcement learning · Pareto frontier · Sampling efficiency
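To make the shared-representation idea in the abstract concrete, the following is a minimal sketch of a multi-head policy network whose lower layers are shared across all policies. It assumes a PyTorch-style implementation; the class name SharedTrunkPolicies, the layer sizes, and the number of heads are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (assumed PyTorch-style) of policies with shared lower layers.
# Layer sizes, number of policies, and all names are illustrative assumptions.
import torch
import torch.nn as nn

class SharedTrunkPolicies(nn.Module):
    def __init__(self, obs_dim, act_dim, n_policies, hidden=256):
        super().__init__()
        # Shared lower layers: trained in the first, gradient-based stage and
        # acting as a common state representation for all policies.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Policy-independent heads, one per policy; in the second stage an
        # evolution strategy such as MO-CMA-ES could fine-tune these alone.
        self.heads = nn.ModuleList(
            nn.Linear(hidden, act_dim) for _ in range(n_policies)
        )

    def forward(self, obs, policy_idx):
        features = self.trunk(obs)
        return self.heads[policy_idx](features)  # action logits / mean

# Usage: evaluate the third policy head on a batch of observations.
net = SharedTrunkPolicies(obs_dim=8, act_dim=4, n_policies=6)
obs = torch.randn(32, 8)
print(net(obs, policy_idx=2).shape)  # torch.Size([32, 4])
```

Keeping the per-policy parameters small and separate from the shared trunk is what makes a second, evolution-strategy fine-tuning stage tractable for deep networks, since only the head parameters need to be searched.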

1 Introduction

Deep reinforcement learning (RL) algorithms have been successfully applied to many challenging decision-making problems, such as video games [32, 33, 60], the game of Go [48, 50], and robotics [19, 20, 27, 49]. In these scenarios, only one objective is optimized. Nevertheless, many real-world decision-making problems involve more than one objective. Network routing takes energy, latency, and channel capacity into account [35]. Medical treatment needs to relieve symptoms while minimizing side effects [28, 29]. Economic systems are analyzed from both economic and ecological perspectives [57]. Furthermore, by adding extra reward signals that encode domain knowledge, reward-shaping methods are used to encourage agent exploration [11, 36].


The objectives in multi-objective reinforcement learning (MORL) are often conflicting: maximizing one objective typically degrades the others [37]. Therefore, rather than learning a single optimal solution, MORL aims to find a set of mutually incomparable solutions. A solution dominates another if it is at least as good on every objective and strictly better on at least one; in this set, no solution is dominated by any other. This non-dominated set is called the Pareto frontier. And the non-dominated solution is