Reinforcement learning algorithm for non-stationary environments


Sindhu Padakandla · Prabuchandran K. J. · Shalabh Bhatnagar
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
Contact: Sindhu Padakandla [email protected] · Prabuchandran K. J. [email protected] · Shalabh Bhatnagar [email protected]

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationarity assumption on the environment is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, one often encounters situations with non-stationary environments, and in these scenarios RL methods yield sub-optimal decisions. In this paper, we thus consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal is to maximize the long-term discounted reward accrued when the underlying model of the environment changes over time. To achieve this, we first adapt a change point detection algorithm to detect changes in the statistics of the environment, and then develop an RL algorithm that maximizes the long-run reward accrued. We show that our change point method effectively detects changes in the model of the environment and thereby facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.

Keywords: Markov decision processes · Reinforcement learning · Non-stationary environments · Change detection
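As a concrete illustration of the approach summarized above, the following is a minimal sketch (in Python) of how a change point detector can be coupled with tabular Q-learning. It is not the authors' algorithm: the environment interface (reset/step), the sliding-window mean-shift test, and the choice to reset the Q-table on a detected change are all illustrative assumptions standing in for the paper's actual change point detection method and adaptation strategy.

    import numpy as np

    def q_learning_with_change_detection(env, n_states, n_actions, steps=10000,
                                         alpha=0.1, gamma=0.95, epsilon=0.1,
                                         window=50, threshold=3.0):
        # Tabular Q-learning that discards stale value estimates whenever a
        # shift in the reward statistics is flagged. Assumes env.reset()
        # returns an integer state and env.step(a) returns
        # (next_state, reward, done).
        Q = np.zeros((n_states, n_actions))
        recent, reference = [], None
        s = env.reset()
        for _ in range(steps):
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # standard Q-learning update
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            # crude change detection: compare the mean reward of the current
            # window against the previous window (a stand-in statistic only)
            recent.append(r)
            if len(recent) == window:
                mean, std = np.mean(recent), np.std(recent) + 1e-8
                if reference is not None and abs(mean - reference) / std > threshold:
                    Q[:] = 0.0  # model presumably changed: restart learning
                reference = mean
                recent = []
            s = env.reset() if done else s_next
        return Q

Resetting the Q-table is only the simplest possible reaction to a detected change; more refined strategies, such as maintaining separate value estimates per detected environment model, are closer in spirit to what a method of the kind proposed in this paper would do.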

1 Introduction

Autonomous agents are increasingly being designed for sequential decision-making tasks under uncertainty in various domains. For example, in traffic signal control [34], an autonomous agent decides on the green signal duration for all lanes at a traffic junction, while in robotic applications, human-like robotic agents are built to dexterously manipulate physical objects [3, 29]. The common aspect in these applications is that the state of the system evolves based on the decisions made by the agent. In traffic signal control, for instance, the state is the vector of current congestion levels at the various lanes of a junction and the agent decides on the green signal duration for all lanes at the junction, while in a robotic application, the state can be the motor angles of the joints, and the robot decides on the torque for all motors. The key aspect is that a decision by the agent affects the immediate next state of the system and the reward (or cost) obtained, as well as the future states. Further, the sequence of decisions made by the agent is ranked according to a fixed performance criterion, which is a function of the rewards obtained for all decisions made. The central problem in sequential decision-making is that the agent must find a sequence of decisions for every state such that this performance criterion is maximized.
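To make this criterion precise, a standard choice, and the one the abstract refers to as the long-term discounted reward, is the infinite-horizon discounted objective. For a policy \pi mapping states to decisions,

    \[
    V^{\pi}(s) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 = s,\; a_t = \pi(s_t)\right], \qquad 0 \le \gamma < 1,
    \]

where r(s_t, a_t) is the reward obtained at time t and \gamma is the discount factor. The agent seeks a policy \pi^* satisfying V^{\pi^*}(s) \ge V^{\pi}(s) for every state s and every policy \pi. In the non-stationary setting studied in this paper, the transition and reward model generating these quantities may itself change over time, which is precisely what makes the standard stationary-MDP treatment yield sub-optimal decisions.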
