Monte Carlo Tree Search for Verifying Reachability in Markov Decision Processes

The maximum reachability probabilities in a Markov decision process can be computed using value iteration (VI). Recently, simulation-based heuristic extensions of VI have been introduced, such as bounded real-time dynamic programming (BRTDP), which often

  • PDF / 498,411 Bytes
  • 14 Pages / 439.37 x 666.142 pts Page_size
  • 118 Downloads / 185 Views

DOWNLOAD

REPORT


l University of Munich, Munich, Germany [email protected] 2 Masaryk University, Brno, Czech Republic

Abstract. The maximum reachability probabilities in a Markov decision process can be computed using value iteration (VI). Recently, simulation-based heuristic extensions of VI have been introduced, such as bounded real-time dynamic programming (BRTDP), which often manage to avoid explicit analysis of the whole state space while preserving guarantees on the computed result. In this paper, we introduce a new class of such heuristics, based on Monte Carlo tree search (MCTS), a technique celebrated in various machine-learning settings. We provide a spectrum of algorithms ranging from MCTS to BRTDP. We evaluate these techniques and show that for larger examples, where VI is no more applicable, our techniques are more broadly applicable than BRTDP with only a minor additional overhead.

1

Introduction

Markov decision processes (MDP) [Put14] are a classical formalism for modelling systems with both non-deterministic and probabilistic behaviour. Although there are various analysis techniques for MDP that run in polynomial time and return precise results, such as those based on linear programming, they are rarely used. Indeed, dynamic programming techniques, such as value iteration (VI) or policy iteration, are usually preferred despite their exponential complexity. The reason is that for systems of sizes appearing in practice not even polynomial algorithms are useful and heuristics utilizing the structure of the humandesigned systems become a necessity. Consequently, probabilistic model checking has adopted [BCC+14,ACD+17,KM17,KKKW18,DJKV17] techniques, which generally come with weaker guarantees on correctness and running time, but in practice perform better. These techniques originate from reinforcement learning, such as delayed Q-learning [SLW+06], or probabilistic planning, such as bounded real-time dynamic programming (BRTDP) [MLG05]. This research was supported in part by Deutsche Forschungsgemeinschaft (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE) project 10.06 PARSEC, the Czech Science Foundation grant No. 18-11193S, and the DFG project 383882557 Statistical Unbounded Verification. c Springer Nature Switzerland AG 2018  T. Margaria and B. Steffen (Eds.): ISoLA 2018, LNCS 11245, pp. 322–335, 2018. https://doi.org/10.1007/978-3-030-03421-4_21

Monte Carlo Tree Search for Verifying Reachability

323

Since verification techniques are applied in safety-critical settings, the results produced by the techniques are required to be correct and optimal, or at least ε-optimal for a given precision ε. To this end, we follow the general scheme of [BCC+14] and combine the non-guaranteed heuristics with the traditional guaranteed techniques such as VI. However, while pure VI analyses the whole state space, the heuristics often allow us to focus only on a small part of it and still give precise estimates of the induced error. This approach was applied in [BCC+14], yielding a variant of