A Novel Policy Iteration-Based Deterministic Q-Learning for Discrete-Time Nonlinear Systems

In this chapter, a novel iterative Q-learning algorithm, called “policy iteration-based deterministic Q-learning algorithm,” is developed to solve the optimal control problems for discrete-time deterministic nonlinear systems. The idea is to use an iterat




4.1 Introduction

Many traditional iterative adaptive dynamic programming (ADP) algorithms require building a model of the nonlinear system before the algorithm can be run to derive an improved control policy [1, 3, 6, 7, 11, 13–19, 24, 26, 30, 31, 33–37, 39–42]. These iterative ADP algorithms are denoted as “model-based ADP algorithms.” In contrast, Q-learning, proposed by Watkins [28, 29], is a typical data-based ADP algorithm. In [13, 23], Q-learning was named action-dependent heuristic dynamic programming (ADHDP). Q-learning algorithms use Q functions in place of the value functions of the traditional iterative ADP algorithms. A Q function depends on both the system state and the control, which means that it already includes information about the system and the utility function. Hence, it is easier to compute control policies from Q functions than from the traditional performance index functions [5]. Because of this merit, Q-learning algorithms are preferred for obtaining the optimal control of unknown and model-free systems [5, 12, 29]. In [29], a convergence proof of the Q-learning algorithm was given for a stochastic environment. However, many real-world control systems are deterministic, and optimizing them requires deterministic convergence and stability properties. Furthermore, previous iterative Q-learning algorithms were based on value iteration [4, 5, 9, 10, 12, 28, 29, 32]. Although the iterative Q functions converged to the optimum, stability of the system under the iterative control laws could not be guaranteed. Thus, for previous iterative Q-learning algorithms, only the converged optimal control law can be used to control the nonlinear system, and the iterative control laws obtained during the iteration procedure may not stabilize the system. This makes the computational efficiency of the previous iterative Q-learning algorithms very low.
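The remark that control policies are easier to compute from Q functions than from performance index functions can be illustrated with a small sketch. The toy system below is hypothetical and not taken from the chapter: the tables `F` (next state) and `U` (utility), the three states, and the two controls are all assumptions made for illustration, and the cost is assumed to be minimized without discounting. The point is only that the greedy control from a value function needs `F` and `U`, whereas the greedy control from a Q function is a plain minimization over controls.

```python
import numpy as np

# Hypothetical toy problem (an assumption, not from the chapter):
# 3 states, 2 controls. F[x, u] is the next state, U[x, u] the utility.
F = np.array([[1, 2],
              [0, 2],
              [2, 2]])        # state 2 is absorbing
U = np.array([[1.0, 4.0],
              [2.0, 1.0],
              [0.0, 0.0]])    # zero utility once state 2 is reached

def policy_from_value(V, F, U):
    """Greedy control from a value function: requires the model F and U."""
    return np.argmin(U + V[F], axis=1)

def policy_from_q(Q):
    """Greedy control from a Q function: no model needed, only Q itself."""
    return np.argmin(Q, axis=1)

# Obtain the optimal value function by value iteration (used here only to
# build the reference tables; this is not the chapter's algorithm).
V = np.zeros(3)
for _ in range(50):
    V = np.min(U + V[F], axis=1)

# The Q function folds the utility and the system dynamics into one table.
Q = U + V[F]

print(policy_from_value(V, F, U))  # uses F and U
print(policy_from_q(Q))            # uses Q alone
```

Both functions return the same control for every state, but only `policy_from_value` has to be given the system model, which is why Q functions suit unknown or model-free systems.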
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2018
Q. Wei et al., Self-Learning Optimal Control of Nonlinear Systems, Studies in Systems, Decision and Control 103, DOI 10.1007/978-981-10-4080-1_4

Hence, new iterative Q-learning algorithms, together with methods for analyzing their properties, need to be developed for deterministic nonlinear systems. This motivates our research. In this chapter, a novel iterative Q-learning algorithm based on policy iteration, denoted as the “policy iteration-based deterministic Q-learning algorithm,” is developed for discrete-time deterministic nonlinear systems. First, the policy iteration-based deterministic Q-learning algorithm is derived. The differences between the previous Q-learning algorithms and the developed policy iteration-based deterministic Q-learning algorithm are presented. Second, property analysis, including convergence and stability properties, is established for the developed iterative Q-learning algorithm. We emphasize that our theoretical contribution