Blackwell Optimality in the Class of Markov Policies for Continuous-Time Controlled Markov Chains


Tomás Prieto-Rumeau

Received: 23 August 2004 / Accepted: 3 July 2006 / Published online: 5 September 2006 © Springer Science + Business Media B.V. 2006

Abstract This paper deals with Blackwell optimality for continuous-time controlled Markov chains with compact Borel action space, and possibly unbounded reward (or cost) rates and unbounded transition rates. We prove the existence of a deterministic stationary policy which is Blackwell optimal in the class of all admissible (nonstationary) Markov policies, thus extending previous results that analyzed Blackwell optimality in the class of stationary policies. We compare our assumptions to the corresponding ones for discrete-time Markov controlled processes.

Mathematics Subject Classifications (2000) 93E20 · 90C40 · 60J27

Key words continuous-time controlled Markov chains (or Markov decision processes) · Blackwell optimality · sensitive discount optimality.

1 Introduction

This paper studies the Blackwell optimality criterion, introduced by Blackwell in 1962, for continuous-time controlled Markov chains (CMCs). We begin by motivating this optimality criterion. Consider an infinite horizon stochastic control problem with reward rate r, and let ϕ be a control policy. We define the total expected reward of ϕ when the initial state of the system is i as

$$J_\infty(\phi, i) := E_i^{\phi} \int_0^\infty r(x^{\phi}(t))\,dt, \tag{1.1}$$

Research supported by the Spanish Secretaría de Estado de Educación y Universidades in cooperation with the European Social Funds.

T. Prieto-Rumeau, Departamento de Estadística, Facultad de Ciencias, Universidad Nacional de Educación a Distancia, Senda del Rey no. 9, Madrid 28040, Spain

Acta Appl Math (2006) 92: 77–96

where $\{x^{\phi}(t)\}$ denotes the state process and $E_i^{\phi}$ is the corresponding expectation operator (obviously, we do not aim at formality). It should be clear that, unless we consider a very restrictive control model, $J_\infty(\phi, i)$ will be, most of the time, infinite. Hence, we must find finite approximations to the ‘infinite’ total expected reward criterion. In fact, there are two different ways to make such an approximation. First, define the total expected reward of the policy ϕ on the time interval [0, T] as

$$J_T(\phi, i) := E_i^{\phi} \int_0^T r(x^{\phi}(t))\,dt.$$

Since

$$J_\infty(\phi, i) = \lim_{T \to \infty} J_T(\phi, i),$$

it follows that we should try to find an ‘asymptotically optimal’ policy as T → ∞. This leads to the concept of overtaking optimality, which was studied for continuous-time CMCs by Prieto-Rumeau and Hernández-Lerma [19]; see also the references therein.

The other approach is via the total expected discounted optimality criterion, that is, we suppose that the rewards earned by the controller are depreciated at a rate α > 0, and thus the corresponding expected discounted reward is

$$V_\alpha(\phi, i) := E_i^{\phi} \int_0^\infty e^{-\alpha t}\, r(x^{\phi}(t))\,dt. \tag{1.2}$$
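The contrast between the total reward (1.1) and the discounted reward (1.2) can be illustrated numerically. The following is a minimal sketch, not taken from the paper, under strong simplifying assumptions: a hypothetical two-state chain (states 0 and 1) driven by a fixed stationary policy, with assumed jump rates `a` (from 0 to 1) and `b` (from 1 to 0) and assumed constant reward rates `r0`, `r1`. In this finite setting the finite-horizon reward satisfies the linear differential equation dJ/dT = r + QJ with J at T = 0 equal to zero, where Q is the generator of the chain, and the discounted reward solves the linear system (αI − Q)V = r. All function names are illustrative.

```python
"""Total vs. discounted reward for a hypothetical two-state chain.

Assumed setup (not from the paper): states 0 and 1, a fixed stationary
policy, jump rate a from 0 to 1, jump rate b from 1 to 0, and constant
reward rates r0, r1. The generator is Q = [[-a, a], [b, -b]].
"""


def finite_horizon_reward(a, b, r0, r1, T, steps=100_000):
    """Approximate (J_T(0), J_T(1)) with the explicit Euler method.

    J_T solves dJ/dT = r + Q J with initial condition J = 0 at T = 0.
    """
    h = T / steps
    j0 = j1 = 0.0
    for _ in range(steps):
        d0 = r0 + a * (j1 - j0)  # first row of r + Q J
        d1 = r1 + b * (j0 - j1)  # second row of r + Q J
        j0, j1 = j0 + h * d0, j1 + h * d1
    return j0, j1


def discounted_reward(a, b, r0, r1, alpha):
    """Solve (alpha*I - Q) V = r in closed form via Cramer's rule."""
    det = alpha * (alpha + a + b)  # determinant of alpha*I - Q
    v0 = (r0 * (alpha + b) + a * r1) / det
    v1 = (b * r0 + (alpha + a) * r1) / det
    return v0, v1


def average_reward(a, b, r0, r1):
    """Reward rate averaged over the stationary distribution (b, a)/(a+b)."""
    return (b * r0 + a * r1) / (a + b)


if __name__ == "__main__":
    a, b, r0, r1 = 1.0, 2.0, 1.0, 4.0  # illustrative values only
    for T in (10.0, 100.0):
        print("T =", T, "J_T =", finite_horizon_reward(a, b, r0, r1, T))
    for alpha in (0.5, 0.05, 0.005):
        print("alpha =", alpha,
              "V_alpha =", discounted_reward(a, b, r0, r1, alpha))
```

In this toy example the finite-horizon reward grows roughly linearly in T, with slope equal to the long-run average reward, so its limit as T → ∞ is infinite unless that average is zero. The discounted reward, by contrast, is finite for every α > 0, and α times the discounted reward approaches the same average as α decreases to zero.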

Observe that, as opposed to (1.1), the expected discounted reward (1.2) is finite under mild assumptions on the c