Policy Design
Once a suitable definition of the system’s belief state is found, the system designer must define how actions are to be taken. The policy, denoted by π, is the component which decides the action. Section 2.3 gave a brief overview of established techniques for hand-crafting these decisions. This chapter discusses algorithms that can be used to automate the decision-making process. Section 5.1 starts the chapter with a brief introduction to policy learning theory. The use of learning in spoken dialogue requires approximations to reduce the effect of large action sets and large state spaces. These are provided through the use of summary actions, discussed in Sect. 5.2, and function approximations, discussed in Sect. 5.4. Example summary actions and function approximations for the TownInfo domain are provided in Sects. 5.3 and 5.5. Once suitable approximations are found, standard algorithms can be used for the policy optimisation. Section 5.6 gives details of one learning algorithm called Natural Actor Critic. Training the dialogue system with human users can be problematic, so the use of simulation is discussed in Sect. 5.7. An example application of learning for the TownInfo system is presented in Sect. 5.8.
5.1 Policy Learning Theory
The key feature of systems that use reinforcement learning to optimise the policy is the reward function, \(r(b, a)\). This function defines the reward obtained by taking action \(a\) when the system is in belief state \(b\). As discussed in Chap. 2, a suitable aim for the system is to choose actions that maximise the expected total reward in a dialogue,
\[ E(R) = E\left( \sum_{t=1}^{T} r(b_t, a_t) \right). \tag{5.1} \]
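As a rough illustration of Eq. 5.1, the expected total reward can be approximated by averaging the summed per-turn rewards over many sampled dialogues. The sketch below is not taken from the thesis: the toy dialogue model, the reward values (a penalty of 1 per turn, 20 on success) and the names `sample_dialogue` and `estimate_expected_return` are all hypothetical.

```python
import random


def total_reward(turn_rewards):
    """Inner sum of Eq. 5.1: total reward of one dialogue."""
    return sum(turn_rewards)


def sample_dialogue(rng, max_turns=10, p_success=0.3):
    """Hypothetical toy dialogue, not the TownInfo system.

    Each ordinary turn is penalised by 1. On any turn the task succeeds
    with probability p_success, giving a reward of 20 and ending the
    dialogue. The dialogue is cut off after max_turns turns.
    """
    rewards = []
    for _ in range(max_turns):
        if rng.random() < p_success:
            rewards.append(20.0)  # final turn: task completed
            break
        rewards.append(-1.0)      # ordinary turn: small penalty
    return rewards


def estimate_expected_return(n_dialogues, seed=0):
    """Monte Carlo estimate of E(R): the mean total reward over samples."""
    rng = random.Random(seed)
    returns = [total_reward(sample_dialogue(rng)) for _ in range(n_dialogues)]
    return sum(returns) / n_dialogues


if __name__ == "__main__":
    print(estimate_expected_return(20000))
```

Sampling of this kind needs no model of the belief transitions, but the estimate is noisy, which is one reason the recursive formulation under the Markov assumption is useful.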
B. Thomson, Statistical Methods for Spoken Dialogue Management, Springer Theses, DOI: 10.1007/978-1-4471-4923-1_5, © Springer-Verlag London 2013
This expected value is very difficult to compute without extra assumptions about the environment. The standard assumption, which will be used throughout this thesis, is that the belief state transitions are Markov and depend only on the previous value of \(b\). The belief state \(b\) may be continuous, and \(p(b' \mid b, a)\) then denotes a probability density function. Policies, denoted \(\pi\), are stochastic and define the probability of taking action \(a\) in state \(b\). Under the Markov assumption, the expected future reward, \(V^{\pi}\), when starting in a state \(b\) and following policy \(\pi\) can be defined recursively as
\[ V^{\pi}(b) = \sum_{a} \pi(b, a)\, r(b, a) + \int_{b'} \sum_{a} \pi(b, a)\, p(b' \mid b, a)\, V^{\pi}(b')\, db'. \tag{5.2} \]
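The recursion in Eq. 5.2 can be turned into a simple fixed-point iteration when the belief space is replaced by a finite set of points, so that the integral over the next belief state becomes a sum. The following is a minimal sketch under that assumption. The two belief points, the two actions and all probabilities and rewards are invented for illustration, and any probability mass missing from a transition row is taken to mean that the dialogue ends.

```python
def evaluate_policy(states, actions, policy, reward, transition,
                    max_sweeps=1000, tol=1e-12):
    """Iterate Eq. 5.2 until the values stop changing.

    policy[b][a]         -> pi(b, a)
    reward[b][a]         -> r(b, a)
    transition[b][a][b2] -> p(b2 | b, a); an empty row ends the dialogue
    """
    V = {b: 0.0 for b in states}
    for _ in range(max_sweeps):
        delta = 0.0
        for b in states:
            v = 0.0
            for a in actions:
                # Expected value of the next belief point (sum replaces integral).
                future = sum(p * V[b2] for b2, p in transition[b][a].items())
                v += policy[b][a] * (reward[b][a] + future)
            delta = max(delta, abs(v - V[b]))
            V[b] = v  # in-place update
        if delta < tol:
            break
    return V


# Hypothetical toy problem: the system is either unsure or sure of the user's goal.
STATES = ["unsure", "sure"]
ACTIONS = ["ask", "inform"]
POLICY = {"unsure": {"ask": 0.9, "inform": 0.1},
          "sure": {"ask": 0.0, "inform": 1.0}}
REWARD = {"unsure": {"ask": -1.0, "inform": -10.0},
          "sure": {"ask": -1.0, "inform": 20.0}}
TRANSITION = {
    "unsure": {"ask": {"sure": 0.8, "unsure": 0.2}, "inform": {}},
    "sure": {"ask": {"sure": 1.0}, "inform": {}},
}

V = evaluate_policy(STATES, ACTIONS, POLICY, REWARD, TRANSITION)

if __name__ == "__main__":
    print(V)
```

The iteration converges here because every dialogue under this policy eventually ends. With a continuous belief state this tabular approach is not available, which motivates the approximations discussed later in the chapter.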
Several other quantities are useful when working with MDPs. The Q-function, \(Q^{\pi}(b, a)\), is the expected future reward obtained by starting with a particular action and then following the policy. The advantage function, \(A^{\pi}(b, a)\), is the difference between the Q-function and the value function, \(V^{\pi}\). The occupancy frequency, \(d^{\pi}(b)\), gives the expected number of times each state is visited. These three quantities are given by the following equations:
\[ Q^{\pi}(b, a) = r(b, a) + \int_{b'} p(b' \mid b, a)\, V^{\pi}(b')\, db' \]
\[ A^{\pi}(b, a) = Q^{\pi}(b, a) - V^{\pi}(b) \]
\(d^{\pi}(b)\)
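The definitions of the Q-function and the advantage function can be checked numerically on a small example. The sketch below again assumes a finite set of belief points, so the integral becomes a sum. The toy problem and the value function `V` are hypothetical; the entries of `V` are the values of the stated policy, obtained by solving the recursive value equation by hand.

```python
def q_function(b, a, V, reward, transition):
    """Q(b, a): immediate reward plus expected value of the next belief point."""
    return reward[b][a] + sum(p * V[b2] for b2, p in transition[b][a].items())


def advantage(b, a, V, reward, transition):
    """A(b, a) = Q(b, a) - V(b)."""
    return q_function(b, a, V, reward, transition) - V[b]


# Hypothetical toy problem; an empty transition row ends the dialogue.
ACTIONS = ["ask", "inform"]
POLICY = {"unsure": {"ask": 0.9, "inform": 0.1},
          "sure": {"ask": 0.0, "inform": 1.0}}
REWARD = {"unsure": {"ask": -1.0, "inform": -10.0},
          "sure": {"ask": -1.0, "inform": 20.0}}
TRANSITION = {
    "unsure": {"ask": {"sure": 0.8, "unsure": 0.2}, "inform": {}},
    "sure": {"ask": {"sure": 1.0}, "inform": {}},
}
# Value function of POLICY on this toy problem (solved by hand).
V = {"unsure": 12.5 / 0.82, "sure": 20.0}

if __name__ == "__main__":
    for b in V:
        for a in ACTIONS:
            print(b, a,
                  q_function(b, a, V, REWARD, TRANSITION),
                  advantage(b, a, V, REWARD, TRANSITION))
```

Because \(V^{\pi}(b)\) is the policy-weighted average of \(Q^{\pi}(b, a)\), the advantages weighted by \(\pi(b, a)\) sum to zero in every state, which is a convenient sanity check.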