MDPs

A Markov decision process of the type we consider is a 5-tuple $\langle S,s_0,A,\Pr,R\rangle$, where $S$ is a finite set of fully observable states, $s_{0}\in S$ is the initial state, $A$ is a finite set of actions ($A(s)$ denotes the subset of actions applicable in $s\in S$), $\{\Pr(s,a,\bullet) \mid s\!\in\!S, a\!\in\!A(s)\}$ is a family of probability distributions over $S$, such that $\Pr(s,a,s')$ is the probability of being in state $s'$ after performing action $a$ in state $s$, and $R: S \mapsto \mathbb{R}$ is a reward function such that $R(s)$ is the immediate reward for being in state $s$. It is well known that such an MDP can be compactly represented using dynamic Bayesian networks [18,11] or probabilistic extensions of traditional planning languages, see, e.g., [37,45,50]. A stationary policy for an MDP is a function $\pi : S \mapsto A$, such that $\pi(s)\in A(s)$ is the action to be executed in state $s$. The value $V_\pi$ of the policy at $s_0$, which we seek to maximise, is the expected sum of future rewards over an infinite horizon, discounted by how far into the future they occur:

\begin{displaymath}
V_\pi(s_0) = \lim_{n\rightarrow\infty} \mathbb{E}\bigg[\sum_{i=0}^{n} \beta^{i} R(\Gamma_i) \mid \pi, \Gamma_{0} = s_{0}\bigg]
\end{displaymath}

where $0 \leq \beta < 1$ is the discount factor controlling the contribution of distant rewards, and $\Gamma_i$ is the random variable denoting the state of the process at stage $i$.
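As a concrete illustration (not taken from the paper), the following Python sketch evaluates a stationary policy on a tiny explicitly enumerated MDP by successive approximation, which converges to the discounted value $V_\pi$ defined above. All identifiers (the two-state example, \texttt{evaluate}, \texttt{eps}) are hypothetical and chosen only to mirror the tuple $\langle S,s_0,A,\Pr,R\rangle$.

\begin{verbatim}
# Illustrative sketch: a two-state MDP and iterative policy evaluation.
S = ["s0", "s1"]                     # finite, fully observable state set
s0 = "s0"                            # initial state
A = {"s0": ["a", "b"], "s1": ["a"]}  # A(s): actions applicable in s
P = {                                # Pr(s, a, s'): transition probabilities
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s0": 0.5, "s1": 0.5},
}
R = {"s0": 0.0, "s1": 1.0}           # R(s): immediate reward for being in s
pi = {"s0": "a", "s1": "a"}          # stationary policy: pi(s) in A(s)
beta = 0.9                           # discount factor, 0 <= beta < 1

def evaluate(pi, beta, eps=1e-8):
    """Approximate V_pi by repeatedly applying
    V(s) <- R(s) + beta * sum_{s'} Pr(s, pi(s), s') * V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            new_v = R[s] + beta * sum(p * V[t]
                                      for t, p in P[(s, pi[s])].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V

print(evaluate(pi, beta)[s0])        # approximate V_pi(s0)
\end{verbatim}

Because rewards here depend only on the state and $\Gamma_0 = s_0$, the fixed point of this update coincides with the limit of the truncated discounted sums in the expression for $V_\pi(s_0)$.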