A Markov decision process of the type we consider is a 5-tuple
$\langle S, s_0, A, \Pr, R \rangle$, where $S$ is a finite set of fully
observable states, $s_0 \in S$ is the initial state, $A$ is a finite
set of actions ($A(s) \subseteq A$ denotes the subset of actions applicable in
$s \in S$),
$\{\Pr(s,a,\cdot) \mid s \in S, a \in A(s)\}$ is a family of
probability distributions over $S$, such that $\Pr(s,a,s')$ is the
probability of being in state $s'$ after performing action $a$ in
state $s$, and
$R : S \mapsto \mathbb{R}$ is a reward function such that
$R(s)$ is the immediate reward for being in state $s$. It is well
known that such an MDP can be compactly represented using dynamic
Bayesian networks [18,11] or
probabilistic extensions of traditional planning languages,
see e.g., [37,45,50].
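For concreteness, a minimal explicit-state (flat) encoding of this 5-tuple is sketched below in Python; the class and all identifiers are illustrative only and are not part of the formal definition or of any particular planning language:

\begin{verbatim}
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    """Explicit representation of the 5-tuple <S, s0, A, Pr, R>."""
    states: List[State]                                           # S
    initial_state: State                                          # s0
    actions: Dict[State, List[Action]]                            # A(s)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # Pr(s, a, s')
    rewards: Dict[State, float]                                    # R(s)

# A tiny two-state example: "go" moves from s0 to s1 with probability 0.9.
mdp = MDP(
    states=["s0", "s1"],
    initial_state="s0",
    actions={"s0": ["go", "stay"], "s1": ["stay"]},
    transitions={
        ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
        ("s0", "stay"): {"s0": 1.0},
        ("s1", "stay"): {"s1": 1.0},
    },
    rewards={"s0": 0.0, "s1": 1.0},
)
\end{verbatim}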
A stationary policy for an MDP is a function
$\pi : S \mapsto A$, such
that
$\pi(s) \in A(s)$
is the action to be executed in state $s$. The
value $V_\pi$ of the policy at $s_0$, which we seek to maximise, is
the sum of the expected future rewards over an infinite horizon,
discounted by how far into the future they occur:
$$V_\pi(s_0) = \lim_{n \rightarrow \infty} E\Big[\sum_{i=0}^{n} \gamma^i \, R(s_i) \;\Big|\; \pi, s_0\Big]$$
where $s_i$ denotes the (random) state reached at stage $i$ and
$0 \leq \gamma < 1$ is the discount factor controlling the
contribution of distant rewards.
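As a sketch only, the value of a fixed stationary policy can be computed by iterating the corresponding Bellman equation, $V_\pi(s) = R(s) + \gamma \sum_{s'} \Pr(s,\pi(s),s') V_\pi(s')$, until convergence. The function below assumes the illustrative MDP class from the previous example and is not part of the formal development:

\begin{verbatim}
def policy_value(mdp: MDP, policy: Dict[State, Action],
                 gamma: float = 0.95, tol: float = 1e-8) -> Dict[State, float]:
    """Iterative policy evaluation:
    V(s) <- R(s) + gamma * sum_s' Pr(s, policy[s], s') * V(s')."""
    v = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            dist = mdp.transitions[(s, policy[s])]
            new_v = mdp.rewards[s] + gamma * sum(p * v[s2] for s2, p in dist.items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:          # stop once the largest update is below tolerance
            return v

# Value of the initial state under the policy that executes "go" in s0.
print(policy_value(mdp, {"s0": "go", "s1": "stay"})["s0"])
\end{verbatim}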