A decision process with non-Markovian rewards is identical to an MDP except that the domain of the reward function is $S^{*}$, the set of finite state sequences. The idea is that if the process has passed through state sequence $\Gamma(i)$ up to stage $i$, then the reward $R(\Gamma(i))$ is received at stage $i$. Figure 1 gives an example. Like the reward function, a policy for an NMRDP depends on history, and is a mapping from $S^{*}$ to $A$. As before, the value of policy $\pi$ is the expectation of the discounted cumulative reward over an infinite horizon:
\[
V_{\pi} \;=\; \lim_{n \rightarrow \infty} E\!\left[\, \sum_{i=0}^{n} \beta^{i}\, R(\Gamma(i)) \;\middle|\; \pi,\ \Gamma_{0} = s_{0} \right]
\]
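To make the definition concrete, the following is a minimal sketch (not taken from the paper) of estimating this value by Monte Carlo simulation with a truncated horizon. The names `policy`, `reward`, and `transition` are hypothetical interfaces: the policy and the reward both take the full state history, as required for an NMRDP, and `transition(s, a)` is assumed to return the distribution $\Pr(s, a, \cdot)$ as a dictionary.

```python
import random

def estimate_value(s0, policy, reward, transition,
                   beta=0.95, horizon=200, episodes=1000):
    """Monte Carlo estimate of the value of a history-dependent policy.

    Hypothetical interfaces (assumptions, not from the paper):
      policy(history)            -> action chosen for the state sequence so far
      reward(history)            -> immediate reward R(Gamma(i)) for that sequence
      transition(state, action)  -> dict mapping successor states to probabilities

    The infinite horizon is truncated at `horizon` steps, which is a
    reasonable approximation for a discount factor beta < 1.
    """
    total = 0.0
    for _ in range(episodes):
        history = [s0]
        value = 0.0
        discount = 1.0  # beta^i, starting at i = 0
        for _ in range(horizon):
            # reward at stage i depends on the whole prefix Gamma(i)
            value += discount * reward(tuple(history))
            a = policy(tuple(history))
            dist = transition(history[-1], a)
            # sample the next state from Pr(s, a, .)
            next_state = random.choices(list(dist), weights=list(dist.values()))[0]
            history.append(next_state)
            discount *= beta
        total += value
    return total / episodes
```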
For a decision process $D = \langle S, s_{0}, A, \Pr, R \rangle$ and a state $s \in S$, we let $\tilde{D}(s)$ stand for the set of state sequences rooted at $s$ that are feasible under the actions in $D$, that is: $\tilde{D}(s) = \{\, \Gamma \in S^{\omega} \mid \Gamma_{0} = s \mbox{ and } \forall i \;\exists a \in A(\Gamma_{i}) \mbox{ such that } \Pr(\Gamma_{i}, a, \Gamma_{i+1}) > 0 \,\}$. Note that the definition of $\tilde{D}(s)$ does not depend on $R$ and therefore applies to both MDPs and NMRDPs.
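As an illustration of the feasibility condition (again a sketch, not code from the paper; `actions` and `transition` are the same hypothetical interfaces as above), the following checks whether a finite prefix of a state sequence could be extended to a member of $\tilde{D}(s)$:

```python
def feasible_prefix(prefix, s, actions, transition):
    """Finite-prefix version of membership in D~(s): the sequence starts
    at s and every consecutive pair (Gamma_i, Gamma_{i+1}) is linked by
    some applicable action with positive transition probability."""
    if not prefix or prefix[0] != s:
        return False
    for cur, nxt in zip(prefix, prefix[1:]):
        if not any(transition(cur, a).get(nxt, 0.0) > 0.0
                   for a in actions(cur)):
            return False
    return True
```

Note that the check uses only $A$ and $\Pr$, mirroring the observation that $\tilde{D}(s)$ does not depend on the reward function.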
Figure 1: A Simple NMRDP