Having defined a language to represent behaviours to be rewarded, we
now turn to the problem of computing, given a reward formula, a
minimum allocation of rewards to states actually encountered in an
execution sequence, in such a way as to satisfy the formula. Because
we ultimately wish to use anytime solution methods which generate
state sequences incrementally via forward search, this computation is
best done on the fly, while the sequence is being generated. We
therefore devise an incremental algorithm based on a model-checking
technique normally used to check whether a state sequence is a model
of an FLTL formula [4]. This technique is known
as formula progression because it `progresses' or `pushes' the
formula through the sequence.
Our progression technique is shown in Algorithm 1.
In essence, it computes the modelling relation $\Vdash$ given in
Section 3.2. However, unlike the definition of
$\Vdash$, it is designed to be useful when states in the sequence
become available one at a time, in that it defers the evaluation of
the part of the formula that refers to the future to the point where
the next state becomes available. Let $s$ be a state, say $\Gamma_i$,
the last state of the sequence prefix $\Gamma(i)$ that has been
generated so far, and let $b$ be a boolean which is true iff $\Gamma(i)$ is in
the behaviour $B$ to be rewarded. Let the
formula $\phi$
describe the allocation of rewards over all possible futures. Then the
progression of $\phi$ through $s$ given $b$, written
$\mathrm{Prog}(b, s, \phi)$, is a
new formula which will describe the allocation of rewards over all
possible futures of the next state, given that we have just
passed through $s$. Crucially, the function $\mathrm{Prog}$ is Markovian,
depending only on the current state $s$ and the single boolean value $b$.
Note that $\mathrm{Prog}(b, s, \phi)$ is computable in linear time in the length of $\phi$,
and that for $-free formulae, it collapses to FLTL formula
progression [4], regardless of the value of $b$.
We assume that $\mathrm{Prog}$ incorporates the usual simplifications for
the sentential constants $\top$ and $\bot$:
$\top \wedge \phi$ simplifies to $\phi$,
$\bot \vee \phi$ simplifies to $\phi$, etc.
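For concreteness, the following Python sketch shows one way such a progression
function might be implemented. The tuple encoding, the function names and the
simplifier are illustrative assumptions rather than the paper's Algorithm 1; the
temporal rules follow standard FLTL progression, and the reward operator $ is
assumed to progress to $\top$ or $\bot$ according to the boolean flag.

# A minimal sketch (not the paper's Algorithm 1): formulas are Python tuples,
# with True/False standing for the sentential constants; states are sets of
# atom names.  Assumed encoding:
#   ('atom','p')  ('$',)  ('not',f)  ('and',f,g)  ('or',f,g)
#   ('next',f)  ('until',f,g)  ('always',f)  ('eventually',f)

def simplify1(op, *args):
    # Usual top-level simplifications for the sentential constants True / False.
    if op == 'not':
        (a,) = args
        return (not a) if isinstance(a, bool) else ('not', a)
    if op == 'and':
        a, b = args
        if a is False or b is False: return False
        if a is True: return b
        if b is True: return a
    if op == 'or':
        a, b = args
        if a is True or b is True: return True
        if a is False: return b
        if b is False: return a
    return (op, *args)

def prog(b, state, f):
    # One-step $FLTL progression of formula f through `state` given the flag b.
    if isinstance(f, bool):
        return f
    op = f[0]
    if op == 'atom':
        return f[1] in state
    if op == '$':
        return b                      # assumed rule: $ progresses to the flag b
    if op == 'not':
        return simplify1('not', prog(b, state, f[1]))
    if op in ('and', 'or'):
        return simplify1(op, prog(b, state, f[1]), prog(b, state, f[2]))
    if op == 'next':
        return f[1]                   # defer evaluation to the next state
    if op == 'until':                 # f1 U f2 : Prog(f2) or (Prog(f1) and f1 U f2)
        return simplify1('or', prog(b, state, f[2]),
                         simplify1('and', prog(b, state, f[1]), f))
    if op == 'always':                # []f1 : Prog(f1) and []f1
        return simplify1('and', prog(b, state, f[1]), f)
    if op == 'eventually':            # <>f1 : Prog(f1) or <>f1
        return simplify1('or', prog(b, state, f[1]), f)
    raise ValueError('unknown operator: %r' % op)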
The fundamental property of $\mathrm{Prog}$ is the following. Where $b$ is true iff
$\Gamma(i) \in B$:
$(B, \Gamma, i) \Vdash \phi$ iff $(B, \Gamma, i+1) \Vdash \mathrm{Prog}(b, \Gamma_i, \phi)$.
Proof:
See Appendix B.
Like
$\Vdash$, the function $\mathrm{Prog}$ seems to require $B$ (or at
least $b$) as input, but of course when progression is applied in
practice we only have $\phi_0$ and one new state at a time of $\Gamma$, and
what we really want to do is compute the appropriate $B$, namely
that represented by $\phi_0$. So, as in
Section 3.2, we now turn to the second step, which
is to use $\mathrm{Prog}$ to decide on the fly whether a newly generated
sequence prefix is in $B_{\phi_0}$ and so should be allocated a
reward. This is the purpose of the functions $\mathrm{\$Prog}$
and
$\mathrm{Rew}$, also given in Algorithm 1. Given
$\Gamma$ and $\phi_0$, the function $\mathrm{\$Prog}$
defines an infinite sequence of
formulae
$\langle \phi_0, \phi_1, \phi_2, \ldots \rangle$
in the obvious way: $\phi_{i+1} = \mathrm{\$Prog}(\Gamma_i, \phi_i)$.
To decide whether a prefix $\Gamma(i)$ of $\Gamma$ is to be rewarded,
$\mathrm{Rew}$ first tries progressing the formula through $\Gamma_i$
with the boolean flag set to `false'. If that gives a consistent
result, we need not reward the prefix and we continue without
rewarding $\Gamma(i)$, but if the result is $\bot$ then we know that $\Gamma(i)$
must be rewarded in order for $\Gamma$ to satisfy $\phi_0$. In
that case, to obtain $\phi_{i+1}$ we must progress $\phi_i$ through $\Gamma_i$
again, this time with the boolean flag set to the value
`true'. To sum up, the behaviour corresponding to $\phi_0$ is
$B_{\phi_0} = \{\Gamma(i) \mid \mathrm{Rew}(\Gamma_i, \phi_i)\}$.
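Continuing the sketch, the decision step just described can be layered on top of
prog; the names rew and dollar_prog are ours, standing in for $\mathrm{Rew}$ and
$\mathrm{\$Prog}$.

def rew(state, f):
    # Reward the current prefix iff progressing with the flag set to false
    # yields the contradiction (False): only then is a reward forced.
    return prog(False, state, f) is False

def dollar_prog(state, f):
    # Progress f through `state`, setting the flag to whatever rew dictates.
    return prog(rew(state, f), state, f)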
To illustrate the behaviour of $FLTL progression, consider the
formula
$\phi_0 = \neg p \mathbin{U} (p \wedge \$)$ stating that a reward
will be received the first time $p$ is true. Let $\Gamma_0$ be a state in
which $p$ holds; then
$\mathrm{Prog}(\mathrm{false}, \Gamma_0, \phi_0) = \bot$. Therefore,
since the formula has progressed to $\bot$,
$\mathrm{Rew}(\Gamma_0, \phi_0)$ is true and a reward is received.
$\mathrm{Prog}(\mathrm{true}, \Gamma_0, \phi_0) = \top$, so the reward formula
fades away and will not affect subsequent progression steps. If, on
the other hand, $p$ is false in $\Gamma_0$, then
$\mathrm{Prog}(\mathrm{false}, \Gamma_0, \phi_0) = \phi_0$. Therefore, since the formula has not
progressed to $\bot$,
$\mathrm{Rew}(\Gamma_0, \phi_0)$ is false and no reward is received.
$\mathrm{\$Prog}(\Gamma_0, \phi_0) = \phi_0$, so the
reward formula persists as is for subsequent progression steps.
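Replaying this example through the sketch above (with $\phi_0$ encoded as a
nested tuple) reproduces the same reasoning:

# phi0 = (not p) U (p and $): reward the first time p is true.
phi0 = ('until', ('not', ('atom', 'p')),
                 ('and', ('atom', 'p'), ('$',)))

s1 = {'p'}   # a state in which p holds
s2 = set()   # a state in which p is false

print(rew(s1, phi0))           # True  -- the prefix is rewarded
print(dollar_prog(s1, phi0))   # True  -- the formula fades away to top
print(rew(s2, phi0))           # False -- no reward
print(dollar_prog(s2, phi0))   # phi0 itself -- the formula persists as is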
The following theorem states that under weak assumptions, rewards are
correctly allocated by progression:
Theorem 1
Let $\phi_0$ be reward-normal, and let
$\langle \phi_1, \phi_2, \ldots \rangle$ be the result of progressing it through the successive states of a
sequence $\Gamma$ using the function
$\mathrm{\$Prog}$.
Then, provided no $\phi_i$ is
$\bot$, for all $i$:
$\mathrm{Rew}(\Gamma_i, \phi_i)$ iff
$\Gamma(i) \in B_{\phi_0}$.
Proof: See Appendix B.
The premise of the theorem is that $\phi$ never progresses to
$\bot$. Indeed if
$\mathrm{Prog}(\mathrm{true}, \Gamma_i, \phi_i) = \bot$ for some $i$, it means that even
rewarding $\Gamma(i)$ does not suffice to make $\phi_i$ true, so something
must have gone wrong: at some earlier stage, the boolean $b$ was
made false where it should have been made true. The usual explanation
is that the original $\phi_0$ was not reward-normal. For instance
$\bigcirc p \rightarrow \$$, which is reward-unstable, progresses to $\bot$ in
the next state if $p$ is true there: regardless of $\Gamma_0$,
$\mathrm{Prog}(\mathrm{false}, \Gamma_0, \bigcirc p \rightarrow \$) = \neg p$,
so $\mathrm{Rew}(\Gamma_0, \bigcirc p \rightarrow \$)$ is false and
$\phi_1 = \mathrm{\$Prog}(\Gamma_0, \bigcirc p \rightarrow \$) = \neg p$; hence if
$p \in \Gamma_1$
then, for either value of $b$,
$\phi_2 = \mathrm{Prog}(b, \Gamma_1, \neg p) = \bot$. However, other (admittedly
bizarre) possibilities exist: for example, although
$\bigcirc p \rightarrow \$$ is reward-unstable, its substitution instance
$\bigcirc (q \vee \neg q) \rightarrow \$$, which also progresses to $\bot$
in a few steps, is logically equivalent to $\$$ and is
reward-normal.
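The reward-unstable case can likewise be replayed in the sketch, encoding
$\bigcirc p \rightarrow \$$ as $\neg \bigcirc p \vee \$$ since the sketch has no
primitive implication:

# (next p) -> $ encoded as (not next p) or $  -- reward-unstable.
unstable = ('or', ('not', ('next', ('atom', 'p'))), ('$',))

phi1 = dollar_prog(set(), unstable)   # no reward forced; phi1 is ('not',('atom','p'))
phi2 = dollar_prog({'p'}, phi1)       # p holds next: phi2 is False (bottom),
print(phi1, phi2)                     # even though the flag was set to true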
If the progression method were to deliver the correct minimal behaviour
in all cases (even in all reward-normal cases) it would have to
backtrack on the choice of values for the boolean flags. In the
interest of efficiency, we choose not to allow backtracking. Instead,
our algorithm raises an exception whenever a reward formula progresses
to $\bot$, and informs the user of the sequence which caused the
problem. The onus is thus placed on the domain modeller to select
sensible reward formulae so as to avoid possible progression to $\bot$.
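A sketch of how such a check might wrap the progression loop (the exception name
and driver function are ours, not the actual implementation):

class ProgressionError(Exception):
    # Hypothetical exception signalling progression to bottom.
    pass

def progress_sequence(phi0, states):
    # Progress phi0 along the state sequence, yielding (rewarded?, next formula)
    # and raising an exception as soon as the formula progresses to bottom.
    phi = phi0
    for i, s in enumerate(states):
        r = rew(s, phi)
        phi = prog(r, s, phi)
        if phi is False:
            raise ProgressionError(
                'reward formula progressed to bottom at step %d; '
                'it is probably not reward-normal' % i)
        yield r, phi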
It should be noted that in the worst case, detecting reward-normality
cannot be easier than the decision problem for $FLTL, so it is not to
be expected that there will be a simple syntactic criterion for
reward-normality.
In practice, however, commonsense precautions such as avoiding making
rewards depend explicitly on future tense expressions suffice to keep
things normal in all routine cases. For a generous class of
syntactically recognisable reward-normal formulae, see Appendix
A.