Rewards

Markovian rewards, associated with state transitions, can be encoded using fluents (numeric state variables). PPDDL reserves the fluent reward, accessed as (reward) or reward, to represent the total accumulated reward since the start of execution. Rewards are associated with state transitions through update rules in action effects. The use of the reward fluent is restricted to action effects of the form (⟨additive-op⟩ ⟨reward fluent⟩ ⟨f-exp⟩), where ⟨additive-op⟩ is either increase or decrease, and ⟨f-exp⟩ is a numeric expression not involving reward. Action preconditions and effect conditions are not allowed to refer to the reward fluent, which means that the accumulated reward does not have to be considered part of the state space. The initial value of reward is zero. These restrictions on the use of the reward fluent allow a planner to handle domains with rewards without having to implement full support for fluents.
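The restrictions can be illustrated with a minimal sketch. The move action and the at predicate below are hypothetical and are not part of any domain discussed here; they only show which uses of the reward fluent fit the permitted form and which do not.

```lisp
; Hypothetical action, for illustration only: each move costs one unit of reward.
(:action move
  :parameters (?from ?to)
  :precondition (at ?from)
  :effect (and (not (at ?from))
               (at ?to)
               ; Allowed: additive update whose f-exp does not mention reward.
               (decrease (reward) 1)))

; Not allowed under the restrictions stated above:
;   (assign (reward) 0)                  ; assign is neither increase nor decrease
;   (increase (reward) (* 2 (reward)))   ; the f-exp refers to reward
;   (when (> (reward) 5) (at ?to))       ; an effect condition refers to reward
```

Because reward is only ever incremented or decremented by amounts that do not depend on its current value, a planner can treat each update as a transition reward and never needs to store the running total in the state.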

A new requirements flag, :rewards, is introduced to signal that support for rewards is required. Domains that require both probabilistic effects and rewards can declare the :mdp requirements flag, which implies :probabilistic-effects and :rewards.
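As a short illustration of the implication just described, the two declarations below should request the same capabilities; the second simply spells out what :mdp stands for.

```lisp
; Shorthand form.
(:requirements :mdp)

; Equivalent explicit form, per the implication stated above.
(:requirements :probabilistic-effects :rewards)
```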

Figure 2 shows part of the PPDDL encoding of a coffee delivery domain described by Dearden and Boutilier (1997). A reward of 0.8 is awarded if the user has coffee after the “deliver-coffee” action has been executed, and a reward of 0.2 is awarded if is-wet is false after execution of “deliver-coffee”. Note that a total reward of 1.0 can be awarded as a result of executing the “deliver-coffee” action if execution of the action leads to a state where both user-has-coffee and ¬is-wet hold.

Figure 2: Part of PPDDL encoding of “Coffee Delivery” domain.

(define (domain coffee-delivery)
  (:requirements :negative-preconditions
                 :disjunctive-preconditions
                 :conditional-effects :mdp)
  (:predicates (in-office) (raining) (has-umbrella) (is-wet)
               (has-coffee) (user-has-coffee))
  (:action deliver-coffee
    :effect (and (when (and (in-office) (has-coffee))
                   (probabilistic
                     0.8 (and (user-has-coffee)
                              (not (has-coffee))
                              (increase (reward) 0.8))
                     0.2 (and (probabilistic 0.5 (not (has-coffee)))
                              (when (user-has-coffee)
                                (increase (reward) 0.8)))))
                 (when (and (not (in-office)) (has-coffee))
                   (and (probabilistic 0.8 (not (has-coffee)))
                        (when (user-has-coffee)
                          (increase (reward) 0.8))))
                 (when (and (not (has-coffee)) (user-has-coffee))
                   (increase (reward) 0.8))
                 (when (not (is-wet))
                   (increase (reward) 0.2))))
  ... )
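As a worked example of how the encoding in Figure 2 assigns rewards, consider executing “deliver-coffee” in a state chosen here purely for illustration, where in-office and has-coffee hold and neither user-has-coffee nor is-wet holds. The calculation assumes that effect conditions are evaluated in the state where the action is applied, which this section does not state explicitly. Under that assumption only the first and last conditional effects apply. With probability 0.8 the delivery succeeds and the reward is 0.8 + 0.2 = 1.0; with probability 0.2 it fails and only the 0.2 for ¬is-wet is awarded. The expected reward for this single step would then be:

```latex
\begin{align*}
\mathbb{E}[\Delta\,\mathit{reward}]
  &= 0.8 \cdot (0.8 + 0.2) + 0.2 \cdot (0 + 0.2) \\
  &= 0.8 + 0.04 \\
  &= 0.84
\end{align*}
```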

Håkan L. S. Younes
2005-12-06