NMRDPP in the Probabilistic Planning Competition
We now report on the behaviour of NMRDPP in the probabilistic track
of the 4th International Planning Competition (IPC-4).
Since the competition did not feature non-Markovian rewards,
our original motivation in taking part was to further
compare the solution methods implemented in NMRDPP in a Markovian setting. In setting this objective, we
largely underestimated the challenges involved in merely getting a planner ready
for a competition, especially when that competition is the first of its kind.
In the end, we decided that successfully preparing NMRDPP to attempt all problems in the competition using one solution
method (and possibly search control knowledge) would be an
honorable result.
The most crucial problem we encountered was the translation of PPDDL
[50], the probabilistic variant of PDDL used as the
input language for the competition, into NMRDPP's ADD-based input
language. While translating PPDDL into ADDs is possible in theory,
devising a translation which is practical enough for the needs of the
competition (a small number of variables; small, quickly generated, and
easily manipulable ADDs) is another matter. MTBDD, the translator
kindly made available to participants by the competition organisers,
was not always able to achieve the required efficiency. At other
times, the translation was quick but NMRDPP was unable to use the
generated ADDs efficiently. Consequently, we implemented a state-based
translator on top of the PDDL parser as a backup, and opted for a
state-based solution method since it did not rely on ADDs and could
operate with both translators.
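To make the representational issue concrete, the sketch below (written in Python for this discussion; the toy pickup action and the variable names are invented, and it is not NMRDPP's actual translator) contrasts the explicit enumeration of transition probabilities that a state-based translation produces with the factored, variable-indexed transition function an ADD-based translation would maintain.

    from itertools import product

    VARIABLES = ["holding", "on_table"]          # toy boolean state variables

    def pickup_outcomes(state):
        """Toy probabilistic 'pickup' action: succeeds with probability 0.75,
        drops the block back onto the table with probability 0.25."""
        if state["holding"]:
            return [(state, 1.0)]                # no effect if already holding
        success = dict(state, holding=True,  on_table=False)
        failure = dict(state, holding=False, on_table=True)
        return [(success, 0.75), (failure, 0.25)]

    # Explicit (state-based) transition table, of the kind a backup state-based
    # translator builds; an ADD-based translation would instead keep this
    # transition function factored over the individual variables.
    transition_table = {}
    for values in product([False, True], repeat=len(VARIABLES)):
        state = dict(zip(VARIABLES, values))
        key = tuple(sorted(state.items()))
        transition_table[key] = [(tuple(sorted(s.items())), p)
                                 for s, p in pickup_outcomes(state)]

    for key, outcomes in transition_table.items():
        print(key, "->", outcomes)

The explicit table grows with the number of reachable states, whereas the factored form grows with the number of variables, which is precisely why a practical ADD translation was so important to us.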
The version of NMRDPP entered in the competition did the following:
1. Attempt to obtain a translation into ADDs using MTBDD and,
if that proves infeasible, abort it and rely on the state-based
translator instead.
2. Run FLTL expansion of the state space, taking search control
knowledge into account when available. Interrupt the expansion after 10 minutes if it is not complete.
3. Run value iteration to convergence. If this fails to yield
any useful result (e.g. because the expansion was not complete enough to
even reach a goal state), go back to Step 2.
4. Run as many of the 30 trials as possible in the remaining
time, following the generated policy where defined and falling back on the
non-deterministic search control policy when available.
With Step 1 we were trying to maximise the number of instances in which the
original ADD-based NMRDPP version could be run intact. In Step 3, it was
decided not to use LAO* because, when run without a good heuristic, it often
incurs a significant overhead compared to value iteration.
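For concreteness, the following is a generic, textbook-style sketch of the value iteration used in Step 3, operating on an explicit state-based model of the kind produced by the backup translator. It is not NMRDPP code, and the two-state MDP at the end is invented purely to make the snippet self-contained.

    def value_iteration(states, actions, transitions, reward, gamma=0.95, eps=1e-6):
        """transitions[s][a] is a list of (next_state, probability) pairs and
        reward[s] is the reward collected in state s."""
        value = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(sum(p * value[s2] for s2, p in transitions[s][a])
                           for a in actions(s))
                new_v = reward[s] + gamma * best
                delta = max(delta, abs(new_v - value[s]))
                value[s] = new_v
            if delta < eps:
                break
        # Extract a greedy policy from the converged value function.
        policy = {s: max(actions(s),
                         key=lambda a: sum(p * value[s2]
                                           for s2, p in transitions[s][a]))
                  for s in states}
        return value, policy

    # Toy two-state example: from s0, action 'go' reaches the goal g with
    # probability 0.8 and slips back to s0 with probability 0.2.
    states = ["s0", "g"]
    transitions = {"s0": {"go": [("g", 0.8), ("s0", 0.2)], "wait": [("s0", 1.0)]},
                   "g":  {"stay": [("g", 1.0)]}}
    reward = {"s0": 0.0, "g": 1.0}
    actions = lambda s: list(transitions[s].keys())

    print(value_iteration(states, actions, transitions, reward))

Unlike LAO*, this sweeps over all states of the expanded fragment at every iteration, which is exactly why it does not need a heuristic to guide it.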
The problems featured in the competition can be classified into
goal-based or reward-based problems. In goal-based problems, a
(positive) reward is only received when a goal state is reached. In
reward-based problems, action performance may also incur a (usually
negative) reward. Another orthogonal distinction can be made between
problems from domains that were not communicated in advance to the
participants and those from domains that were. The latter
consisted of variants of blocks world and logistics (or box world)
problems, and gave the participating planners
an opportunity to exploit knowledge of the domain,
much as in the hand-coded deterministic planning track.
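As a toy illustration of the difference (our own example, with invented reward values, not taken from the competition suite): a goal-based problem rewards only goal achievement, while a reward-based problem additionally charges for the actions performed.

    def goal_based_reward(state, is_goal):
        # Positive reward only upon reaching a goal state.
        return 500.0 if is_goal(state) else 0.0

    def reward_based_reward(state, action, is_goal, action_cost):
        # Goal reward as before, plus a (usually negative) reward per action.
        return (500.0 if is_goal(state) else 0.0) - action_cost[action]

    is_goal = lambda s: s == "goal"
    action_cost = {"move": 1.0, "noop": 0.0}
    print(goal_based_reward("goal", is_goal))                              # 500.0
    print(reward_based_reward("elsewhere", "move", is_goal, action_cost))  # -1.0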
We decided to enroll NMRDPP in a control-knowledge mode and in a
domain-independent mode. The only difference between the two modes is
that the first uses FLTL search control knowledge written for the
known domains as additional input. Our main concern in writing the
control knowledge was to achieve a reasonable compromise between the
size and effectiveness of the formulae. For the blocks world domain,
in which the two actions pickup-from and putdown-to had a 25% chance
of dropping the block onto the table, the control knowledge we used
encoded a variant of the well-known GN1 near-optimal strategy for
deterministic blocks world planning [42]:
whenever possible, try putting a clear block in its goal position,
otherwise put an arbitrary clear block on the table. Because blocks
get dropped on the table whenever an action fails, and because the
success probabilities and rewards are identical across actions,
optimal policies for the problem are essentially made up of optimal
sequences of actions for the deterministic blocks world, and there was
little need for a more sophisticated strategy. In the colored
blocks world domain, where several blocks can share the same color and
the goal only refers to the color of the blocks, the control knowledge
selected an arbitrary goal state of the non-colored blocks world
consistent with the colored goal specification, and then used the same
strategy as for the non-colored blocks world. The performance of this
strategy depends entirely on the goal state selected and can therefore
be arbitrarily bad.
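The following is our own Python rendering of the strategy described above, given for illustration only: in NMRDPP it was written as an FLTL control formula, which we do not reproduce here, and the state encoding below (a mapping from each block to what it sits on) is invented.

    def clear_blocks(on):
        """on maps each block to what it sits on ('table' or another block)."""
        return [b for b in on if not any(below == b for below in on.values())]

    def in_final_position(block, on, goal):
        """A block is finally placed if it and everything under it match the goal."""
        while True:
            if on[block] != goal[block]:
                return False
            if on[block] == "table":
                return True
            block = on[block]

    def gn1_choice(on, goal):
        """Return the next move as (block, destination), or None if done."""
        for b in clear_blocks(on):
            dest = goal[b]
            # Move b straight to its goal position if that position is ready.
            if on[b] != dest and (dest == "table" or
                                  (dest in clear_blocks(on) and
                                   in_final_position(dest, on, goal))):
                return (b, dest)
        # Otherwise, put some misplaced clear block on the table.
        for b in clear_blocks(on):
            if on[b] != "table" and not in_final_position(b, on, goal):
                return (b, "table")
        return None

    # Tiny example: blocks a, b, c all on the table; goal is the tower a-on-b-on-c.
    on   = {"a": "table", "b": "table", "c": "table"}
    goal = {"a": "b", "b": "c", "c": "table"}
    while (move := gn1_choice(on, goal)) is not None:
        block, dest = move
        on[block] = dest
        print(f"put {block} on {dest}")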
Logistics problems from IPC-2 distinguish between airports and
other locations within a city; trucks can drive between any two
locations in a city and planes can fly between any two airports. In
contrast, the box world only features cities, some of which have an
airport, some of which are only accessible by truck. A priori, the map
of the truck and plane connections is arbitrary. The goal is to get
packages from their city of origin to their city of
destination. Moving by truck has a 20% chance of resulting in reaching
one of the three cities closest to the departure city rather than the
intended one. The size of the box world search space turned out to be
quite challenging for NMRDPP. Therefore, when writing search control
knowledge, we gave up any optimality consideration and favored maximal
pruning. We were helped by the fact that the box world generator
produces problems with the following structure: cities are divided
into clusters, each of which contains at least one airport city.
Furthermore, each cluster has at least one Hamiltonian circuit which
trucks can follow. The control knowledge we used forced all planes
but one, and all trucks but one in each cluster to be idle. In each
cluster, the truck allowed to move could only attempt driving along
the chosen Hamiltonian circuit, picking up and dropping parcels as it
went.
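Again for illustration only, here is a Python paraphrase of that restriction as a predicate over candidate actions. The data layout (a designated plane, a designated truck per cluster, and each truck's circuit given as a successor map over cities) is our own invention; the actual control knowledge was an FLTL formula.

    def allowed(action, designated):
        """designated maps 'plane' to the single active plane, each cluster to its
        single active truck, and 'circuit' to each active truck's Hamiltonian
        circuit, given as a dict from city to the next city on the circuit."""
        kind, vehicle = action["kind"], action["vehicle"]
        if kind == "fly":
            return vehicle == designated["plane"]
        if kind == "drive":
            if vehicle != designated[action["cluster"]]:
                return False                     # only the chosen truck may move
            circuit = designated["circuit"][vehicle]
            return circuit[action["from"]] == action["to"]   # stay on the circuit
        return True                              # loading/unloading is unrestricted

    # Example: in cluster c1 only truck t1 moves, and only along a -> b -> c -> a.
    designated = {"plane": "p1", "c1": "t1",
                  "circuit": {"t1": {"a": "b", "b": "c", "c": "a"}}}
    print(allowed({"kind": "drive", "vehicle": "t1", "cluster": "c1",
                   "from": "a", "to": "b"}, designated))      # True
    print(allowed({"kind": "drive", "vehicle": "t2", "cluster": "c1",
                   "from": "a", "to": "b"}, designated))      # False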
Table 2:
Competition Participants. Domain-specific planners are starred
Table 3:
Results for Goal-Based Problems.
Domain-specific planners are starred. Entries are the percentage of
runs in which the goal was reached. A blank indicates that the
planner was unable to attempt the problem. A -- indicates that the
planner attempted the problem but was never able to achieve the
goal. A ? indicates that the result is unavailable (due to
a bug in the evaluation software, a couple of the results initially announced
were found to be invalid).
Table 4:
Results for Reward-Based Problems.
Domain-specific planners are starred. Entries are the average reward
achieved over the 30 runs. A blank indicates that the planner was
unable to attempt the problem. A -- indicates that the planner
attempted the problem but did not achieve a strictly positive reward.
A ? indicates that the result is unavailable.
The planners participating in the competition are shown in
Table 2. Planners
E, G2, J1, and J2 are domain-specific: either they are
tuned for the blocks and box worlds, or they use domain-specific search
control knowledge, or they learn from examples. The other participating
planners are domain-independent.
Tables 3 and 4 show the
results of the competition, which we extracted from the competition
overview paper [51] and from the competition web
site http://www.cs.rutgers.edu/~mlittman/topics/ipc04-pt/. The
first of those tables concerns goal-based problems and the second the
reward-based problems. The entries in the tables represent the
goal-achievement percentage or average reward achieved by the various
planner versions (left-column) on the various problems (top two
rows). Planners in the top part of the tables are domain-specific.
Problems from the known domains lie on the left-hand side of the
tables. The colored blocks world problems are bw-c-nr
(goal-based version) and bw-c-r (reward version) with 5, 8, and
11 blocks. The non-colored blocks world problems are bw-nc-nr
(goal-based version) with 8 blocks, and bw-nc-r (reward-based
version) with 5, 8, 11, 15, 18, and 21 blocks. The box world problems
are bx-nr (goal-based) and bx-r (reward-based), with 5 or
10 cities and 10 or 15 boxes. Problems from the unknown domains lie on
the right-hand side of the tables. They comprise: expl-bw, an
exploding version of the 11-block blocks world problem in which
putting down a block may destroy the object it is put on; zeno,
a probabilistic variant of a zeno travel domain problem from IPC-3
with 1 plane, 2 persons, 3 cities and 7 fuel levels; hanoise, a
probabilistic variant of the tower of Hanoi problem with 5 disks and 3
rods; file, a problem of putting 30 files in 5 randomly chosen
folders; and tire, a variant of the tire world problem with 30
cities and spare tires at 4 of them, where a tire may go flat while
driving.
Our planner NMRDPP, in its G1 or G2 versions, was able to attempt all
problems, achieving a strictly positive reward in all but 4 of them.
Not even FF (J3), the overall winner of the competition, was able to
successfully attempt that many problems. NMRDPP performed
particularly well on goal-based problems, achieving the goal in 100%
of the runs except in expl-bw, hanoise, and tire-nr
(note that for these three problems, the goal achievement probability
of the optimal policy does not exceed 65%). No other planner
outperformed NMRDPP in this respect. As pointed out before, FF behaves well on the probabilistic versions of the blocks and box worlds
because the optimal policies are very close to those for the
deterministic problems; Hoffmann [30] analyses
the reasons why the FF heuristic works well for traditional
planning benchmarks such as blocks world and logistics. On the other
hand, FF is unable to solve the unknown problems, which have a
different structure and require more substantial probabilistic
reasoning, although these problems are easily solved by a number of
participating planners. As expected, there is a large discrepancy
between the version of NMRDPP allowed to use search control (G2) and
the domain-independent version (G1). While the latter performs
reasonably well on the unknown goal-based domains, it is not able to solve any of
the known ones. In fact, except for FF, none of the
participating domain-independent planners were able to solve these
problems.
In the reward-based case, NMRDPP with control knowledge behaves well
on the known problems. Only the human-encoded policies (J1) performed
better. Without control knowledge, NMRDPP is unable to scale on those
problems, while other participants such as FF and mGPT are. Furthermore, NMRDPP appears to perform poorly on the two unknown
problems. In both cases, this might be due to the fact that it fails
to generate an optimal policy: suboptimal policies easily have a high
negative score in these domains (see [51]). For tire-r, we know that NMRDPP did indeed generate a suboptimal
policy. Additionally, it could be that NMRDPP was unlucky with the
sampling-based policy evaluation process: in tire-r in
particular, there was high variance between the costs of the various
trajectories under the optimal policy.
Altogether, the competition results suggest that control knowledge is
likely to be essential when solving larger problems (Markovian or not)
with NMRDPP, and that, as has been observed with deterministic
planners, approaches making use of control knowledge are quite
powerful.