NMRDPP in the Probabilistic Planning Competition
We now report on the behaviour of NMRDPP in the probabilistic track
of the 4th International Planning Competition (IPC-4).
Since the competition did not feature non-Markovian rewards,
our original motivation in taking part was to further
compare the solution methods implemented in NMRDPP in a Markovian setting. In setting this objective, we
largely underestimated the challenges involved in merely getting a planner ready
for a competition, especially when that competition is the first of its kind.
In the end, we decided that successfully preparing NMRDPP to attempt all problems in the competition using one solution
method (and possibly search control knowledge) would be an
honorable result.
The most crucial problem we encountered was the translation of PPDDL
[50], the probabilistic variant of PDDL used as the
input language for the competition, into NMRDPP's ADD-based input
language. While translating PPDDL into ADDs is possible in theory,
devising a translation which is practical enough for the needs of the
competition (a small number of variables; small, quickly generated, and
easily manipulable ADDs) is another matter. MTBDD, the translator
kindly made available to participants by the competition organisers,
was not always able to achieve the required efficiency. At other
times, the translation was quick but NMRDPP was unable to use the
generated ADDs efficiently. Consequently, we implemented a state-based
translator on top of the PDDL parser as a backup, and opted for a
state-based solution method since it did not rely on ADDs and could
operate with both translators.
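To make the representational issue concrete, the sketch below (written in Python for this discussion; the toy pickup action and the variable names are invented, and it is not NMRDPP's actual translator) contrasts the explicit enumeration of transition probabilities that a state-based translation produces with the factored, variable-indexed transition function an ADD-based translation would maintain.

    from itertools import product

    VARIABLES = ["holding", "on_table"]          # toy boolean state variables

    def pickup_outcomes(state):
        """Toy probabilistic 'pickup' action: succeeds with probability 0.75,
        drops the block back onto the table with probability 0.25."""
        if state["holding"]:
            return [(state, 1.0)]                # no effect if already holding
        success = dict(state, holding=True,  on_table=False)
        failure = dict(state, holding=False, on_table=True)
        return [(success, 0.75), (failure, 0.25)]

    # Explicit (state-based) transition table, of the kind a backup state-based
    # translator builds; an ADD-based translation would instead keep this
    # transition function factored over the individual variables.
    transition_table = {}
    for values in product([False, True], repeat=len(VARIABLES)):
        state = dict(zip(VARIABLES, values))
        key = tuple(sorted(state.items()))
        transition_table[key] = [(tuple(sorted(s.items())), p)
                                 for s, p in pickup_outcomes(state)]

    for key, outcomes in transition_table.items():
        print(key, "->", outcomes)

The explicit table grows with the number of reachable states, whereas the factored form grows with the number of variables, which is precisely why a practical ADD translation was so important to us.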
The version of NMRDPP entered in the competition did the following:
1. Attempt to obtain a translation into ADDs using MTBDD and,
if that proves infeasible, abort it and rely on the state-based
translator instead.
2. Run FLTL expansion of the state space, taking search control
knowledge into account when available. Interrupt the expansion after 10 minutes if it is not complete.
3. Run value iteration to convergence. If this fails to yield
any useful result (e.g. because the expansion was not complete enough to
even reach a goal state), go back to Step 2.
4. Run as many of the 30 trials as possible in the remaining
time, following the generated policy where defined and falling back on the
non-deterministic search control policy when available.
With Step 1 we were trying to maximise the number of instances in which the
original ADD-based NMRDPP version could be run intact. In Step 3, it was
decided not to use LAO* because, when run without a good heuristic, it often
incurs a significant overhead compared to value iteration.
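For concreteness, the following is a generic, textbook-style sketch of the value iteration used in Step 3, operating on an explicit state-based model of the kind produced by the backup translator. It is not NMRDPP code, and the two-state MDP at the end is invented purely to make the snippet self-contained.

    def value_iteration(states, actions, transitions, reward, gamma=0.95, eps=1e-6):
        """transitions[s][a] is a list of (next_state, probability) pairs and
        reward[s] is the reward collected in state s."""
        value = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(sum(p * value[s2] for s2, p in transitions[s][a])
                           for a in actions(s))
                new_v = reward[s] + gamma * best
                delta = max(delta, abs(new_v - value[s]))
                value[s] = new_v
            if delta < eps:
                break
        # Extract a greedy policy from the converged value function.
        policy = {s: max(actions(s),
                         key=lambda a: sum(p * value[s2]
                                           for s2, p in transitions[s][a]))
                  for s in states}
        return value, policy

    # Toy two-state example: from s0, action 'go' reaches the goal g with
    # probability 0.8 and slips back to s0 with probability 0.2.
    states = ["s0", "g"]
    transitions = {"s0": {"go": [("g", 0.8), ("s0", 0.2)], "wait": [("s0", 1.0)]},
                   "g":  {"stay": [("g", 1.0)]}}
    reward = {"s0": 0.0, "g": 1.0}
    actions = lambda s: list(transitions[s].keys())

    print(value_iteration(states, actions, transitions, reward))

Unlike LAO*, this sweeps over all states of the expanded fragment at every iteration, which is exactly why it does not need a heuristic to guide it.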
The problems featured in the competition can be classified into
goal-based or reward-based problems. In goal-based problems, a
(positive) reward is only received when a goal state is reached. In
reward-based problems, action performance may also incur a (usually
negative) reward. Another orthogonal distinction can be made between
problems from domains that were not communicated in advance to the
participants and those from domains that were. The latter
consisted of variants of blocks world and logistics (or box world)
problems, and gave the participating planners
an opportunity to exploit knowledge of the domain,
much as in the hand-coded deterministic planning track.
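As a toy illustration of the difference (our own example, with invented reward values, not taken from the competition suite): a goal-based problem rewards only goal achievement, while a reward-based problem additionally charges for the actions performed.

    def goal_based_reward(state, is_goal):
        # Positive reward only upon reaching a goal state.
        return 500.0 if is_goal(state) else 0.0

    def reward_based_reward(state, action, is_goal, action_cost):
        # Goal reward as before, plus a (usually negative) reward per action.
        return (500.0 if is_goal(state) else 0.0) - action_cost[action]

    is_goal = lambda s: s == "goal"
    action_cost = {"move": 1.0, "noop": 0.0}
    print(goal_based_reward("goal", is_goal))                              # 500.0
    print(reward_based_reward("elsewhere", "move", is_goal, action_cost))  # -1.0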
We decided to enroll NMRDPP in a control-knowledge mode and in a
domain-independent mode. The only difference between the two modes is
that the first uses FLTL search control knowledge written for the
known domains as additional input. Our main concern in writing the
control knowledge was to achieve a reasonable compromise between the
size and effectiveness of the formulae. For the blocks world domain,
in which the two actions pickup-from and putdown-to had a 25% chance
of dropping the block onto the table, the control knowledge we used
encoded a variant of the well-known GN1 near-optimal strategy for
deterministic blocks world planning [42]:
whenever possible, try putting a clear block in its goal position,
otherwise put an arbitrary clear block on the table. Because blocks
get dropped on the table whenever an action fails, and because the
success probabilities and rewards are identical across actions,
optimal policies for the problem are essentially made up of optimal
sequences of actions for the deterministic blocks world, and there was
little need for a more sophisticated strategy. In the colored
blocks world domain, where several blocks can share the same color and
the goal only refers to the color of the blocks, the control knowledge
selected an arbitrary goal state of the non-colored blocks world
consistent with the colored goal specification, and then used the same
strategy as for the non-colored blocks world. The performance of this
strategy depends entirely on the goal state selected and can therefore
be arbitrarily bad.
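The following is our own Python rendering of the strategy described above, given for illustration only: in NMRDPP it was written as an FLTL control formula, which we do not reproduce here, and the state encoding below (a mapping from each block to what it sits on) is invented.

    def clear_blocks(on):
        """on maps each block to what it sits on ('table' or another block)."""
        return [b for b in on if not any(below == b for below in on.values())]

    def in_final_position(block, on, goal):
        """A block is finally placed if it and everything under it match the goal."""
        while True:
            if on[block] != goal[block]:
                return False
            if on[block] == "table":
                return True
            block = on[block]

    def gn1_choice(on, goal):
        """Return the next move as (block, destination), or None if done."""
        for b in clear_blocks(on):
            dest = goal[b]
            # Move b straight to its goal position if that position is ready.
            if on[b] != dest and (dest == "table" or
                                  (dest in clear_blocks(on) and
                                   in_final_position(dest, on, goal))):
                return (b, dest)
        # Otherwise, put some misplaced clear block on the table.
        for b in clear_blocks(on):
            if on[b] != "table" and not in_final_position(b, on, goal):
                return (b, "table")
        return None

    # Tiny example: blocks a, b, c all on the table; goal is the tower a-on-b-on-c.
    on   = {"a": "table", "b": "table", "c": "table"}
    goal = {"a": "b", "b": "c", "c": "table"}
    while (move := gn1_choice(on, goal)) is not None:
        block, dest = move
        on[block] = dest
        print(f"put {block} on {dest}")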
Logistics problems from IPC-2 distinguish between airports and
other locations within a city; trucks can drive between any two
locations in a city and planes can fly between any two airports. In
contrast, the box world only features cities, some of which have an
airport, some of which are only accessible by truck. A priori, the map
of the truck and plane connections is arbitrary. The goal is to get
packages from their city of origin to their city of
destination. Moving by truck has a 20% chance of resulting in reaching
one of the three cities closest to the departure city rather than the
intended one. The size of the box world search space turned out to be
quite challenging for NMRDPP. Therefore, when writing search control
knowledge, we gave up any optimality consideration and favored maximal
pruning. We were helped by the fact that the box world generator
produces problems with the following structure: cities are divided
into clusters, each of which contains at least one airport city.
Furthermore, each cluster has at least one Hamiltonian circuit which
trucks can follow. The control knowledge we used forced all planes
but one, and all trucks but one in each cluster to be idle. In each
cluster, the truck allowed to move could only attempt driving along
the chosen Hamiltonian circuit, picking up and dropping parcels as it
went.
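Again for illustration only, here is a Python paraphrase of that restriction as a predicate over candidate actions. The data layout (a designated plane, a designated truck per cluster, and each truck's circuit given as a successor map over cities) is our own invention; the actual control knowledge was an FLTL formula.

    def allowed(action, designated):
        """designated maps 'plane' to the single active plane, each cluster to its
        single active truck, and 'circuit' to each active truck's Hamiltonian
        circuit, given as a dict from city to the next city on the circuit."""
        kind, vehicle = action["kind"], action["vehicle"]
        if kind == "fly":
            return vehicle == designated["plane"]
        if kind == "drive":
            if vehicle != designated[action["cluster"]]:
                return False                     # only the chosen truck may move
            circuit = designated["circuit"][vehicle]
            return circuit[action["from"]] == action["to"]   # stay on the circuit
        return True                              # loading/unloading is unrestricted

    # Example: in cluster c1 only truck t1 moves, and only along a -> b -> c -> a.
    designated = {"plane": "p1", "c1": "t1",
                  "circuit": {"t1": {"a": "b", "b": "c", "c": "a"}}}
    print(allowed({"kind": "drive", "vehicle": "t1", "cluster": "c1",
                   "from": "a", "to": "b"}, designated))      # True
    print(allowed({"kind": "drive", "vehicle": "t2", "cluster": "c1",
                   "from": "a", "to": "b"}, designated))      # False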
Table 2:
Competition Participants. Domain-specific planners are starred
Table 3:
Results for Goal-Based Problems.
Domain-specific planners are starred. Entries are the percentage of
runs in which the goal was reached. A blank indicates that the
planner was unable to attempt the problem. A -- indicates that the
planner attempted the problem but was never able to achieve the
goal. A ? indicates that the result is unavailable (due to
a bug in the evaluation software, a couple of the results initially announced
were found to be invalid).
Table 4:
Results for Reward-Based Problems.
Domain-specific planners are starred. Entries are the average reward
achieved over the 30 runs. A blank indicates that the planner was
unable to attempt the problem. A -- indicates that the planner
attempted the problem but did not achieve a strictly positive reward.
A ? indicates that the result is unavailable.
The planners participating in the competition are shown in
Table 2. Planners
E, G2, J1, and J2 are domain-specific: either they are
tuned for the blocks and box worlds, or they use domain-specific search
control knowledge, or they learn from examples. The other participating
planners are domain-independent.
Tables 3 and 4 show the
results of the competition, which we extracted from the competition
overview paper [51] and from the competition web
site http://www.cs.rutgers.edu/~mlittman/topics/ipc04-pt/. The
first of those tables concerns goal-based problems and the second the
reward-based problems. The entries in the tables represent the
goal-achievement percentage or average reward achieved by the various
planner versions (left-column) on the various problems (top two
rows). Planners in the top part of the tables are domain-specific.
Problems from the known domains lie on the left-hand side of the
tables. The colored blocks world problems are bw-c-nr
(goal-based version) and bw-c-r (reward version) with 5, 8, and
11 blocks. The non-colored blocks world problems are bw-nc-nr
(goal-based version) with 8 blocks, and bw-nc-r (reward-based
version) with 5, 8, 11, 15, 18, and 21 blocks. The box world problems
are bx-nr (goal-based) and bx-r (reward-based), with 5 or
10 cities and 10 or 15 boxes. Problems from the unknown domains lie on
the right-hand side of the tables. They comprise: expl-bw, an
exploding version of the 11-block blocks world problem in which
putting down a block may destroy the object it is put on; zeno,
a probabilistic variant of a zeno travel domain problem from IPC-3
with 1 plane, 2 persons, 3 cities and 7 fuel levels; hanoise, a
probabilistic variant of the tower of Hanoi problem with 5 disks and 3
rods; file, a problem of putting 30 files in 5 randomly chosen
folders; and tire, a variant of the tire world problem with 30
cities and spare tires at 4 of them, where a tire may go flat while
driving.
Our planner NMRDPP, in its G1 or G2 versions, was able to attempt all
problems, achieving a strictly positive reward in all but 4 of them.
Not even FF (J3), the overall winner of the competition, was able to
successfully attempt that many problems. NMRDPP performed
particularly well on goal-based problems, achieving the goal in 100%
of the runs except in expl-bw, hanoise, and tire-nr
(note that for these three problems, the goal achievement probability
of the optimal policy does not exceed 65%). No other planner
outperformed NMRDPP in this respect. As pointed out before, FF behaves well on the probabilistic versions of the blocks and box worlds
because the optimal policies are very close to those for the
deterministic problems; Hoffmann [30] analyses
the reasons why the FF heuristic works well for traditional
planning benchmarks such as blocks world and logistics. On the other
hand, FF is unable to solve the unknown problems, which have a
different structure and require more substantial probabilistic
reasoning, although these problems are easily solved by a number of
participating planners. As expected, there is a large discrepancy
between the version of NMRDPP allowed to use search control (G2) and
the domain-independent version (G1). While the latter performs
reasonably well on the unknown goal-based domains, it is not able to solve any of
the known ones. In fact, except for FF, none of the
participating domain-independent planners were able to solve these
problems.
In the reward-based case, NMRDPP with control knowledge behaves well
on the known problems. Only the human-encoded policies (J1) performed
better. Without control knowledge, NMRDPP is unable to scale on those
problems, while other participants such as FF and mGPT are. Furthermore, NMRDPP appears to perform poorly on the two unknown
problems. In both cases, this might be due to the fact that it fails
to generate an optimal policy: suboptimal policies easily have a high
negative score in these domains (see [51]). For tire-r, we know that NMRDPP did indeed generate a suboptimal
policy. Additionally, it could be that NMRDPP was unlucky with the
sampling-based policy evaluation process: in tire-r in
particular, there was high variance between the costs of the various
trajectories under the optimal policy.
Altogether, the competition results suggest that control knowledge is
likely to be essential when solving larger problems (Markovian or not)
with NMRDPP, and that, as has been observed with deterministic
planners, approaches making use of control knowledge are quite
powerful.