Markov decision processes (MDPs) are now widely accepted as the
preferred model for decision-theoretic planning problems
[11]. The fundamental assumption behind the MDP
formulation is that not only the system dynamics but also the reward
function is Markovian. Therefore, all information needed to
determine the reward at a given state must be encoded in the state
itself.
This requirement is not always easy to meet for planning problems, as
many desirable behaviours are naturally expressed as properties of
execution sequences; see,
e.g., [19,27,4,40].
Typical cases include rewards for the maintenance of some property,
for the periodic achievement of some goal, for the achievement of a
goal within a given number of steps of the request being made, or even
simply for the very first achievement of a goal which becomes
irrelevant afterwards.
For instance, consider a health care robot that assists elderly or
disabled people by achieving simple goals such as reminding them to do
important tasks (e.g. taking a pill), entertaining them, checking or
transporting objects for them (e.g. checking the stove's temperature
or bringing coffee), escorting them, or searching (e.g. for glasses or
for the nurse) [14]. In this domain, we might
want to reward the robot for making sure a given patient takes his
pill exactly once every 8 hours (and penalise it if it fails to
prevent the patient from taking it more than once within this time
frame!). We might also reward it for repeatedly visiting all rooms in
the ward in a given order and reporting any problem it detects, and it
may receive a reward once for each patient request answered within
the appropriate time frame, etc.
Another example is the elevator control domain
[35], in which an elevator must get passengers
from their origin to their destination as efficiently as possible,
while attempting to satisfy a range of other conditions such as
providing priority services to critical customers. In this domain,
some trajectories of the elevator are more desirable than others,
which makes it natural to encode the problem by assigning rewards to
those trajectories.
A decision process in which rewards depend on
the sequence of states passed through rather than merely on the
current state is called a decision process with non-Markovian
rewards (NMRDP) [2].
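To make the distinction concrete, here is a minimal sketch in Python, assuming a hypothetical goal state labelled "g"; it is an illustration only, not part of the formal development in this paper. The reward is a function of the entire history of states visited, paying off only for the very first achievement of the goal:

def non_markovian_reward(history):
    # Reward 1.0 the first time the (hypothetical) goal state "g" is
    # reached; later visits earn nothing, so the value depends on the
    # whole history rather than on the current state alone.
    if history and history[-1] == "g" and "g" not in history[:-1]:
        return 1.0
    return 0.0

No single state carries enough information to compute this reward: reaching "g" may be worth 1.0 or 0.0 depending on what happened earlier.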
A difficulty with NMRDPs is that the most efficient MDP solution
methods do not directly apply to them. The traditional way to
circumvent this problem is to formulate the NMRDP as an equivalent
MDP, whose states result from augmenting those of the original NMRDP
with extra information capturing enough history to make the reward
Markovian. Hand-crafting such an MDP can, however, be very difficult in
general. This is exacerbated by the fact that the size of the MDP
impacts the effectiveness of many solution methods. Therefore,
there has been interest in automating the translation into an MDP,
starting from a natural specification of non-Markovian rewards and of
the system's dynamics [2,3]. This is
the problem we focus on.
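Continuing the hypothetical sketch above, the augmentation idea can be illustrated as follows: a single extra bit of history, recording whether the goal has already been achieved, is enough to make the "first achievement" reward Markovian over the augmented state space. Again, this is only an illustrative sketch, not the translation studied in this paper.

def augment(state, achieved=False):
    # An augmented state pairs the original state with the extra piece
    # of history needed (here, whether "g" has already been achieved).
    return (state, achieved)

def augmented_successor(aug_state, next_state):
    # History bookkeeping: the flag is set once "g" is first reached.
    _, achieved = aug_state
    return (next_state, achieved or next_state == "g")

def markovian_reward(aug_state, next_state):
    # The reward now depends only on the current augmented state and the
    # successor state, exactly as the MDP formulation requires.
    _, achieved = aug_state
    return 1.0 if next_state == "g" and not achieved else 0.0

On the trajectory s0, g, g, for instance, the augmented process passes through (s0, False), (g, True), (g, True) and collects the reward exactly once, in agreement with the non-Markovian specification.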