Conclusion, Related, and Future Work
In this paper, we have examined the problem of solving decision
processes with non-Markovian rewards. We have described existing
approaches which exploit a compact representation of the reward
function to automatically translate the NMRDP into an equivalent
process amenable to MDP solution methods. The computational model
underlying this framework can be traced back to work on the
relationship between linear temporal logic and automata in the areas
of automated verification and model-checking
[48,49]. While remaining in this framework, we
have proposed a new representation of non-Markovian reward functions
and a translation into MDPs aimed at making the best possible use of
state-based anytime heuristic search as the solution method. Our
representation extends future linear temporal logic to express
rewards. Our translation has the effect of embedding model-checking in
the solution method. It results in an MDP of the minimal size
achievable without stepping outside the anytime framework, and
consequently in better policies by the deadline. We have described
NMRDPP, a software platform that implements these approaches under a
common interface and has proved a useful tool in their experimental
analysis. Both the system and the analysis are the first of their
kind. We were able to identify a number of general trends in the
behaviours of the methods and to provide advice as to which are
best suited to which circumstances. For obvious reasons, our
analysis has focused on artificial domains. Additional work should
examine a wider range of domains of more practical interest, to
see what form these results take in that context. Ultimately,
we would like our analysis to help NMRDPP automatically select
the most appropriate method. Unfortunately,
because of the difficulty of translating between PLTL and $FLTL,
it is likely that NMRDPP would still have to maintain both a
PLTL and a $FLTL version of the reward formulae.
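
As a rough illustration of the progression machinery that our translation
embeds in the solution method, the sketch below implements standard formula
progression for a small future linear temporal logic fragment, treating the
reward symbol $ as an ordinary atom that is true exactly when a reward is
received in the current state. This is only a sketch under those simplifying
assumptions, not NMRDPP's implementation nor the exact $FLTL progression
algorithm; the helper constructors and the example formula (``a reward is
received the first time p holds'') are ours.

    # Illustrative sketch only: LTL-style formula progression with the reward
    # symbol '$' treated as an atom that is true exactly when a reward is
    # received in the current state.  Formulae are nested tuples:
    #   ('prop', name), ('not', f), ('and', f, g), ('or', f, g),
    #   ('next', f), ('until', f, g), plus the constants True and False.

    def progress(formula, state):
        """Progress `formula` through `state`, a set of true proposition names."""
        if formula is True or formula is False:
            return formula
        op = formula[0]
        if op == 'prop':                      # includes the reward atom '$'
            return formula[1] in state
        if op == 'not':
            return neg(progress(formula[1], state))
        if op == 'and':
            return conj(progress(formula[1], state), progress(formula[2], state))
        if op == 'or':
            return disj(progress(formula[1], state), progress(formula[2], state))
        if op == 'next':                      # X f  progresses to  f
            return formula[1]
        if op == 'until':                     # f U g  ->  g  or  (f and X(f U g))
            f, g = formula[1], formula[2]
            return disj(progress(g, state), conj(progress(f, state), formula))
        raise ValueError("unknown operator: %s" % op)

    # Simplifying constructors keep progressed formulae small.
    def neg(f):
        return (not f) if isinstance(f, bool) else ('not', f)

    def conj(f, g):
        if f is False or g is False: return False
        if f is True: return g
        if g is True: return f
        return ('and', f, g)

    def disj(f, g):
        if f is True or g is True: return True
        if f is False: return g
        if g is False: return f
        return ('or', f, g)

    # ``A reward is received the first time p holds'':  (not p) U (p and $)
    first_p = ('until', ('not', ('prop', 'p')),
               ('and', ('prop', 'p'), ('prop', '$')))
    assert progress(first_p, {'p', '$'}) is True   # rewarding here is consistent
    assert progress(first_p, {'p'}) is False       # withholding the reward is not
    assert progress(first_p, set()) == first_p     # p not yet seen: carry formula over

The idea, roughly, is that expanded states are labelled with progressed
formulae, so that only distinctions relevant to future rewards need to be
tracked.
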
A detailed comparison of our approach to solving NMRDPs with existing
methods [2,3]
can be found in Sections 3.10 and 5. Two important aspects of future work would
help take the comparison further. One is to settle the question of
the appropriateness of our translation to structured solution methods.
Symbolic implementations of the solution methods we consider, e.g.
symbolic LAO* [21], as well as formula progression in
the context of symbolic state representations
[40] could be investigated for that
purpose. The other is to take advantage of the greater expressive
power of $FLTL to consider a richer class of decision processes, for
instance with uncertainty as to which rewards are received and when.
Many extensions of the language are possible: adding eventualities,
unrestricted negation, first-class reward propositions, quantitative
time, etc. Of course, dealing with them via progression without
backtracking is another matter.

We should investigate the precise relationship between our line of
work and recent work on planning for temporally extended goals in
non-deterministic domains. Of particular interest are `weak'
temporally extended goals such as those expressible in the Eagle
language [16], and temporally extended goals
expressible in
π-CTL* [7]. Eagle enables the
expression of attempted reachability and maintenance goals of the form
``try-reach p'' and ``try-maintain p'', which add to the goals
``do-reach p'' and ``do-maintain p'' already expressible in
CTL. The idea is that the generated policy should make every attempt
at satisfying proposition p. Furthermore, Eagle includes recovery
goals of the form ``p fail q'', meaning that goal q must
be achieved whenever goal p fails, and cyclic goals of the form
``repeat p'', meaning that p should be achieved cyclically until
it fails. The semantics of these goals is given in terms of variants
of Büchi tree automata with preferred transitions. Dal Lago et al. [16] present a planning algorithm based on
symbolic model-checking which generates policies achieving those
goals. Baral and Zhao [7] describe π-CTL*,
an alternative framework for expressing a subset of Eagle goals and a
variety of others. π-CTL* is a variant of CTL* which allows for
formulae involving two types of path quantifiers: quantifiers tied to
the paths feasible under the generated policy, as is usual, but also
quantifiers more generally tied to the paths feasible under any of the
domain actions. Baral and Zhao [7] do not
present any planning algorithm. It would be very interesting to know
whether Eagle and π-CTL* goals can be encoded as non-Markovian
rewards in our framework. An immediate consequence would be that
NMRDPP could be used to plan for them. More generally, we would
like to examine the respective merits of non-deterministic planning
for temporally extended goals and decision-theoretic planning with
non-Markovian rewards.
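
To illustrate π-CTL*'s distinction between the two kinds of path
quantifiers mentioned above, in our own notation (which need not match
that of [7]):

    $A_{\pi}\,\Diamond p$ : every trajectory consistent with the generated policy eventually reaches p
    $A\,\Diamond p$       : every trajectory feasible under the domain actions eventually reaches p, whatever the policy
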
In the pure probabilistic setting (no rewards), recent related
research includes work on planning and controller synthesis for
probabilistic temporally extended goals expressible in probabilistic
temporal logics such as CSL or PCTL
[52,6]. These logics can express
the requirement that the probability of the policy satisfying a given
temporal goal exceeds a given threshold. For instance, Younes
and Simmons [52] describe a very general
probabilistic planning framework, involving concurrency, continuous
time, and temporally extended goals, rich enough to model generalised
semi-Markov processes. The solution algorithms are not directly
comparable to those presented here.
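
As a concrete illustration of the kind of statement these logics support
(our rendering, not a formula taken from [52,6]), a PCTL-style requirement
that the policy eventually reach the goal with probability at least 0.9
can be written as

    $P_{\geq 0.9}\,[\,\Diamond\,\mathit{goal}\,]$

with CSL additionally offering, among other things, time-bounded versions
of such path formulae.
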
Another exciting area for future work is the investigation of temporal
logic formalisms for specifying heuristic functions for NMRDPs or more
generally for search problems with temporally extended goals. Good
heuristics are important to some of the solution methods we are
targeting, and surely their value ought to depend on history. The
methods we have described could be applicable to the description and
processing of such heuristics. Related to this is the problem of
extending search control knowledge to operate fully in the
presence of temporally extended goals, rewards, and stochastic
actions. A first issue is that branching or probabilistic logics such
as CTL or PCTL variants should be preferred to FLTL when describing
search control knowledge, because when stochastic actions are
involved, search control often needs to refer to some of the
possible futures and even to their probabilities. Another major problem is
that the GOALP modality, which is the key to the specification of
reusable search control knowledge, is interpreted with respect to a
fixed reachability goal [5], and as such is not applicable to domains
with temporally extended goals, let alone
rewards. Kabanza and Thiébaux [33]
present a first approach to
search control in the presence of temporally extended goals in
deterministic domains, but much remains to be done for a system like
NMRDPP to be able to support a meaningful extension of GOALP.
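
For instance, a typical TLPlan-style control rule (our paraphrase of the
kind of blocks-world rule used with this modality, not a formula quoted
from [5]) keeps a block that already sits in its goal position where it is:

    $\Box\,\forall x,y.\ \big((\mathit{on}(x,y) \wedge \mathrm{GOALP}(\mathit{on}(x,y))) \rightarrow \bigcirc\,\mathit{on}(x,y)\big)$

Such a rule is meaningful only because GOALP refers to a single fixed goal
description, which is precisely what is missing once the objective is a
temporally extended goal or a reward function over histories.
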
Finally, let us mention that related work in the area of databases
uses an approach similar to PLTLSTR to extend a database with
auxiliary relations containing sufficient information to check
temporal integrity constraints [15]. The issues are
somewhat different from those raised by NMRDPs: as there is only ever
one sequence of databases, what matters is the size of these
auxiliary relations rather than the avoidance of redundant distinctions.