Blind Minimality

The size of the MDP obtained, i.e. the number of e-states it contains is a key issue for us, as it has to be amenable to state-based solution methods. Ideally, we would like the MDP to be of minimal size. However, we do not know of a method building the minimal equivalent MDP incrementally, adding parts as required by the solution method. And since in the worst case even the minimal equivalent MDP can be larger than the NMRDP by a factor exponential in the length of the reward formulae [2], constructing it entirely would nullify the interest of anytime solution methods. However, as we now explain, Definition 5 leads to an equivalent MDP exhibiting a relaxed notion of minimality, and which is amenable to incremental construction. By inspection, we may observe that wherever an e-state $\langle s, \phi\rangle$ has a successor $\langle s',\phi'\rangle$ via action $a$, this means that in order to succeed in rewarding the behaviours described in $\phi$ by means of execution sequences that start by going from $s$ to $s'$ via $a$, it is necessary that the future starting with $s'$ succeeds in rewarding the behaviours described in $\phi'$. If $\langle s, \phi\rangle$ is in the minimal equivalent MDP, and if there really are such execution sequences succeeding in rewarding the behaviours described in $\phi$, then $\langle s',\phi'\rangle$ must also be in the minimal MDP. That is, construction by progression can only introduce e-states which are a priori needed. Note that an e-state that is a priori needed may not really be needed: there may in fact be no execution sequence using the available actions that exhibits a given behaviour. For instance, consider the response formula $\mbox{$\Box$}(p
\rightarrow (\raisebox{0.6mm}{$\scriptstyle \bigcirc$}^{k}q \rightarrow \raisebox{0.6mm}{$\scriptstyle \bigcirc$}^{k}\mbox{\$}))$, i.e., every time trigger $p$ is true, we will be rewarded $k$ steps later provided $q$ is true then. Obviously, whether $p$ is true at some stage affects the way future states should be rewarded. However, if the transition relation happens to have the property that $k$ steps from a state satisfying $p$, no state satisfying $q$ can be reached, then a posteriori $p$ is irrelevant, and there was no need to label e-states differently according to whether $p$ was true or not - observe an occurrence of this in the example in Figure 7, and how this leads FLTL to produce an extra state at the bottom left of the Figure. To detect such cases, we would have to look perhaps quite deep into feasible futures, which we cannot do while constructing the e-states on the fly. Hence the relaxed notion which we call blind minimality does not always coincide with absolute minimality. We now formalise the difference between true and blind minimality. For this purpose, it is convenient to define some functions $\rho$ and $\mu$ mapping e-states $e$ to functions from $S^*$ to $\mbox{I$\!$R}$ intuitively assigning rewards to sequences in the NMRDP starting from $\tau(e)$. Recall from Definition 1 that $\tau$ maps each e-state of the MDP to the underlying NMRDP state.

Definition 6   Let $D$ be an NMRDP. Let $S'$ be the set of e-states in an equivalent MDP $D'$ for $D$. Let $e$ be any reachable e-state in $S'$. Let $\Gamma'(i)$ be a sequence of e-states in $\widetilde{D'}(s'_0)$ such that $\Gamma'(i) = e$. Let $\Gamma(i)$ be the corresponding sequence in $\widetilde{D}(s_0)$ obtained under $\tau$ in the sense that, for each $j \leq i$, $\Gamma(j) = \tau(\Gamma'_j)$. Then for any $\Delta
\in S^*$, we define

\begin{displaymath}
\rho(e): \Delta \mapsto \left\{\begin{array}{ll}
R(\Gamma(...
..._0 = \Gamma_i$} \\
0 & \mbox{otherwise}
\end{array}\right.
\end{displaymath}

and

\begin{displaymath}
\mu(e): \Delta \mapsto \left\{\begin{array}{ll}
R(\Gamma(i...
...{D}(\Gamma_i)$} \\
0 & \mbox{otherwise}
\end{array}\right.
\end{displaymath}

For any unreachable e-state $e$, we define both $\rho(e)(\Delta)$ and $\mu(e)(\Delta)$ to be 0 for all $\Delta$.

Note carefully the difference between $\rho$ and $\mu$. The former describes the rewards assigned to all continuations of a given state sequence, while the latter confines rewards to feasible continuations. Note also that $\rho$ and $\mu$ are well-defined despite the indeterminacy in the choice of $\Gamma'(i)$, since by clause 4 of Definition 1, all such choices lead to the same values for $R$.

Theorem 3   Let $S'$ be the set of e-states in an equivalent MDP $D'$ for $D=\langle S,s_0,A,\Pr ,R\rangle$. $D'$ is minimal iff every e-state in $S'$ is reachable and $S'$ contains no two distinct e-states $s'_1$ and $s'_2$ with $\tau(s'_1) = \tau(s'_2)$ and $\mu(s'_1) = \mu(s'_2)$.

Proof: See Appendix B. $\Box$
Blind minimality is similar, except that, since there is no looking ahead, no distinction can be drawn between feasible trajectories and others in the future of $s$:

Definition 7   Let $S'$ be the set of e-states in an equivalent MDP $D'$ for $D=\langle S,s_0,A,\Pr ,R\rangle$. $D'$ is blind minimal iff every e-state in $S'$ is reachable and $S'$ contains no two distinct e-states $s'_1$ and $s'_2$ with $\tau(s'_1) = \tau(s'_2)$ and $\rho(s'_1) = \rho(s'_2)$.

Theorem 4   Let $D'$ be the translation of $D$ as in Definition 5. $D'$ is a blind minimal equivalent MDP for $D$.

Proof: See Appendix B. $\Box$
The size difference between the blind-minimal and minimal MDPs will depend on the precise interaction between rewards and dynamics for the problem at hand, making theoretical analyses difficult and experimental results rather anecdotal. However, our experiments in Section 5 and 6 will show that from a computation time point of view, it is often preferable to work with the blind-minimal MDP than to invest in the overhead of computing the truly minimal one. Finally, recall that syntactically different but semantically equivalent reward function specifications define the same e-state. Therefore, neither minimality nor blind minimality can be achieved in general without an equivalence check at least as complex as theorem proving for LTL. In pratical implementations, we avoid theorem proving in favour of embedding (fast) formula simplification in our progression and regression algorithms. This means that in principle we only approximate minimality and blind minimality, but this appears to be enough for practical purposes.
Sylvie Thiebaux 2006-01-20