
VFA in Acyclic Domains: ``ROUT''

 

I have presented two reasons why working strictly backwards may be desirable: efficiency, because updates need only be done on the ``frontier'' rather than all over the state space; and robustness, because correct $J^*$ values, once assigned, need never again be changed. I have therefore investigated generalizations of the Dijkstra and DAG-SP algorithms specifically modified to accommodate huge state spaces and value function approximation. My variant of Dijkstra's algorithm, called Grow-Support, was presented in [Boyan and Moore, 1995] and will not be explored further in this thesis. My variant of DAG-SP is an algorithm called ROUT [Boyan and Moore, 1996], which this thesis will develop further.

In the huge domains for which ROUT is designed, DAG-SP's key preprocessing step--topologically sorting the entire state space--is no longer tractable. Instead, ROUT must expend some extra effort to identify states on the current frontier. Once identified (as described below), a frontier state is assigned its optimal $J^*$ value by a simple one-step backup, and this {state $\rightarrow$ value} pair is added to a training set for a function approximator. Thus, ROUT's main loop consists of identifying a frontier state; determining its $J^*$ value; and retraining the approximator (see Appendix A.2). The training set, constructed adaptively, grows backwards from the goal.
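
To make the main loop concrete, here is a minimal Python sketch written under assumptions that go beyond the text: a deterministic, acyclic, cost-minimizing MDP exposed through a hypothetical mdp object (with methods actions, successor, cost, and is_goal), an approximator J_approx offering fit and predict, and the frontier-hunting subroutine introduced below (itself sketched after Figure 2.3). All of these names are illustrative rather than the thesis's actual interface, and the stopping test (quit once the start state itself is self-consistent) is an assumption of the sketch.

def rout(mdp, start_state, J_approx, policy, H=20, epsilon=0.05):
    """Sketch of ROUT's outer loop: hunt a frontier state, back it up, retrain."""
    training_set = []                                   # grows backwards from the goal
    while not self_consistent(start_state, mdp, J_approx, epsilon):
        x = hunt_frontier_state(start_state, mdp, J_approx, policy, H, epsilon)
        target = one_step_backup(x, mdp, J_approx)      # optimal value, given correct successors
        training_set.append((x, target))
        J_approx.fit(training_set)                      # retrain on the enlarged training set
    return J_approx

def one_step_backup(x, mdp, J_approx):
    """J*(x) from a single backup, assuming J_approx is already correct on all
    states reachable from x (cost-minimization convention assumed here)."""
    if mdp.is_goal(x):
        return 0.0
    return min(mdp.cost(x, a) + J_approx.predict(mdp.successor(x, a))
               for a in mdp.actions(x))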

ROUT's key subroutine, HUNTFRONTIERSTATE, is responsible for identifying a good state x to add to the training set. In particular:

  1. All states reachable from x should already have their $J^*$ values correctly approximated by the function approximator. This ensures that the policy from x onward is optimal, and that a correct target value for $J^*(x)$ can be assigned.
  2. x itself should not already have its $J^*$ value correctly approximated. This condition aims to keep the training set as small as possible, by excluding states whose values are correct anyway thanks to good generalization.
  3. x should be a state that we care to learn about. For that reason, ROUT considers only states which occur on trajectories emanating from a starting state of the MDP.
The HUNTFRONTIERSTATE operation returns a state that, with high probability, satisfies these properties. It begins with some state x and generates a number of trajectories from x, each time checking whether all states along the trajectory are self-consistent (i.e., satisfy Equation 4 to some tolerance $\epsilon$). If all states after x on all sample trajectories are self-consistent, then x is deemed ready, and ROUT adds x to its training set. If, on the other hand, a trajectory from x reveals any inconsistencies in the approximated value function, then we flag the last inconsistent state on that trajectory and restart HUNTFRONTIERSTATE from there. Figure 2.3 illustrates how the routine works.

Figure 2.3: A schematic of ROUT working on an acyclic two-dimensional navigation domain, where the allowable actions are only $\rightarrow$, $\uparrow$, and $\nearrow$. Suppose that ROUT has thus far established training values for $J^*$ at the triangles, and that the function approximator has successfully generalized $J^*$ throughout the shaded region. Now, when HUNTFRONTIERSTATE generates a trajectory from the start state to termination (solid line), it finds that several states along that trajectory are inconsistent (marked by crosses). The last such cross becomes the new starting point for HUNTFRONTIERSTATE. From there, all trajectories generated (dashed lines) are fully self-consistent, so that state gets added to ROUT's training set. When the function approximator is re-trained, the shaded region of validity should grow backwards from the goal.
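
The following Python sketch shows one way HUNTFRONTIERSTATE and the self-consistency test could be realized, under the same hypothetical mdp and J_approx interface as in the sketch above; policy is an assumed action-selection rule (e.g., greedy with occasional random choices) so that repeated trajectories from the same state can differ, and H and epsilon are the two ROUT parameters discussed next.

def hunt_frontier_state(x0, mdp, J_approx, policy, H=20, epsilon=0.05):
    """Return a state all of whose sampled successors look self-consistent."""
    x = x0
    while True:
        flagged = None
        for _ in range(H):                         # sample H trajectories from x
            state = x
            while not mdp.is_goal(state):          # walk one trajectory to termination
                state = mdp.successor(state, policy(state))
                if not self_consistent(state, mdp, J_approx, epsilon):
                    flagged = state                # remember the last inconsistent state
            if flagged is not None:
                break                              # this trajectory revealed an inconsistency
        if flagged is None:
            return x                               # all states after x check out: x is ready
        x = flagged                                # restart the hunt from the flagged state

def self_consistent(x, mdp, J_approx, epsilon):
    """Bellman-residual test (Equation 4), satisfied to within tolerance epsilon."""
    if mdp.is_goal(x):                             # goal cost assumed to be zero
        return abs(J_approx.predict(x)) <= epsilon
    lookahead = min(mdp.cost(x, a) + J_approx.predict(mdp.successor(x, a))
                    for a in mdp.actions(x))
    return abs(J_approx.predict(x) - lookahead) <= epsilon

Because the domain is acyclic, each flagged state lies strictly downstream of the previous hunt point, so the restarts make progress toward the goal.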

The parameters of the ROUT algorithm are H, the number of trajectories generated to certify a state's readiness, and $\epsilon$, the tolerated Bellman residual. ROUT's convergence to the optimal $J^*$, assuming the function approximator can fit the $J^*$ training set perfectly, can be guaranteed in the limiting case where $H \rightarrow \infty$ (assuring exploration of all states reachable from x) and $\epsilon \rightarrow 0$. In practice, of course, we want to be tolerant of some approximation error. Typical settings I used were H=20 and $\epsilon$ set to roughly 5% of the range of $J^*$.
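
As a purely hypothetical usage note for the sketches above, these settings would translate to a call such as the following, where J_min and J_max stand for assumed known bounds on the values arising in the domain.

J_range = J_max - J_min                # assumed spread of J* values in the domain
J_approx = rout(mdp, start_state, J_approx, policy,
                H=20, epsilon=0.05 * J_range)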

In experiments on two medium-sized domains, ROUT learned evaluation functions that were as good as or better than those learned by TD($\lambda$) with the same neural-net approximator, and it used an order of magnitude less training data to do so. These results are detailed in Appendix B.1.

When a function approximator is capable of fitting $J^*$, ROUT will, in the limit, find it. However, for ROUT to be efficient, the frontier must grow backward from the goal quickly, and this depends on good extrapolation from the training set. When good extrapolation does not occur, ROUT may get stuck, repeatedly adding points near the goal region and never progressing backwards. Some function approximators may be especially well suited to the extrapolation from accurate training data that ROUT requires, and I plan to explore this. I will also explore algorithmic refinements to make ROUT more robust in general. One such refinement is to adapt the tolerance level $\epsilon$, thereby guaranteeing progress at the expense of accuracy.

With such improvements, I hope to develop ROUT into the algorithm of choice for learning evaluation functions for large-scale acyclic domains. I will validate it empirically on the control problem of factory scheduling (Section 5.3), the game of backgammon, and a variety of combinatorial optimization problems formulated as acyclic MDPs, as described below.

