The learned value function is designed to reward states which are good starting places for algorithm A. We can optimize it using any method; the most straightforward is to apply the same graph-search procedure as A, but with the learned value function instead of f as the evaluation function. Whatever optimization procedure is chosen (call it B), performing it should produce a state y which is a good starting state for A; we can then run A starting from y to complete our search. We use the notation ``B;A'' to represent this concatenation of the two searches. Our ``STAGE'' algorithm, given in Appendix A.3, interleaves function-approximator training (by either batch fitting or TD(λ)) with running the two-stage learned optimization procedure B;A.
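To make the control flow concrete, here is a minimal Python sketch of B;A and of STAGE's outer loop, assuming a generic neighborhood structure; the helper names (hillclimb, stage, features, fit, random_start, neighbors) are illustrative rather than the notation of Appendix A.3. For brevity the sketch labels only each starting state with A's eventual outcome, rather than training on every state of A's trajectory as the TD(λ) variant would.

    def hillclimb(start, evaluate, neighbors, max_steps=10_000):
        """Greedy descent on `evaluate` from `start` (plays the role of A or B)."""
        x = start
        for _ in range(max_steps):
            best = min(neighbors(x), key=evaluate, default=x)
            if evaluate(best) >= evaluate(x):
                return x                 # no improving neighbor: a local minimum
            x = best
        return x

    def stage(f, features, neighbors, random_start, fit, n_iters=20):
        """Interleave value-function training with the two-stage search B;A.

        f            -- the objective to minimize
        features     -- maps a state to the feature vector used by the approximator
        fit          -- batch-fits a predictor V(features) ~ outcome of A
        random_start -- samples a fresh starting state
        """
        data, best = [], None
        x = random_start()
        for _ in range(n_iters):
            # Stage A: run the base search on f and record how well it did from x.
            x_end = hillclimb(x, f, neighbors)
            data.append((features(x), f(x_end)))
            if best is None or f(x_end) < f(best):
                best = x_end
            # Fit the value function that predicts A's outcome from a start state.
            V = fit(data)
            # Stage B: search on the learned V to find a promising new start; the
            # next pass of A then resumes from that state ("B;A").
            x = hillclimb(x_end, lambda s: V(features(s)), neighbors)
        return best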
The following example illustrates how STAGE can outperform the optimization method it is learning about. Consider minimizing the one-dimensional function f(x) depicted in Figure 4 over the domain X = [-10,10]. Assuming a graph structure on this domain in which only tiny moves to the left or right are allowed, hillclimbing search clearly leads to a suboptimal local minimum for all but the luckiest of starting points. However, the quality of the local minimum reached correlates strongly with the starting position x, making it possible to learn useful predictions.
Figure 4: A one-dimensional function minimization domain (left) and the value function which predicts hillclimbing's performance on that domain (right). The value function is approximately linear in the feature |x|. TD-learning easily identifies that structure, resulting in an improved evaluation function for guiding search to a global minimum.
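The correlation noted above is easy to verify in simulation. This excerpt does not reproduce the objective plotted in Figure 4, so the following Python sketch substitutes an assumed stand-in, f(x) = |x| + 2 sin(4x), chosen only because it shares the relevant qualitative properties: tiny left/right moves, many local minima, and minima whose quality grows roughly linearly with the starting feature |x|.

    import math
    import random

    def f(x):
        return abs(x) + 2.0 * math.sin(4.0 * x)   # assumed stand-in objective

    def neighbors(x, step=0.01):
        """Graph structure on X = [-10,10]: tiny moves to the left or right."""
        return [max(-10.0, x - step), min(10.0, x + step)]

    def hillclimb(x, evaluate):
        while True:
            best = min(neighbors(x), key=evaluate)
            if evaluate(best) >= evaluate(x):
                return x                           # stuck in a local minimum
            x = best

    random.seed(0)
    for _ in range(5):
        x0 = random.uniform(-10.0, 10.0)
        x_min = hillclimb(x0, f)
        print(f"start |x| = {abs(x0):5.2f}  ->  local minimum with f = {f(x_min):6.2f}")
    # The printed f-values track |x0| closely: exactly the structure that a value
    # function over the feature |x| can capture.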
We encode this problem for STAGE using a single input feature, |x|, which makes the value function approximately linear. Not surprisingly, STAGE with a linear function approximator performs efficiently on this problem: after only a few trajectories through the space, it learns an accurate linear fit to the value function. Hillclimbing on this learned evaluation function rather than on f(x) then leads directly to the global optimum.
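As a rough illustration of that behavior (again using the stand-in objective from the previous sketch, since the true f and the exact learned fit are not given here), the following sketch collects a few hillclimbing outcomes, fits a linear predictor on the single feature |x|, and then runs B;A: hillclimbing on the learned predictor followed by hillclimbing on f.

    import math
    import random

    def f(x):
        return abs(x) + 2.0 * math.sin(4.0 * x)    # stand-in objective, as above

    def neighbors(x, step=0.01):
        return [max(-10.0, x - step), min(10.0, x + step)]

    def hillclimb(x, evaluate):
        while True:
            best = min(neighbors(x), key=evaluate)
            if evaluate(best) >= evaluate(x):
                return x
            x = best

    # A few trajectories of plain hillclimbing give (feature, outcome) pairs.
    random.seed(1)
    starts = [random.uniform(-10.0, 10.0) for _ in range(8)]
    data = [(abs(x0), f(hillclimb(x0, f))) for x0 in starts]

    # Least-squares fit of outcome ~ a*|x| + b: a linear function approximator.
    n = len(data)
    mu = sum(u for u, _ in data) / n
    my = sum(y for _, y in data) / n
    a = sum((u - mu) * (y - my) for u, y in data) / sum((u - mu) ** 2 for u, _ in data)
    b = my - a * mu

    def V(x):
        return a * abs(x) + b                       # learned evaluation function

    # B;A: hillclimb on V (which drives x toward 0), then hillclimb on f from
    # there, which reaches the global minimum of the stand-in objective.
    x_b = hillclimb(random.uniform(-10.0, 10.0), V)
    x_final = hillclimb(x_b, f)
    print(f"learned V(x) = {a:.2f}*|x| {b:+.2f};  B;A reaches f = {f(x_final):.2f}")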
This problem is contrived, but we believe its essential property--that features of the state help to predict the performance of an optimizer--does hold in many practical domains. We have obtained excellent preliminary results in the domain of VLSI channel routing, where the learned evaluation function soundly outperforms the hand-tuned function of Equation 1. Details of these results are given in Appendix B.2.
Unlike the work of Zhang and Dietterich, which learns a search strategy from scratch using online value iteration, this method assumes an already-given search strategy for the domain and uses prediction to learn to improve on it. This simpler approach may serve as a better starting point for analysis.
My thesis will develop the STAGE algorithm further and apply it to the optimization domains of Section 5. One interesting direction for exploration is to apply STAGE recursively to itself, yielding an automatically generated multi-stage optimization algorithm. These stages, which interleave policy evaluation with execution, are reminiscent of the iterations of policy iteration; my thesis will elaborate on this relationship.