Standing in contrast to evaluation-function learning methods based on approximating the theoretically optimal value function are what I call direct meta-optimization methods. Such methods assume a fixed parametric form for the evaluation function and optimize it directly with respect to the ultimate objective, sampled by Monte Carlo simulation. In symbols, given an evaluation function $\tilde{V}(\cdot\,;\mathbf{w})$ parametrized by weights $\mathbf{w}$, we seek to learn $\mathbf{w}$ by directly optimizing the meta-objective function
\[
  J(\mathbf{w}) \;=\; \mathrm{E}\bigl[\, C(\mathbf{w}) \,\bigr],
\]
where $C(\mathbf{w})$ denotes the final cost of a simulation run played out under the policy induced by $\tilde{V}(\cdot\,;\mathbf{w})$, and the expectation is estimated by Monte Carlo sampling of runs.
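As a minimal sketch of the idea, assuming a toy noisy-quadratic cost in place of a real domain simulator and hypothetical names (simulate, meta_objective, hill_climb), the following code scores each candidate weight vector by the average final cost of a batch of Monte Carlo runs and hill-climbs on that estimate; the search procedure itself is incidental and could be replaced by any derivative-free optimizer.

\begin{verbatim}
import numpy as np

def simulate(w, rng):
    """Stand-in for a domain simulator: run one episode under the policy
    induced by the evaluation function with weights w and return its final
    cost.  Here the cost surface is a noisy quadratic, purely for
    illustration."""
    target = np.array([1.0, -2.0, 0.5])
    return float(np.sum((w - target) ** 2) + rng.normal(scale=0.1))

def meta_objective(w, n_runs, rng):
    """Monte Carlo estimate of the expected final cost under weights w."""
    return float(np.mean([simulate(w, rng) for _ in range(n_runs)]))

def hill_climb(dim=3, iterations=200, step=0.2, n_runs=20, seed=0):
    """Optimize the weights directly against the sampled meta-objective."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    best = meta_objective(w, n_runs, rng)
    for _ in range(iterations):
        candidate = w + rng.normal(scale=step, size=dim)  # perturb weights
        cost = meta_objective(candidate, n_runs, rng)     # score by simulation
        if cost < best:                                   # keep improvements only
            w, best = candidate, cost
    return w, best

if __name__ == "__main__":
    w_star, cost = hill_climb()
    print("learned weights:", w_star, " estimated final cost:", cost)
\end{verbatim}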
The functions learned by such methods are not constrained by the Bellman equations: the evaluations they produce for any given state have no semantic interpretation akin to the definition of the value function in Equation 2. The lack of such constraints means that less information for training the evaluation function can be gleaned from each simulation run. The temporal-difference goal of explicitly caching values from lookahead search into the static evaluation function is discarded; only the final costs of completed simulation runs are available. For these reasons, the reinforcement-learning community has largely ignored the direct meta-optimization approach.
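To make the contrast in training signal concrete, here is a minimal sketch, again with hypothetical names and a linear evaluation function $\mathbf{w}\cdot\phi(s)$ assumed purely for illustration: a temporal-difference pass over one completed run produces one weight update per visited state, whereas direct meta-optimization extracts only the run's single final cost.

\begin{verbatim}
import numpy as np

def td_signal_per_run(features, final_cost, w, alpha=0.1):
    """TD(0) pass over one completed run with a linear evaluation function
    w . phi(s): every transition supplies its own training target, the
    bootstrapped value of the successor state; the terminal state is pinned
    to the observed final cost."""
    w = w.copy()
    n_updates = 0
    for t in range(len(features) - 1):
        target = float(np.dot(features[t + 1], w))   # bootstrap from successor
        error = target - float(np.dot(features[t], w))
        w = w + alpha * error * features[t]          # one update per visited state
        n_updates += 1
    error = final_cost - float(np.dot(features[-1], w))
    w = w + alpha * error * features[-1]
    return w, n_updates + 1

def meta_opt_signal_per_run(final_cost):
    """Direct meta-optimization: the whole run collapses to one scalar, the
    final cost, which scores the current weight vector as a unit."""
    return final_cost, 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run = [rng.normal(size=4) for _ in range(10)]    # feature vectors of one run
    _, n_td = td_signal_per_run(run, final_cost=3.0, w=np.zeros(4))
    _, n_mo = meta_opt_signal_per_run(final_cost=3.0)
    print("training signals per run -- TD:", n_td,
          " direct meta-optimization:", n_mo)
\end{verbatim}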
Nevertheless, I want to give this approach a fair comparison against VFA methods, for several reasons: