Standing in contrast to evaluation-function learning methods based on approximating the theoretically optimal value function are what I call direct meta-optimization methods. Such methods assume a fixed parametric form for the evaluation function and optimize it directly with respect to the ultimate objective, sampled by Monte Carlo simulation. In symbols, given an evaluation function $\tilde{V}(\cdot\,;\mathbf{w})$ parametrized by weights $\mathbf{w}$, we seek to learn $\mathbf{w}$ by directly optimizing the meta-objective function
\[
  J(\mathbf{w}) \;=\; \mathrm{E}\bigl[\, C(\mathbf{w}) \,\bigr],
\]
where $C(\mathbf{w})$ denotes the final cost of a simulation run played out under the policy induced by $\tilde{V}(\cdot\,;\mathbf{w})$, and the expectation is estimated by Monte Carlo sampling of runs.
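As a minimal sketch of the idea, assuming a toy noisy-quadratic cost in place of a real domain simulator and hypothetical names (simulate, meta_objective, hill_climb), the following code scores each candidate weight vector by the average final cost of a batch of Monte Carlo runs and hill-climbs on that estimate; the search procedure itself is incidental and could be replaced by any derivative-free optimizer.

\begin{verbatim}
import numpy as np

def simulate(w, rng):
    """Stand-in for a domain simulator: run one episode under the policy
    induced by the evaluation function with weights w and return its final
    cost.  Here the cost surface is a noisy quadratic, purely for
    illustration."""
    target = np.array([1.0, -2.0, 0.5])
    return float(np.sum((w - target) ** 2) + rng.normal(scale=0.1))

def meta_objective(w, n_runs, rng):
    """Monte Carlo estimate of the expected final cost under weights w."""
    return float(np.mean([simulate(w, rng) for _ in range(n_runs)]))

def hill_climb(dim=3, iterations=200, step=0.2, n_runs=20, seed=0):
    """Optimize the weights directly against the sampled meta-objective."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    best = meta_objective(w, n_runs, rng)
    for _ in range(iterations):
        candidate = w + rng.normal(scale=step, size=dim)  # perturb weights
        cost = meta_objective(candidate, n_runs, rng)     # score by simulation
        if cost < best:                                   # keep improvements only
            w, best = candidate, cost
    return w, best

if __name__ == "__main__":
    w_star, cost = hill_climb()
    print("learned weights:", w_star, " estimated final cost:", cost)
\end{verbatim}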
The functions learned by such methods are not constrained by the Bellman equations: the evaluations they produce for any given state have no semantic interpretation akin to the definition of the value function in Equation 2. The lack of such constraints means that less information for training the evaluation function can be gleaned from each simulation run. The temporal-difference goal of explicitly caching values from lookahead search into the static evaluation function is discarded; only the final costs of completed simulation runs are available. For these reasons, the reinforcement-learning community has largely ignored the direct meta-optimization approach.
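To make the contrast in training signal concrete, here is a minimal sketch, again with hypothetical names and a linear evaluation function $\mathbf{w}\cdot\phi(s)$ assumed purely for illustration: a temporal-difference pass over one completed run produces one weight update per visited state, whereas direct meta-optimization extracts only the run's single final cost.

\begin{verbatim}
import numpy as np

def td_signal_per_run(features, final_cost, w, alpha=0.1):
    """TD(0) pass over one completed run with a linear evaluation function
    w . phi(s): every transition supplies its own training target, the
    bootstrapped value of the successor state; the terminal state is pinned
    to the observed final cost."""
    w = w.copy()
    n_updates = 0
    for t in range(len(features) - 1):
        target = float(np.dot(features[t + 1], w))   # bootstrap from successor
        error = target - float(np.dot(features[t], w))
        w = w + alpha * error * features[t]          # one update per visited state
        n_updates += 1
    error = final_cost - float(np.dot(features[-1], w))
    w = w + alpha * error * features[-1]
    return w, n_updates + 1

def meta_opt_signal_per_run(final_cost):
    """Direct meta-optimization: the whole run collapses to one scalar, the
    final cost, which scores the current weight vector as a unit."""
    return final_cost, 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run = [rng.normal(size=4) for _ in range(10)]    # feature vectors of one run
    _, n_td = td_signal_per_run(run, final_cost=3.0, w=np.zeros(4))
    _, n_mo = meta_opt_signal_per_run(final_cost=3.0)
    print("training signals per run -- TD:", n_td,
          " direct meta-optimization:", n_mo)
\end{verbatim}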
Nevertheless, I want to give this approach a fair comparison against VFA methods, for several reasons: