Standing in contrast to evaluation function learning methods based on
approximating the theoretically-optimal value function are what I call
direct meta-optimization methods. Such methods assume a fixed
parametric form for the evaluation function and optimize it directly
with respect to the ultimate objective, sampled by Monte Carlo
simulation. In symbols, given an evaluation function $\tilde V(x; \mathbf{w})$
parametrized by weights $\mathbf{w}$, we seek to learn $\mathbf{w}$ by
directly optimizing the meta-objective function
$$ \mathrm{Obj}(\mathbf{w}) \;=\; \mathbb{E}\bigl[\,\text{final cost of a simulation run guided by } \tilde V(\cdot\,;\mathbf{w})\,\bigr]. $$
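As a concrete illustration, the sketch below shows one possible instantiation of this idea in Python: stochastic hill climbing in weight space, where each candidate weight vector is scored only by the Monte Carlo average of final costs over complete simulation runs. The names (`simulate_run`, `estimate_objective`, `direct_meta_optimize`) and all parameter values are hypothetical, chosen for illustration rather than taken from this document; any black-box optimizer over the weights could take the place of the hill climber.

```python
import random

# Assumed problem interface (hypothetical): simulate_run(weights) performs one
# complete simulation run guided by the evaluation function with the given
# weights and returns the final cost of that run.

def estimate_objective(simulate_run, weights, n_runs=20):
    """Monte Carlo estimate of Obj(w): mean final cost over n_runs runs."""
    return sum(simulate_run(weights) for _ in range(n_runs)) / n_runs

def direct_meta_optimize(simulate_run, n_weights, iterations=200,
                         step=0.1, n_runs=20, seed=0):
    """Stochastic hill climbing over the weight vector, using only the
    sampled final costs of completed runs as the training signal."""
    rng = random.Random(seed)
    w = [0.0] * n_weights
    best_cost = estimate_objective(simulate_run, w, n_runs)
    for _ in range(iterations):
        # Propose a Gaussian perturbation of the current weights.
        candidate = [wi + rng.gauss(0.0, step) for wi in w]
        cost = estimate_objective(simulate_run, candidate, n_runs)
        # Accept the candidate only if its (noisy) estimate improves.
        if cost < best_cost:
            w, best_cost = candidate, cost
    return w, best_cost
```

Note that nothing in this loop examines the individual states visited during a run; the only feedback is the scalar cost observed when each run ends.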
The functions learned by such methods are not constrained by the
Bellman equations: the evaluations they produce for any given state
have no semantic interpretation akin to the definition of the optimal
value function in Equation 2. The lack of such
constraints means that less information for training the function can
be gleaned from a simulation run. The temporal-difference goal of
explicitly caching values from lookahead search into the static
evaluation function is discarded; only the final costs of completed
simulation runs are available. For these reasons, the
reinforcement-learning community has largely ignored the direct
meta-optimization approach.
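To make the contrast concrete, consider the training signal a temporal-difference method extracts. In a generic, undiscounted TD(0) update with function approximation (written here purely for illustration, not in this document's notation), every observed transition from state $x_t$ to $x_{t+1}$ with immediate cost $c_t$ adjusts the weights:
$$ \mathbf{w} \;\leftarrow\; \mathbf{w} + \alpha \bigl( c_t + \tilde V(x_{t+1};\mathbf{w}) - \tilde V(x_t;\mathbf{w}) \bigr)\, \nabla_{\mathbf{w}} \tilde V(x_t;\mathbf{w}), $$
so a single run of $T$ steps yields $T$ training signals. Direct meta-optimization, by contrast, extracts exactly one scalar from the same run: its final cost.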
Nevertheless, I want to give this approach a fair comparison against VFA methods, for several reasons: